bug#33205: 26.1; unibyte/multibyte missing in rx.el

unofficial mirror of bug-gnu-emacs@gnu.org 
 help / color / mirror / code / Atom feed

* bug#33205: 26.1; unibyte/multibyte missing in rx.el
@ 2018-10-30 15:03 Mattias Engdegård
  2018-10-30 17:27 ` Eli Zaretskii
  0 siblings, 1 reply; 17+ messages in thread
From: Mattias Engdegård @ 2018-10-30 15:03 UTC (permalink / raw)
  To: 33205

rx.el has constructs corresponding to all named regexp character
classes ([[:alnum:]], [[:digit:]], etc) except unibyte and multibyte.
This looks like a simple omission.

Or is it on purpose? The ascii and nonascii classes appear very
similar; I haven't been able to see any operational difference from
unibyte and multibyte, respectively. In fact, neither seem to work as
expected on unibyte strings or buffers:

(setq s "A\310")
"A\310"
(multibyte-string-p s)
nil
(string-match-p "A[[:nonascii:]]" s)
nil
(string-match-p "A[[:ascii:]]" s)
nil
(string-match-p "A[[:unibyte:]]" s)
nil
(string-match-p "A[[:multibyte:]]" s)
nil
(string-match-p "A." s)
0

What is going on here? ascii/nonascii and unibyte/multibyte are
supposed to be complementary; if both fail, it's because there is
nothing to match. Yet . matches.

(By the way, you may want to fix a trivial typo in a doc string in
rx.el while you are at it: `indian-tow-byte')

^ permalink raw reply	[flat|nested] 17+ messages in thread

* bug#33205: 26.1; unibyte/multibyte missing in rx.el
  2018-10-30 15:03 bug#33205: 26.1; unibyte/multibyte missing in rx.el Mattias Engdegård
@ 2018-10-30 17:27 ` Eli Zaretskii
  2018-10-31 15:27   ` Mattias Engdegård
  0 siblings, 1 reply; 17+ messages in thread
From: Eli Zaretskii @ 2018-10-30 17:27 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: 33205

> From: Mattias Engdegård <mattiase@acm.org>
> Date: Tue, 30 Oct 2018 16:03:28 +0100
> 
> (setq s "A\310")
> "A\310"
> (multibyte-string-p s)
> nil
> (string-match-p "A[[:nonascii:]]" s)
> nil
> (string-match-p "A[[:ascii:]]" s)
> nil
> (string-match-p "A[[:unibyte:]]" s)
> nil
> (string-match-p "A[[:multibyte:]]" s)
> nil
> (string-match-p "A." s)
> 0
> 
> What is going on here? ascii/nonascii and unibyte/multibyte are
> supposed to be complementary; if both fail, it's because there is
> nothing to match. Yet . matches.

I think it's a documentation bug: [:unibyte:] matches only ASCII
characters.  IOW, it tests "unibyteness" in the internal
representation (which might be surprising, I know).

And [:nonascii:] is only defined for multibyte characters.

> (By the way, you may want to fix a trivial typo in a doc string in
> rx.el while you are at it: `indian-tow-byte')

Done, thanks.





^ permalink raw reply	[flat|nested] 17+ messages in thread

* bug#33205: 26.1; unibyte/multibyte missing in rx.el
  2018-10-30 17:27 ` Eli Zaretskii
@ 2018-10-31 15:27   ` Mattias Engdegård
  2018-10-31 15:55     ` Eli Zaretskii
  0 siblings, 1 reply; 17+ messages in thread
From: Mattias Engdegård @ 2018-10-31 15:27 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 33205

tis 2018-10-30 klockan 19:27 +0200 skrev Eli Zaretskii:
> I think it's a documentation bug: [:unibyte:] matches only ASCII
> characters.  IOW, it tests "unibyteness" in the internal
> representation (which might be surprising, I know).
> 
> And [:nonascii:] is only defined for multibyte characters.

Thus [:ascii:]/[:nonascii:] cannot be distinguished from
[:unibyte:]/[:multibyte:]. Surely this cannot have been the intention?
Perhaps it's a relic from an earlier implementation. The code certainly
differs (IS_REAL_ASCII vs ISUNIBYTE).

Taking a step back: Do you agree that the missing unibyte/multibyte
should be added to rx, or do you feel that their current relative
uselessness would have them better stay out of it? (I'm neutral on the
subject.)

If there is a useful interpretation of [:unibyte:]/[:multibyte:] today,
perhaps we could make them behave that way. 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* bug#33205: 26.1; unibyte/multibyte missing in rx.el
  2018-10-31 15:27   ` Mattias Engdegård
@ 2018-10-31 15:55     ` Eli Zaretskii
  2018-11-05 16:49       ` Eli Zaretskii
  0 siblings, 1 reply; 17+ messages in thread
From: Eli Zaretskii @ 2018-10-31 15:55 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: 33205

> From: Mattias Engdegård <mattiase@acm.org>
> Cc: 33205@debbugs.gnu.org
> Date: Wed, 31 Oct 2018 16:27:53 +0100
> 
> tis 2018-10-30 klockan 19:27 +0200 skrev Eli Zaretskii:
> > I think it's a documentation bug: [:unibyte:] matches only ASCII
> > characters.  IOW, it tests "unibyteness" in the internal
> > representation (which might be surprising, I know).
> > 
> > And [:nonascii:] is only defined for multibyte characters.
> 
> Thus [:ascii:]/[:nonascii:] cannot be distinguished from
> [:unibyte:]/[:multibyte:]. Surely this cannot have been the intention?

I actually looked into this some more, and I think my original
conclusion was wrong.  Let me dwell on that a bit more, and I will
report what I found.  We can then revisit the questions you ask above.

> Taking a step back: Do you agree that the missing unibyte/multibyte
> should be added to rx

I think it depends on what we find regarding the functionality.  It's
possible that it makes no real sense in the context of rx, for example
(although it indeed sounds like an omission).

> If there is a useful interpretation of [:unibyte:]/[:multibyte:] today,
> perhaps we could make them behave that way. 

Right.  Stay tuned, and thanks for pointing out this surprising
behavior.





^ permalink raw reply	[flat|nested] 17+ messages in thread

* bug#33205: 26.1; unibyte/multibyte missing in rx.el
  2018-10-31 15:55     ` Eli Zaretskii
@ 2018-11-05 16:49       ` Eli Zaretskii
  2018-11-07 18:08         ` Mattias Engdegård
  0 siblings, 1 reply; 17+ messages in thread
From: Eli Zaretskii @ 2018-11-05 16:49 UTC (permalink / raw)
  To: mattiase; +Cc: 33205

> Date: Wed, 31 Oct 2018 17:55:08 +0200
> From: Eli Zaretskii <eliz@gnu.org>
> Cc: 33205@debbugs.gnu.org
> 
> > From: Mattias Engdegård <mattiase@acm.org>
> > Cc: 33205@debbugs.gnu.org
> > Date: Wed, 31 Oct 2018 16:27:53 +0100
> > 
> > tis 2018-10-30 klockan 19:27 +0200 skrev Eli Zaretskii:
> > > I think it's a documentation bug: [:unibyte:] matches only ASCII
> > > characters.  IOW, it tests "unibyteness" in the internal
> > > representation (which might be surprising, I know).
> > > 
> > > And [:nonascii:] is only defined for multibyte characters.
> > 
> > Thus [:ascii:]/[:nonascii:] cannot be distinguished from
> > [:unibyte:]/[:multibyte:]. Surely this cannot have been the intention?
> 
> I actually looked into this some more, and I think my original
> conclusion was wrong.  Let me dwell on that a bit more, and I will
> report what I found.  We can then revisit the questions you ask above.

After looking into this, my conclusion is that what I wrote above was
not too wrong.  Indeed, currently [:ascii:]/[:nonascii:] cannot be
distinguished from [:unibyte:]/[:multibyte:].  In a nutshell, it turns
out [:unibyte:] is not what one might think it is, you can see that in
re_wctype_to_bit, for example.

Thinking about this and looking at the code, I'd say that support of
named character classes is heavily biased in favor of multibyte text,
not to say supports _only_ multibyte text.  So searching unibyte
strings and unibyte buffers for the likes of [:unibyte:] will only
find ASCII characters.

In multibyte buffers and strings, unibyte characters are stored in
their multibyte representation, so it is no longer trivial to define
what does [:unibyte:] mean.  However, I discovered that there's a
workaround for what you are trying to do: use ^[:multibyte:] instead
of [:unibyte:].  Observe:

  (setq s "A\310") => "A\310"
  (string-match-p "A[[:ascii:]]" s) => nil
  (string-match-p "A[[:nonascii:]]" s) => nil
  (string-match-p "A[^[:ascii:]]" s) => 0      ;; !!!
  (string-match-p "A[[:unibyte:]]" s) => nil
  (string-match-p "A[^[:multibyte:]]" s) => 0  ;; !!!

That ^[:ascii:] is not the same as [:nonascii:], and the same with
[:unibyte:] vs ^[:multibyte:], is arguably a bug.  The reason for that
becomes clear if you look at how we generate the fastmap in each of
these cases and how we set the bits in the work-area of the range
table, but I don't know enough to say how easy would it be to fix
that.

An alternative is to use an explicit character class, as in \000-\377,
that works as you'd expect.

> > Taking a step back: Do you agree that the missing unibyte/multibyte
> > should be added to rx
> 
> I think it depends on what we find regarding the functionality.  It's
> possible that it makes no real sense in the context of rx, for example
> (although it indeed sounds like an omission).
> 
> > If there is a useful interpretation of [:unibyte:]/[:multibyte:] today,
> > perhaps we could make them behave that way. 
> 
> Right.  Stay tuned, and thanks for pointing out this surprising
> behavior.

Well, what do you think now?  Is it worth adding those to rx.el?  I'm
not sure.  How important is it to find unibyte characters in a string,
anyway?

^ permalink raw reply	[flat|nested] 17+ messages in thread

* bug#33205: 26.1; unibyte/multibyte missing in rx.el
  2018-11-05 16:49       ` Eli Zaretskii
@ 2018-11-07 18:08         ` Mattias Engdegård
  2018-11-07 19:10           ` Eli Zaretskii
  0 siblings, 1 reply; 17+ messages in thread
From: Mattias Engdegård @ 2018-11-07 18:08 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 33205

5 nov. 2018 kl. 17.49 skrev Eli Zaretskii <eliz@gnu.org>:
> After looking into this, my conclusion is that what I wrote above was
> not too wrong.  Indeed, currently [:ascii:]/[:nonascii:] cannot be
> distinguished from [:unibyte:]/[:multibyte:].  In a nutshell, it turns
> out [:unibyte:] is not what one might think it is, you can see that in
> re_wctype_to_bit, for example.

Thank you very much for taking your time to look at this, and for the detailed answer.
My apologies for severely complicating what I initially thought was quite a trifle!

> That ^[:ascii:] is not the same as [:nonascii:], and the same with
> [:unibyte:] vs ^[:multibyte:], is arguably a bug.  The reason for that
> becomes clear if you look at how we generate the fastmap in each of
> these cases and how we set the bits in the work-area of the range
> table, but I don't know enough to say how easy would it be to fix
> that.
> 
> An alternative is to use an explicit character class, as in \000-\377,
> that works as you'd expect.

I'm not sure what I expected [\000-\377] to mean in a multibyte string; one endpoint is ASCII and the other is a raw byte. It does work, as you noted, because two ranges are generated, as if written [\000-\177\200-\377].

In old Emacs versions (I tried 22.1.1), [:unibyte:] appears to include raw bytes in multibyte strings/buffers, and everything in unibyte strings/buffers (aka [\000-\377] in both cases), and [:multibyte:] the complement of that. Thus, at some point the behaviour changed, but I cannot find any NEWS reference to it. It could have been an accident.
Perhaps those char classes didn't see much use.

The old behaviour seems a little more intuitive, but it must be rare to need regex matching of rubbish bytes in multibyte strings. If you could argue that the status quo is fine then I wouldn't necessarily object, but would suggest that at least the code be made explicit about it (and the documentation, as well).

> Well, what do you think now?  Is it worth adding those to rx.el? I'm
> not sure.  How important is it to find unibyte characters in a string,
> anyway?

Unless we manage to make [:unibyte:]/[:multibyte:] more useful in their own right, it's fine to leave rx.el as is, as far as I'm concerned. There is no loss of expressivity.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* bug#33205: 26.1; unibyte/multibyte missing in rx.el
  2018-11-07 18:08         ` Mattias Engdegård
@ 2018-11-07 19:10           ` Eli Zaretskii
  2018-11-07 20:19             ` Mattias Engdegård
  2018-11-19 20:07             ` Mattias Engdegård
  0 siblings, 2 replies; 17+ messages in thread
From: Eli Zaretskii @ 2018-11-07 19:10 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: 33205

> From: Mattias Engdegård <mattiase@acm.org>
> Date: Wed, 7 Nov 2018 19:08:43 +0100
> Cc: 33205@debbugs.gnu.org
> 
> I'm not sure what I expected [\000-\377] to mean in a multibyte string; one endpoint is ASCII and the other is a raw byte. It does work, as you noted, because two ranges are generated, as if written [\000-\177\200-\377].

Octal escapes usually mean raw bytes.  Cf the fact that you used \310
in your original recipe.  So the above is expected to match raw bytes
and ASCII characters, i.e. what [:unibyte:] should probably stand for.

> In old Emacs versions (I tried 22.1.1), [:unibyte:] appears to include raw bytes in multibyte strings/buffers, and everything in unibyte strings/buffers (aka [\000-\377] in both cases), and [:multibyte:] the complement of that. Thus, at some point the behaviour changed, but I cannot find any NEWS reference to it. It could have been an accident.

Almost everything regarding internals of unibyte and multibyte
characters changed when we switched to UTF-8 superset as internal
representation.  One consequence is that raw bytes are no longer
represented as themselves in buffers and strings.

> Perhaps those char classes didn't see much use.

Definitely not.  I cannot even think of a practical use case for them
nowadays.

> The old behaviour seems a little more intuitive, but it must be rare to need regex matching of rubbish bytes in multibyte strings. If you could argue that the status quo is fine then I wouldn't necessarily object, but would suggest that at least the code be made explicit about it (and the documentation, as well).

I can fix the docs, but I don't think I understand what would you like
to do about the code.

> > Well, what do you think now?  Is it worth adding those to rx.el? I'm
> > not sure.  How important is it to find unibyte characters in a string,
> > anyway?
> 
> Unless we manage to make [:unibyte:]/[:multibyte:] more useful in their own right, it's fine to leave rx.el as is, as far as I'm concerned. There is no loss of expressivity.

OK.





^ permalink raw reply	[flat|nested] 17+ messages in thread

* bug#33205: 26.1; unibyte/multibyte missing in rx.el
  2018-11-07 19:10           ` Eli Zaretskii
@ 2018-11-07 20:19             ` Mattias Engdegård
  2018-11-19 20:07             ` Mattias Engdegård
  1 sibling, 0 replies; 17+ messages in thread
From: Mattias Engdegård @ 2018-11-07 20:19 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 33205

7 nov. 2018 kl. 20.10 skrev Eli Zaretskii <eliz@gnu.org>:
>> Perhaps those char classes didn't see much use.
> 
> Definitely not.  I cannot even think of a practical use case for them
> nowadays.

But were they useful back then, when they were added? If so, for what? Maybe it's been lost in the mists of time.

> 
>> The old behaviour seems a little more intuitive, but it must be rare to need regex matching of rubbish bytes in multibyte strings. If you could argue that the status quo is fine then I wouldn't necessarily object, but would suggest that at least the code be made explicit about it (and the documentation, as well).
> 
> I can fix the docs, but I don't think I understand what would you like
> to do about the code.

If we are content with [:unibyte:]/[:multibyte:] = [:ascii:]/[:nonascii:], then it would be nice if the code were obvious about it. Right now, ISUNIBYTE and IS_REAL_ASCII differ, and it takes some digging to realise that they have the same effect. Removing RECC_UNIBYTE/RECC_MULTIBYTE entirely and use RECC_ASCII/RECC_NONASCII throughout would make the semantics clear.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* bug#33205: 26.1; unibyte/multibyte missing in rx.el
  2018-11-07 19:10           ` Eli Zaretskii
  2018-11-07 20:19             ` Mattias Engdegård
@ 2018-11-19 20:07             ` Mattias Engdegård
  2018-12-08  8:56               ` Eli Zaretskii
  2018-12-29  9:23               ` Eli Zaretskii
  1 sibling, 2 replies; 17+ messages in thread
From: Mattias Engdegård @ 2018-11-19 20:07 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 33205

[-- Attachment #1: Type: text/plain, Size: 652 bytes --]

I tried using rx to match raw bytes. (rx (any (?\200 . ?\377))) doesn't work, since that is translated to the corresponding Unicode range; (any (#x3fff80 . #x3fffff)) must be used instead. Maybe that is evident, or would it merit a mention in the doc string?

The alternative formulation (rx (any "\200-\377")) doesn't work either, and this seems to be a bug. Looking at rx-check-any-string, a second bug is revealed: the code uses the regex ".-." to pick out ranges, which means that \n cannot be a range endpoint.

Perhaps you want me to open a new bug for the above? I'm attaching a patch all the same, but you may prefer doing it differently.

[-- Attachment #2: rx-any-raw-bytes.patch --]
[-- Type: application/octet-stream, Size: 3680 bytes --]

 lisp/emacs-lisp/rx.el            | 49 ++++++++++++++++++++++------------------
 test/lisp/emacs-lisp/rx-tests.el | 22 ++++++++++++++++++
 2 files changed, 49 insertions(+), 22 deletions(-)

diff --git a/lisp/emacs-lisp/rx.el b/lisp/emacs-lisp/rx.el
index 1230df4f15..04069e8d50 100644
--- a/lisp/emacs-lisp/rx.el
+++ b/lisp/emacs-lisp/rx.el
@@ -449,28 +449,33 @@ Only both edges of each range is checked."
 
 
 (defun rx-check-any-string (str)
-  "Check string argument STR for Rx `any'."
-  (let ((i 0)
-	c1 c2 l)
-    (if (= 0 (length str))
-	(error "String arg for Rx `any' must not be empty"))
-    (while (string-match ".-." str i)
-      ;; string before range: convert it to characters
-      (if (< i (match-beginning 0))
-	  (setq l (nconc
-		   l
-		   (append (substring str i (match-beginning 0)) nil))))
-      ;; range
-      (setq i (match-end 0)
-	    c1 (aref str (match-beginning 0))
-	    c2 (aref str (1- i)))
-      (cond
-       ((< c1 c2) (setq l (nconc l (list (cons c1 c2)))))
-       ((= c1 c2) (setq l (nconc l (list c1))))))
-    ;; rest?
-    (if (< i (length str))
-	(setq l (nconc l (append (substring str i) nil))))
-    l))
+  "Turn a string argument to `any' into a list of characters and, representing
+ranges, dotted pairs of characters. The original order is not preserved."
+  (let ((decode-char
+         ;; Make sure raw bytes are decoded as such, to avoid confusion with
+         ;; U+0080..U+00FF.
+         (if (multibyte-string-p str)
+             #'identity
+           (lambda (c) (if (and (>= c #x80) (<= c #xff))
+                           (+ c #x3fff00)
+                         c))))
+        (len (length str))
+        (i 0)
+        (ret nil))
+    (while (< i len)
+      (cond ((and (< i (- len 2))
+                  (= (aref str (+ i 1)) ?-))
+             ;; Range.
+             (let ((start (funcall decode-char (aref str i)))
+                   (end   (funcall decode-char (aref str (+ i 2)))))
+               (cond ((< start end) (push (cons start end) ret))
+                     ((= start end) (push start ret)))
+               (setq i (+ i 3))))
+            (t
+             ;; Single character.
+             (push (funcall decode-char (aref str i)) ret)
+             (setq i (+ i 1)))))
+    ret))
 
 
 (defun rx-check-any (arg)
diff --git a/test/lisp/emacs-lisp/rx-tests.el b/test/lisp/emacs-lisp/rx-tests.el
index d15e3d7719..fb268c58f9 100644
--- a/test/lisp/emacs-lisp/rx-tests.el
+++ b/test/lisp/emacs-lisp/rx-tests.el
@@ -33,6 +33,28 @@
                                   (number-sequence ?< ?\])
                                   (number-sequence ?- ?:))))))
 
+(ert-deftest rx-char-any-range-nl ()
+  "Character alternatives with \n as a range endpoint."
+  (should (equal (rx (any "\n-\r"))
+                 "[\n-\r]"))
+  (should (equal (rx (any "\a-\n"))
+                 "[\a-\n]")))
+
+(ert-deftest rx-char-any-raw-byte ()
+  "Raw bytes in character alternatives."
+  ;; Separate raw characters.
+  (should (equal (string-match-p (rx (any "\326A\333B"))
+                                 "X\326\333")
+                 1))
+  ;; Range of raw characters, unibyte.
+  (should (equal (string-match-p (rx (any "\200-\377"))
+                                 "ÿA\310B")
+                 2))
+  ;; Range of raw characters, multibyte.
+  (should (equal (string-match-p (rx (any "Å\211\326-\377\177"))
+                                 "XY\355\177\327")
+                 2)))
+
 (ert-deftest rx-pcase ()
   (should (equal (pcase "a 1 2 3 1 1 b"
                    ((rx (let u (+ digit)) space

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* bug#33205: 26.1; unibyte/multibyte missing in rx.el
  2018-11-19 20:07             ` Mattias Engdegård
@ 2018-12-08  8:56               ` Eli Zaretskii
  2018-12-08  9:23                 ` Mattias Engdegård
  2018-12-28 18:17                 ` Mattias Engdegård
  2018-12-29  9:23               ` Eli Zaretskii
  1 sibling, 2 replies; 17+ messages in thread
From: Eli Zaretskii @ 2018-12-08  8:56 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: 33205

> From: Mattias Engdegård <mattiase@acm.org>
> Date: Mon, 19 Nov 2018 21:07:39 +0100
> Cc: 33205@debbugs.gnu.org
> 
> I tried using rx to match raw bytes. (rx (any (?\200 . ?\377))) doesn't work, since that is translated to the corresponding Unicode range; (any (#x3fff80 . #x3fffff)) must be used instead. Maybe that is evident, or would it merit a mention in the doc string?
> 
> The alternative formulation (rx (any "\200-\377")) doesn't work either, and this seems to be a bug. Looking at rx-check-any-string, a second bug is revealed: the code uses the regex ".-." to pick out ranges, which means that \n cannot be a range endpoint.
> 
> Perhaps you want me to open a new bug for the above? I'm attaching a patch all the same, but you may prefer doing it differently.

Thanks.  For a patch of this size, we would need a copyright
assignment from you.  Would you like to start the legal paperwork for
that?





^ permalink raw reply	[flat|nested] 17+ messages in thread

* bug#33205: 26.1; unibyte/multibyte missing in rx.el
  2018-12-08  8:56               ` Eli Zaretskii
@ 2018-12-08  9:23                 ` Mattias Engdegård
  2018-12-08 11:11                   ` Eli Zaretskii
  2018-12-28 18:17                 ` Mattias Engdegård
  1 sibling, 1 reply; 17+ messages in thread
From: Mattias Engdegård @ 2018-12-08  9:23 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 33205

8 dec. 2018 kl. 09.56 skrev Eli Zaretskii <eliz@gnu.org>:
> 
> Thanks.  For a patch of this size, we would need a copyright
> assignment from you.  Would you like to start the legal paperwork for
> that?

Yes please. Will you help me, or shall I contact someone in particular?






^ permalink raw reply	[flat|nested] 17+ messages in thread

* bug#33205: 26.1; unibyte/multibyte missing in rx.el
  2018-12-08  9:23                 ` Mattias Engdegård
@ 2018-12-08 11:11                   ` Eli Zaretskii
  0 siblings, 0 replies; 17+ messages in thread
From: Eli Zaretskii @ 2018-12-08 11:11 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: 33205

> From: Mattias Engdegård <mattiase@acm.org>
> Date: Sat, 8 Dec 2018 10:23:18 +0100
> Cc: 33205@debbugs.gnu.org
> 
> 8 dec. 2018 kl. 09.56 skrev Eli Zaretskii <eliz@gnu.org>:
> > 
> > Thanks.  For a patch of this size, we would need a copyright
> > assignment from you.  Would you like to start the legal paperwork for
> > that?
> 
> Yes please. Will you help me, or shall I contact someone in particular?

Thanks.  Form and instructions sent off-list.





^ permalink raw reply	[flat|nested] 17+ messages in thread

* bug#33205: 26.1; unibyte/multibyte missing in rx.el
  2018-12-08  8:56               ` Eli Zaretskii
  2018-12-08  9:23                 ` Mattias Engdegård
@ 2018-12-28 18:17                 ` Mattias Engdegård
  2018-12-29  9:24                   ` Eli Zaretskii
  1 sibling, 1 reply; 17+ messages in thread
From: Mattias Engdegård @ 2018-12-28 18:17 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 33205

lör 2018-12-08 klockan 10:56 +0200 skrev Eli Zaretskii:
> 
> Thanks.  For a patch of this size, we would need a copyright
> assignment from you.  Would you like to start the legal paperwork for
> that?

The paperwork has now been processed, I'm told.







^ permalink raw reply	[flat|nested] 17+ messages in thread

* bug#33205: 26.1; unibyte/multibyte missing in rx.el
  2018-11-19 20:07             ` Mattias Engdegård
  2018-12-08  8:56               ` Eli Zaretskii
@ 2018-12-29  9:23               ` Eli Zaretskii
  2018-12-29 10:43                 ` Mattias Engdegård
  1 sibling, 1 reply; 17+ messages in thread
From: Eli Zaretskii @ 2018-12-29  9:23 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: 33205

> From: Mattias Engdegård <mattiase@acm.org>
> Date: Mon, 19 Nov 2018 21:07:39 +0100
> Cc: 33205@debbugs.gnu.org
> 
> Perhaps you want me to open a new bug for the above? I'm attaching a patch all the same, but you may prefer doing it differently.

Thanks, I have some comments.

First, please provide a ChangeLog-style commit log message describing
the changes.  See CONTRIBUTE for more details.

> +  "Turn a string argument to `any' into a list of characters and, representing
> +ranges, dotted pairs of characters. The original order is not preserved."

The first line of a doc string should be a single full sentence, and
it should mention the arguments of the function.  Also, the first
sentence confused me: what do you mean by this part:

 "... and, representing ranges, dotted pairs of characters"

Finally, please use the US English convention of leaving 2 spaces
between sentences in the documentation.

> +  (let ((decode-char
> +         ;; Make sure raw bytes are decoded as such, to avoid confusion with
> +         ;; U+0080..U+00FF.
> +         (if (multibyte-string-p str)
> +             #'identity
> +           (lambda (c) (if (and (>= c #x80) (<= c #xff))
> +                           (+ c #x3fff00)
> +                         c))))
> +        (len (length str))
> +        (i 0)
> +        (ret nil))
> +    (while (< i len)
> +      (cond ((and (< i (- len 2))
> +                  (= (aref str (+ i 1)) ?-))
> +             ;; Range.
> +             (let ((start (funcall decode-char (aref str i)))
> +                   (end   (funcall decode-char (aref str (+ i 2)))))
> +               (cond ((< start end) (push (cons start end) ret))
> +                     ((= start end) (push start ret)))
> +               (setq i (+ i 3))))
> +            (t
> +             ;; Single character.
> +             (push (funcall decode-char (aref str i)) ret)
> +             (setq i (+ i 1)))))
> +    ret))

This seems to have dropped the validity check which signaled an error
in the original code?  Any reason for that?

Thanks.





^ permalink raw reply	[flat|nested] 17+ messages in thread

* bug#33205: 26.1; unibyte/multibyte missing in rx.el
  2018-12-28 18:17                 ` Mattias Engdegård
@ 2018-12-29  9:24                   ` Eli Zaretskii
  0 siblings, 0 replies; 17+ messages in thread
From: Eli Zaretskii @ 2018-12-29  9:24 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: 33205

> From: Mattias Engdegård <mattiase@acm.org>
> Cc: 33205@debbugs.gnu.org
> Date: Fri, 28 Dec 2018 19:17:55 +0100
> 
> lör 2018-12-08 klockan 10:56 +0200 skrev Eli Zaretskii:
> > 
> > Thanks.  For a patch of this size, we would need a copyright
> > assignment from you.  Would you like to start the legal paperwork for
> > that?
> 
> The paperwork has now been processed, I'm told.

Right, I posted a few minor comments; after fixing this, we can
install your changes.

Thanks.





^ permalink raw reply	[flat|nested] 17+ messages in thread

* bug#33205: 26.1; unibyte/multibyte missing in rx.el
  2018-12-29  9:23               ` Eli Zaretskii
@ 2018-12-29 10:43                 ` Mattias Engdegård
  2018-12-29 14:55                   ` Eli Zaretskii
  0 siblings, 1 reply; 17+ messages in thread
From: Mattias Engdegård @ 2018-12-29 10:43 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 33205

[-- Attachment #1: Type: text/plain, Size: 763 bytes --]

29 dec. 2018 kl. 10.23 skrev Eli Zaretskii <eliz@gnu.org>:
> 
> First, please provide a ChangeLog-style commit log message describing
> the changes.  See CONTRIBUTE for more details.

Done.

> The first line of a doc string should be a single full sentence, and
> it should mention the arguments of the function.  Also, the first
> sentence confused me: what do you mean by this part:
> 
> "... and, representing ranges, dotted pairs of characters"
> 
> Finally, please use the US English convention of leaving 2 spaces
> between sentences in the documentation.

All done.

> This seems to have dropped the validity check which signaled an error
> in the original code?  Any reason for that?

Just an oversight; check reinstated.

Thank you; new patch attached.


[-- Attachment #2: 0001-Handle-raw-bytes-and-LF-in-ranges-in-rx-any-argument.patch --]
[-- Type: application/octet-stream, Size: 4242 bytes --]

From f96258853db47aee2935e5e07876bb97fbed2640 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Mattias=20Engdeg=C3=A5rd?= <mattiase@acm.org>
Date: Sat, 29 Dec 2018 11:09:27 +0100
Subject: [PATCH] Handle raw bytes, and LF in ranges, in rx `any' argument
 strings

* lisp/emacs-lisp/rx.el (rx-check-any-string): Rewrite to handle raw bytes
in unibyte strings and accept LF as range endpoints (Bug#33205).
* test/lisp/emacs-lisp/rx-tests.el: Add tests for the above.
---
 lisp/emacs-lisp/rx.el            | 51 ++++++++++++++++++--------------
 test/lisp/emacs-lisp/rx-tests.el | 22 ++++++++++++++
 2 files changed, 51 insertions(+), 22 deletions(-)

diff --git a/lisp/emacs-lisp/rx.el b/lisp/emacs-lisp/rx.el
index 1230df4f15..1cae22f870 100644
--- a/lisp/emacs-lisp/rx.el
+++ b/lisp/emacs-lisp/rx.el
@@ -449,28 +449,35 @@ Only both edges of each range is checked."
 
 
 (defun rx-check-any-string (str)
-  "Check string argument STR for Rx `any'."
-  (let ((i 0)
-	c1 c2 l)
-    (if (= 0 (length str))
-	(error "String arg for Rx `any' must not be empty"))
-    (while (string-match ".-." str i)
-      ;; string before range: convert it to characters
-      (if (< i (match-beginning 0))
-	  (setq l (nconc
-		   l
-		   (append (substring str i (match-beginning 0)) nil))))
-      ;; range
-      (setq i (match-end 0)
-	    c1 (aref str (match-beginning 0))
-	    c2 (aref str (1- i)))
-      (cond
-       ((< c1 c2) (setq l (nconc l (list (cons c1 c2)))))
-       ((= c1 c2) (setq l (nconc l (list c1))))))
-    ;; rest?
-    (if (< i (length str))
-	(setq l (nconc l (append (substring str i) nil))))
-    l))
+  "Turn the `any' argument string STR into a list of characters.
+The original order is not preserved.  Ranges, \"A-Z\", become pairs, (?A . ?Z)."
+  (let ((decode-char
+         ;; Make sure raw bytes are decoded as such, to avoid confusion with
+         ;; U+0080..U+00FF.
+         (if (multibyte-string-p str)
+             #'identity
+           (lambda (c) (if (<= #x80 c #xff)
+                           (+ c #x3fff00)
+                         c))))
+        (len (length str))
+        (i 0)
+        (ret nil))
+    (if (= 0 len)
+        (error "String arg for Rx `any' must not be empty"))
+    (while (< i len)
+      (cond ((and (< i (- len 2))
+                  (= (aref str (+ i 1)) ?-))
+             ;; Range.
+             (let ((start (funcall decode-char (aref str i)))
+                   (end   (funcall decode-char (aref str (+ i 2)))))
+               (cond ((< start end) (push (cons start end) ret))
+                     ((= start end) (push start ret)))
+               (setq i (+ i 3))))
+            (t
+             ;; Single character.
+             (push (funcall decode-char (aref str i)) ret)
+             (setq i (+ i 1)))))
+    ret))
 
 
 (defun rx-check-any (arg)
diff --git a/test/lisp/emacs-lisp/rx-tests.el b/test/lisp/emacs-lisp/rx-tests.el
index d15e3d7719..8b3ce6cb01 100644
--- a/test/lisp/emacs-lisp/rx-tests.el
+++ b/test/lisp/emacs-lisp/rx-tests.el
@@ -33,6 +33,28 @@
                                   (number-sequence ?< ?\])
                                   (number-sequence ?- ?:))))))
 
+(ert-deftest rx-char-any-range-nl ()
+  "Test character alternatives with LF as a range endpoint."
+  (should (equal (rx (any "\n-\r"))
+                 "[\n-\r]"))
+  (should (equal (rx (any "\a-\n"))
+                 "[\a-\n]")))
+
+(ert-deftest rx-char-any-raw-byte ()
+  "Test raw bytes in character alternatives."
+  ;; Separate raw characters.
+  (should (equal (string-match-p (rx (any "\326A\333B"))
+                                 "X\326\333")
+                 1))
+  ;; Range of raw characters, unibyte.
+  (should (equal (string-match-p (rx (any "\200-\377"))
+                                 "ÿA\310B")
+                 2))
+  ;; Range of raw characters, multibyte.
+  (should (equal (string-match-p (rx (any "Å\211\326-\377\177"))
+                                 "XY\355\177\327")
+                 2)))
+
 (ert-deftest rx-pcase ()
   (should (equal (pcase "a 1 2 3 1 1 b"
                    ((rx (let u (+ digit)) space
-- 
2.17.2 (Apple Git-113)


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* bug#33205: 26.1; unibyte/multibyte missing in rx.el
  2018-12-29 10:43                 ` Mattias Engdegård
@ 2018-12-29 14:55                   ` Eli Zaretskii
  0 siblings, 0 replies; 17+ messages in thread
From: Eli Zaretskii @ 2018-12-29 14:55 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: 33205-done

> From: Mattias Engdegård <mattiase@acm.org>
> Date: Sat, 29 Dec 2018 11:43:56 +0100
> Cc: 33205@debbugs.gnu.org
> 
> Thank you; new patch attached.

Thanks, pushed to the master branch.





^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2018-12-29 14:55 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2018-10-30 15:03 bug#33205: 26.1; unibyte/multibyte missing in rx.el Mattias Engdegård
2018-10-30 17:27 ` Eli Zaretskii
2018-10-31 15:27   ` Mattias Engdegård
2018-10-31 15:55     ` Eli Zaretskii
2018-11-05 16:49       ` Eli Zaretskii
2018-11-07 18:08         ` Mattias Engdegård
2018-11-07 19:10           ` Eli Zaretskii
2018-11-07 20:19             ` Mattias Engdegård
2018-11-19 20:07             ` Mattias Engdegård
2018-12-08  8:56               ` Eli Zaretskii
2018-12-08  9:23                 ` Mattias Engdegård
2018-12-08 11:11                   ` Eli Zaretskii
2018-12-28 18:17                 ` Mattias Engdegård
2018-12-29  9:24                   ` Eli Zaretskii
2018-12-29  9:23               ` Eli Zaretskii
2018-12-29 10:43                 ` Mattias Engdegård
2018-12-29 14:55                   ` Eli Zaretskii

Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/emacs.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).