all messages for Emacs-related lists mirrored at yhetil.org
 help / color / mirror / code / Atom feed
From: "Mattias Engdegård" <mattiase@acm.org>
To: Eli Zaretskii <eliz@gnu.org>
Cc: 33205@debbugs.gnu.org
Subject: bug#33205: 26.1; unibyte/multibyte missing in rx.el
Date: Wed, 7 Nov 2018 19:08:43 +0100	[thread overview]
Message-ID: <A3F938BE-DBD0-455D-9899-6CBA84ADF515@acm.org> (raw)
In-Reply-To: <83pnvjcvwc.fsf@gnu.org>

5 nov. 2018 kl. 17.49 skrev Eli Zaretskii <eliz@gnu.org>:
> After looking into this, my conclusion is that what I wrote above was
> not too wrong.  Indeed, currently [:ascii:]/[:nonascii:] cannot be
> distinguished from [:unibyte:]/[:multibyte:].  In a nutshell, it turns
> out [:unibyte:] is not what one might think it is, you can see that in
> re_wctype_to_bit, for example.

Thank you very much for taking your time to look at this, and for the detailed answer.
My apologies for severely complicating what I initially thought was quite a trifle!

> That ^[:ascii:] is not the same as [:nonascii:], and the same with
> [:unibyte:] vs ^[:multibyte:], is arguably a bug.  The reason for that
> becomes clear if you look at how we generate the fastmap in each of
> these cases and how we set the bits in the work-area of the range
> table, but I don't know enough to say how easy would it be to fix
> that.
> 
> An alternative is to use an explicit character class, as in \000-\377,
> that works as you'd expect.

I'm not sure what I expected [\000-\377] to mean in a multibyte string; one endpoint is ASCII and the other is a raw byte. It does work, as you noted, because two ranges are generated, as if written [\000-\177\200-\377].

In old Emacs versions (I tried 22.1.1), [:unibyte:] appears to include raw bytes in multibyte strings/buffers, and everything in unibyte strings/buffers (aka [\000-\377] in both cases), and [:multibyte:] the complement of that. Thus, at some point the behaviour changed, but I cannot find any NEWS reference to it. It could have been an accident.
Perhaps those char classes didn't see much use.

The old behaviour seems a little more intuitive, but it must be rare to need regex matching of rubbish bytes in multibyte strings. If you could argue that the status quo is fine then I wouldn't necessarily object, but would suggest that at least the code be made explicit about it (and the documentation, as well).

> Well, what do you think now?  Is it worth adding those to rx.el? I'm
> not sure.  How important is it to find unibyte characters in a string,
> anyway?

Unless we manage to make [:unibyte:]/[:multibyte:] more useful in their own right, it's fine to leave rx.el as is, as far as I'm concerned. There is no loss of expressivity.







  reply	other threads:[~2018-11-07 18:08 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-10-30 15:03 bug#33205: 26.1; unibyte/multibyte missing in rx.el Mattias Engdegård
2018-10-30 17:27 ` Eli Zaretskii
2018-10-31 15:27   ` Mattias Engdegård
2018-10-31 15:55     ` Eli Zaretskii
2018-11-05 16:49       ` Eli Zaretskii
2018-11-07 18:08         ` Mattias Engdegård [this message]
2018-11-07 19:10           ` Eli Zaretskii
2018-11-07 20:19             ` Mattias Engdegård
2018-11-19 20:07             ` Mattias Engdegård
2018-12-08  8:56               ` Eli Zaretskii
2018-12-08  9:23                 ` Mattias Engdegård
2018-12-08 11:11                   ` Eli Zaretskii
2018-12-28 18:17                 ` Mattias Engdegård
2018-12-29  9:24                   ` Eli Zaretskii
2018-12-29  9:23               ` Eli Zaretskii
2018-12-29 10:43                 ` Mattias Engdegård
2018-12-29 14:55                   ` Eli Zaretskii

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=A3F938BE-DBD0-455D-9899-6CBA84ADF515@acm.org \
    --to=mattiase@acm.org \
    --cc=33205@debbugs.gnu.org \
    --cc=eliz@gnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this external index

	https://git.savannah.gnu.org/cgit/emacs.git
	https://git.savannah.gnu.org/cgit/emacs/org-mode.git

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.