From: Eli Zaretskii <eliz@gnu.org>
To: "Mattias Engdegård" <mattiase@acm.org>
Cc: monnier@iro.umontreal.ca, 3687@debbugs.gnu.org
Subject: bug#3687: 23.1.50; inconsistency in multibyte eight-bit regexps [PATCH]
Date: Fri, 28 Jun 2019 16:03:54 +0300
Message-ID: <831rzdj1z9.fsf@gnu.org>
In-Reply-To: <E668D31C-1511-404B-AE6C-BBFB807B293E@acm.org> (message from Mattias Engdegård on Fri, 28 Jun 2019 14:41:51 +0200)
> From: Mattias Engdegård <mattiase@acm.org>
> Date: Fri, 28 Jun 2019 14:41:51 +0200
> Cc: 3687@debbugs.gnu.org
>
> Let's assume the following semantics as desirable:
>
> 1. All characters and raw bytes (up to regexp syntax) match themselves no matter whether they are given as literals or in character alternatives.
> 2. All raw bytes C match themselves and nothing else no matter whether the pattern or target string/buffer are unibyte or multibyte.
> 3. Ranges from ASCII to raw bytes work as expected and do not contain Unicode characters above U+007F.
> 4. Ranges from non-ASCII Unicode characters to raw bytes make no sense and are treated as empty.
>
> Here is a patch.
Thanks.
However, I don't want to look at the patch before we discuss and agree
on the principles. So please consider expanding your principles to
answer the following questions:
  1. What do you mean by "raw bytes"? Is #xab a raw byte or a Unicode
     point U+00AB? IOW, how do we distinguish, in a regexp, between a
     raw byte and a character whose Unicode codepoint is that byte's
     value? And how does one go about concocting a regexp that matches
     raw bytes in a unibyte or multibyte buffer or string?
  2. What is meant by "ranges from ASCII to raw bytes"? Which
     characters are included in such ranges?
  3. If ranges from non-ASCII characters to raw bytes make no sense,
     how would one go about specifying a range that includes all the
     characters and raw bytes supported by Emacs?
When we discuss these issues, let's please be on the same page
regarding the handling of raw bytes in current Emacs. Specifically:
  . Raw bytes are internally treated as "characters" whose Unicode
    codepoints are in the range [#x3fff00..#x3fffff].
  . The internal representation of raw bytes in buffers and strings
    uses 2-byte sequences that begin with #xc0 or #xc1.
  . Emacs jumps through hoops to never expose the above internals to
    the external world. Thus, any encoding of a string with raw bytes
    will convert them to their single-byte representation, where they
    are indistinguishable from the characters which have the same
    codepoints, and many operations other than encoding also
    silently perform these conversions.
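
The three points above can be modeled in a few lines. What follows is a minimal Python sketch, not Emacs code: the function names are hypothetical, the restriction of raw bytes to #x80..#xff and the exact bit layout of the 2-byte sequence are assumptions about the internal format that the message itself does not spell out, and only the codepoint range and the #xc0/#xc1 lead bytes come from the text.

```python
# Model of how raw bytes relate to "characters" and to the internal 2-byte form.
# Assumption: raw bytes are the values 0x80..0xFF, so they occupy the upper half
# of the [#x3fff00..#x3fffff] range mentioned in the message.
RAW_BYTE_OFFSET = 0x3FFF00


def raw_byte_to_char(byte: int) -> int:
    """Map a raw byte to the internal 'character' code that stands for it."""
    if not 0x80 <= byte <= 0xFF:
        raise ValueError("only bytes 0x80..0xFF are modeled as raw bytes")
    return byte + RAW_BYTE_OFFSET


def char_to_raw_byte(char: int) -> int:
    """Inverse of raw_byte_to_char."""
    if not 0x3FFF80 <= char <= 0x3FFFFF:
        raise ValueError("not a raw-byte character code")
    return char - RAW_BYTE_OFFSET


def raw_byte_internal(byte: int) -> bytes:
    """Assumed internal 2-byte sequence: lead byte 0xC0 or 0xC1, then a trail byte."""
    if not 0x80 <= byte <= 0xFF:
        raise ValueError("only bytes 0x80..0xFF are modeled as raw bytes")
    lead = 0xC0 | ((byte >> 6) & 0x01)   # bit 6 of the byte selects 0xC0 vs 0xC1
    trail = 0x80 | (byte & 0x3F)         # low six bits go into the trail byte
    return bytes([lead, trail])


def internal_to_raw_byte(seq: bytes) -> int:
    """Decode the assumed 2-byte sequence back to the single raw byte."""
    lead, trail = seq
    if lead not in (0xC0, 0xC1) or not 0x80 <= trail <= 0xBF:
        raise ValueError("not a raw-byte sequence")
    return 0x80 | ((lead & 0x01) << 6) | (trail & 0x3F)


if __name__ == "__main__":
    print(hex(raw_byte_to_char(0xAB)))
    print(raw_byte_internal(0xAB).hex())
    # Once reduced to a single byte, raw byte 0xAB and U+00AB encoded as
    # Latin-1 are the same octet, which is the ambiguity question 1 asks about.
    print(bytes([internal_to_raw_byte(raw_byte_internal(0xAB))])
          == "\u00ab".encode("latin-1"))
```

In this model the raw byte #xab and the character U+00AB are distinct codes (#x3fffab versus #xab), yet both collapse to the same octet once the raw byte is converted to its single-byte form.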
Thread overview: 17+ messages
2009-06-26 9:56 bug#3687: 23.1.50; inconsistency in multibyte eight-bit regexps YAMAMOTO Mitsuharu
2009-06-26 13:43 ` Eli Zaretskii
2009-06-27 1:30 ` YAMAMOTO Mitsuharu
2009-06-27 9:36 ` Eli Zaretskii
2009-06-29 3:02 ` YAMAMOTO Mitsuharu
2009-06-29 8:47 ` Stefan Monnier
2009-07-24 1:08 ` YAMAMOTO Mitsuharu
2019-06-28 12:41 ` bug#3687: 23.1.50; inconsistency in multibyte eight-bit regexps [PATCH] Mattias Engdegård
2019-06-28 13:03 ` Eli Zaretskii [this message]
2019-06-28 14:05 ` Mattias Engdegård
2019-06-28 14:40 ` Eli Zaretskii
2019-06-28 15:00 ` Mattias Engdegård
2019-06-28 16:20 ` Eli Zaretskii
2019-06-28 16:47 ` Mattias Engdegård
2019-06-28 14:56 ` Eli Zaretskii
2019-06-28 15:18 ` Stefan Monnier
2019-06-28 15:34 ` Mattias Engdegård