unofficial mirror of bug-gnu-emacs@gnu.org 
From: Eli Zaretskii <eliz@gnu.org>
To: "Mattias Engdegård" <mattiase@acm.org>
Cc: monnier@iro.umontreal.ca, 3687@debbugs.gnu.org
Subject: bug#3687: 23.1.50; inconsistency in multibyte eight-bit regexps [PATCH]
Date: Fri, 28 Jun 2019 16:03:54 +0300
Message-ID: <831rzdj1z9.fsf@gnu.org>
In-Reply-To: <E668D31C-1511-404B-AE6C-BBFB807B293E@acm.org> (message from Mattias Engdegård on Fri, 28 Jun 2019 14:41:51 +0200)

> From: Mattias Engdegård <mattiase@acm.org>
> Date: Fri, 28 Jun 2019 14:41:51 +0200
> Cc: 3687@debbugs.gnu.org
> 
> Let's assume the following semantics as desirable:
> 
> 1. All characters and raw bytes (up to regexp syntax) match themselves no matter whether they are given as literals or in character alternatives.
> 2. All raw bytes C match themselves and nothing else no matter whether the pattern or target string/buffer are unibyte or multibyte.
> 3. Ranges from ASCII to raw bytes work as expected and do not contain Unicode characters above U+007F.
> 4. Ranges from non-ASCII Unicode characters to raw bytes make no sense and are treated as empty.
> 
> Here is a patch.

Thanks.

However, I don't want to look at the patch before we discuss and agree
on the principles.  So please consider expanding your principles to
answer the following questions:

 1. What do you mean by "raw bytes"?  Is #xab a raw byte or a Unicode
    point U+00AB?  IOW, how do we distinguish, in a regexp, between a
    raw byte and a character whose Unicode codepoint is that byte's
    value?  And how does one go about concocting a regexp that matches
    raw bytes in a unibyte or multibyte buffer or string?

 2. What is meant by "ranges from ASCII to raw bytes"?  Which
    characters are included in such ranges?

 3. If ranges from non-ASCII characters to raw bytes make no sense,
    how would one go about specifying a range that includes all the
    characters and raw bytes supported by Emacs?

When we discuss these issues, let's please be on the same page
regarding the handling of raw bytes in current Emacs.  Specifically:

  . Raw bytes are internally treated as "characters" whose Unicode
    codepoints are in the range [#x3fff00..#x3fffff].
  . The internal representation of raw bytes in buffers and strings
    uses 2-byte sequences that begin with #xc0 or #xc1.
  . Emacs jumps through hoops to never expose the above internals to
    the external world.  Thus, any encoding of a string with raw bytes
    will convert them to their single-byte representation, where they
    are indistinguishable from the characters which have the same
    codepoints, and many operations other than encoding also
    silently perform these conversions.
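
To make the three points above concrete, here is a small model of them
in Python.  It is only a sketch, not Emacs code: the function names are
hypothetical, and the constants and bit layout are assumptions modelled
on my reading of the macros in src/character.h.  It also assumes that
only byte values #x80..#xff count as raw bytes, so only the upper half
of the codepoint range quoted above is actually populated.

```python
# Illustrative model of raw-byte handling; all names here are hypothetical.
# Assumption: a raw byte B (#x80..#xff) is the "character" B + #x3fff00.
BYTE8_OFFSET = 0x3FFF00


def raw_byte_to_char(byte):
    """Return the internal character code standing for raw byte BYTE."""
    if not 0x80 <= byte <= 0xFF:
        raise ValueError("only #x80..#xff are treated as raw bytes here")
    return byte + BYTE8_OFFSET


def raw_byte_internal(byte):
    """Return the assumed 2-byte internal sequence for raw byte BYTE.

    The lead byte is #xc0 or #xc1 depending on bit 6 of BYTE; the
    trailing byte carries the low six bits.
    """
    if not 0x80 <= byte <= 0xFF:
        raise ValueError("only #x80..#xff are treated as raw bytes here")
    return bytes([0xC0 | ((byte >> 6) & 1), 0x80 | (byte & 0x3F)])


def external_byte(char_code):
    """Model what leaves Emacs when CHAR_CODE is encoded as Latin-1.

    Raw-byte characters collapse to their single byte, so they become
    indistinguishable from the Latin-1 character with the same value.
    """
    if char_code >= BYTE8_OFFSET + 0x80:
        return bytes([char_code - BYTE8_OFFSET])
    return chr(char_code).encode("latin-1")


if __name__ == "__main__":
    raw = raw_byte_to_char(0xAB)
    print(hex(raw))                          # raw byte #xab as a character
    print(raw_byte_internal(0xAB).hex())     # its internal 2-byte form
    print("\u00ab".encode("utf-8").hex())    # U+00AB, for comparison
    print(external_byte(raw) == external_byte(0xAB))
```

Inside this model the raw byte #xab and the character U+00AB are
distinct, both as codepoints and as byte sequences, but the distinction
is lost as soon as either one is encoded for the outside world.  That
is the ambiguity question 1 above asks about.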





