* possible bugfix for CHAR_LEADING_CODE() for multibyte raw bytes
From: Danny McClanahan @ 2024-04-28 10:14 UTC
To: emacs-devel@gnu.org
Hello emacs-devel,
I have been getting familiar with the super simple and pragmatic Emacs multibyte encoding scheme and the work done in regex-emacs.c to manipulate it. I am now pretty confident that a byte-based automaton should be able to match against Emacs multibyte buffers (which iiuc are what most user-visible buffers are encoded in). This is super useful because it means:
(1) Emacs can make use of other external libraries which perform byte string matching without enforcing UTF-8 encoding (not just the one I'm working on).
(2) The matching loop can iterate directly over bytes in many cases instead of counting multibyte codepoints (I believe regex-emacs.c counts multibyte codepoints in more cases than necessary, but not sure of this yet).
While implementing the multibyte encoding scheme, I found what seemed to be an error in `CHAR_LEADING_CODE()` for raw bytes (multibyte codepoints encoded using two bytes which encode a single raw byte), where the multibyte code `c` isn't shifted downward the way it is in `char_string()` (see diff for fix at bottom).
This seems never to be triggered because regex-emacs.c never calls `CHAR_LEADING_CODE()` when generating the fastmap for raw bytes (e.g. lines 3095-3098):
```c
/* Cover the case of matching a raw char in a
   multibyte regexp against unibyte.  */
if (CHAR_BYTE8_HEAD_P (p[1]))
  data->fastmap[CHAR_TO_BYTE8 (STRING_CHAR (p + 1))] = 1;
```
Not sure if I've got this right, so please correct me if not!
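For readers following along, here is a minimal sketch of the two-byte raw-byte encoding discussed above. It assumes the layout I understand `char_string()` to use for raw bytes: a leading byte of 0xC0 or 0xC1 carrying bit 6 of the raw byte, and a trailing byte carrying the low six bits. The helper names (`raw_byte_encode`, `raw_byte_decode`, `is_raw_byte_head`, `roundtrip_ok`) are hypothetical and are not Emacs functions.

```c
#include <assert.h>

/* Hypothetical helper: encode raw byte B (0x80..0xFF) as two bytes.
   Assumed layout: lead = 0xC0 | bit 6 of B, trail = 0x80 | low 6 bits.  */
static void
raw_byte_encode (unsigned char b, unsigned char out[2])
{
  assert (b >= 0x80);
  out[0] = 0xC0 | ((b >> 6) & 0x01);   /* 0xC0 or 0xC1 */
  out[1] = 0x80 | (b & 0x3F);          /* low six bits of B */
}

/* Hypothetical helper: nonzero if LEAD starts a raw-byte sequence.  */
static int
is_raw_byte_head (unsigned char lead)
{
  return lead == 0xC0 || lead == 0xC1;
}

/* Hypothetical helper: recover the raw byte from its two-byte form.
   Bit 7 is always set because raw bytes are in 0x80..0xFF.  */
static unsigned char
raw_byte_decode (const unsigned char in[2])
{
  assert (is_raw_byte_head (in[0]));
  return 0x80 | ((in[0] & 0x01) << 6) | (in[1] & 0x3F);
}

/* Return 1 if every raw byte survives an encode/decode round trip.  */
static int
roundtrip_ok (void)
{
  for (int b = 0x80; b <= 0xFF; b++)
    {
      unsigned char buf[2];
      raw_byte_encode ((unsigned char) b, buf);
      if (!is_raw_byte_head (buf[0]) || raw_byte_decode (buf) != b)
        return 0;
    }
  return 1;
}
```

Because the leading byte can only be 0xC0 or 0xC1, a byte-oriented matcher could in principle recognize raw bytes from the first byte alone, which is the property the fastmap snippet above relies on.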
(Note: the below patch requires multiple trailing newlines in order to apply successfully, so I've enclosed it within "-----". I'm not sure how to avoid this: I just used `git diff master..`.)
-------------------------
diff --git a/src/character.h b/src/character.h
index 6d0f035c2bb..d7a6a4f525c 100644
--- a/src/character.h
+++ b/src/character.h
@@ -216,7 +216,7 @@ CHAR_LEADING_CODE (int c)
           : c <= MAX_3_BYTE_CHAR ? 0xE0 | (c >> 12)
           : c <= MAX_4_BYTE_CHAR ? 0xF0 | (c >> 18)
           : c <= MAX_5_BYTE_CHAR ? 0xF8
-          : 0xC0 | ((c >> 6) & 0x01));
+          : 0xC0 | ((CHAR_TO_BYTE8(c) >> 6) & 0x01));
 }
--------------------------
* Re: possible bugfix for CHAR_LEADING_CODE() for multibyte raw bytes
From: Eli Zaretskii @ 2024-04-28 14:20 UTC
To: Danny McClanahan; +Cc: emacs-devel
> Date: Sun, 28 Apr 2024 10:14:53 +0000
> From: Danny McClanahan <dmcc2@hypnicjerk.ai>
>
> While implementing the multibyte encoding scheme, I found what seemed to be an error in `CHAR_LEADING_CODE()` for raw bytes (multibyte codepoints encoded using two bytes which encode a single raw byte), where the multibyte code `c` isn't shifted downward the way it is in `char_string()` (see diff for fix at bottom).
Does the change you propose actually affect the result? The
"((c >> 6) & 0x01)" part takes just the LSB of (c >> 6), and
that doesn't seem to be affected by running c through
CHAR_TO_BYTE8, does it?
The leading byte of a multibyte sequence for raw bytes is either 0xC0
or 0xC1. That's what that part is trying to compute. Am I missing
something?
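The point above can be checked exhaustively with a small sketch. It assumes the constants as I understand them from src/character.h (raw-byte characters occupy 0x3FFF80..0x3FFFFF, and a raw byte B is represented as B + 0x3FFF00); `char_to_byte8` below is a simplified stand-in for `CHAR_TO_BYTE8` that is valid only for raw-byte characters, and the other function names are hypothetical.

```c
#include <assert.h>

/* Assumed values, taken from my reading of src/character.h.  */
enum
{
  MAX_5_BYTE_CHAR = 0x3FFF7F,   /* last non-raw-byte character */
  MAX_CHAR = 0x3FFFFF,          /* last raw-byte character */
  BYTE8_OFFSET = 0x3FFF00       /* raw byte B is stored as B + offset */
};

/* Simplified stand-in for CHAR_TO_BYTE8; raw-byte characters only.  */
static int
char_to_byte8 (int c)
{
  assert (c > MAX_5_BYTE_CHAR && c <= MAX_CHAR);
  return c - BYTE8_OFFSET;
}

/* Leading byte as computed by the existing expression.  */
static int
leading_code_current (int c)
{
  return 0xC0 | ((c >> 6) & 0x01);
}

/* Leading byte as computed by the proposed patch.  */
static int
leading_code_patched (int c)
{
  return 0xC0 | ((char_to_byte8 (c) >> 6) & 0x01);
}

/* Count raw-byte characters for which the two versions disagree.  */
static int
count_mismatches (void)
{
  int mismatches = 0;
  for (int c = MAX_5_BYTE_CHAR + 1; c <= MAX_CHAR; c++)
    if (leading_code_current (c) != leading_code_patched (c))
      mismatches++;
  return mismatches;
}
```

Under these assumptions the two expressions agree for every raw-byte character: the low eight bits of 0x3FFF00 are zero, so subtracting it leaves bit 6 of `c` unchanged.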
* Re: possible bugfix for CHAR_LEADING_CODE() for multibyte raw bytes
From: Danny McClanahan @ 2024-07-27 12:27 UTC
To: Eli Zaretskii; +Cc: emacs-devel
> On Sunday, April 28th, 2024 at 10:20, Eli Zaretskii <eliz@gnu.org> wrote:
>
> Does the change you propose actually affect the result? The
> "((c >> 6) & 0x01)" part takes just the LSB of (c >> 6), and
> that doesn't seem to be affected by running c through
> CHAR_TO_BYTE8, does it?
>
> The leading byte of a multibyte sequence for raw bytes is either 0xC0
> or 0xC1. That's what that part is trying to compute. Am I missing
> something?
I think you're right! I was very narrowly focused on the text of the code and not the bits it was computing: it's definitely clear how that's computing just the LSB there. I'm still getting more familiar with bit-level manipulation, but I think that's clear enough now. I made this change in a branch and it hasn't changed any behavior AFAICT.
Sorry for missing this reply--thanks so much for your thorough follow-up here!