possible bugfix for CHAR_LEADING_CODE() for multibyte raw bytes

unofficial mirror of emacs-devel@gnu.org 

* possible bugfix for CHAR_LEADING_CODE() for multibyte raw bytes
@ 2024-04-28 10:14 Danny McClanahan
  2024-04-28 14:20 ` Eli Zaretskii
  0 siblings, 1 reply; 3+ messages in thread
From: Danny McClanahan @ 2024-04-28 10:14 UTC (permalink / raw)
  To: emacs-devel@gnu.org

Hello emacs-devel,

I have been getting familiar with the super simple and pragmatic Emacs multibyte encoding scheme and the work done in regex-emacs.c to manipulate it. I am now pretty confident that a byte-based automaton should be able to match against Emacs multibyte buffers (which iiuc are what most user-visible buffers are encoded in). This is super useful because it means:
(1) Emacs can make use of other external libraries which perform byte string matching without enforcing UTF-8 encoding (not just the one I'm working on).
(2) The matching loop can iterate directly over bytes in many cases instead of counting multibyte codepoints (I believe regex-emacs.c counts multibyte codepoints in more cases than necessary, but I'm not sure of this yet).

While implementing the multibyte encoding scheme, I found what seemed to be an error in `CHAR_LEADING_CODE()` for raw bytes (multibyte codepoints encoded using two bytes which encode a single raw byte), where the multibyte code `c` isn't shifted downward the way it is in `char_string()` (see diff for fix at bottom).
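
For readers unfamiliar with the two-byte form mentioned above, here is a minimal sketch of how a raw byte in the range 0x80..0xFF maps to a two-byte sequence and back. The function names are hypothetical and are not part of Emacs; the bit layout is assumed to match what `char_string()` produces for raw bytes (a leading byte of 0xC0 or 0xC1 followed by one trailing byte).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not an Emacs function: encode raw byte B
   (0x80..0xFF) as the two-byte multibyte form.  Bit 6 of B selects
   the leading byte (0xC0 or 0xC1); the low 6 bits go in the trailer.  */
void
raw_byte_encode (uint8_t b, uint8_t out[2])
{
  out[0] = 0xC0 | ((b >> 6) & 0x01);
  out[1] = 0x80 | (b & 0x3F);
}

/* Hypothetical inverse: bit 7 of a raw byte is always set, bit 6
   comes from the leading byte, and the low 6 bits from the trailer.  */
uint8_t
raw_byte_decode (const uint8_t in[2])
{
  return (uint8_t) (0x80 | ((in[0] & 0x01) << 6) | (in[1] & 0x3F));
}

/* Round-trip every raw byte through the sketch above.  */
void
raw_byte_self_check (void)
{
  uint8_t buf[2];
  for (int b = 0x80; b <= 0xFF; b++)
    {
      raw_byte_encode ((uint8_t) b, buf);
      assert (raw_byte_decode (buf) == b);
    }
}
```

Under these assumptions, raw byte 0x80 becomes the sequence C0 80 and raw byte 0xFF becomes C1 BF.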

This seems never to be triggered because regex-emacs.c never calls `CHAR_LEADING_CODE()` when generating the fastmap for raw bytes (e.g. lines 3095-3098):
```c
	  /* Cover the case of matching a raw char in a
	     multibyte regexp against unibyte.	*/
	  if (CHAR_BYTE8_HEAD_P (p[1]))
	    data->fastmap[CHAR_TO_BYTE8 (STRING_CHAR (p + 1))] = 1;
```

Not sure if I've got this right, so please correct me if not!

(Note: the below patch requires multiple trailing newlines in order to apply successfully, so I've enclosed it within "-----". I'm not sure how to avoid this: I just used `git diff master..`.)

-------------------------
diff --git a/src/character.h b/src/character.h
index 6d0f035c2bb..d7a6a4f525c 100644
--- a/src/character.h
+++ b/src/character.h
@@ -216,7 +216,7 @@ CHAR_LEADING_CODE (int c)
 	  : c <= MAX_3_BYTE_CHAR ? 0xE0 | (c >> 12)
 	  : c <= MAX_4_BYTE_CHAR ? 0xF0 | (c >> 18)
 	  : c <= MAX_5_BYTE_CHAR ? 0xF8
-	  : 0xC0 | ((c >> 6) & 0x01));
+	  : 0xC0 | ((CHAR_TO_BYTE8(c) >> 6) & 0x01));
 }

--------------------------


* Re: possible bugfix for CHAR_LEADING_CODE() for multibyte raw bytes
  2024-04-28 10:14 possible bugfix for CHAR_LEADING_CODE() for multibyte raw bytes Danny McClanahan
@ 2024-04-28 14:20 ` Eli Zaretskii
  2024-07-27 12:27   ` Danny McClanahan
  0 siblings, 1 reply; 3+ messages in thread
From: Eli Zaretskii @ 2024-04-28 14:20 UTC (permalink / raw)
  To: Danny McClanahan; +Cc: emacs-devel

> Date: Sun, 28 Apr 2024 10:14:53 +0000
> From: Danny McClanahan <dmcc2@hypnicjerk.ai>
> 
> While implementing the multibyte encoding scheme, I found what seemed to be an error in `CHAR_LEADING_CODE()` for raw bytes (multibyte codepoints encoded using two bytes which encode a single raw byte), where the multibyte code `c` isn't shifted downward the way it is in `char_string()` (see diff for fix at bottom).

Does the change you propose actually affect the result?  The
"((c >> 6) & 0x01)" part takes just the LSB of (c >> 6), and
that doesn't seem to be affected by running c through
CHAR_TO_BYTE8, does it?

The leading byte of a multibyte sequence for raw bytes is either 0xC0
or 0xC1.  That's what that part is trying to compute.  Am I missing
something?
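
The point above can be checked exhaustively. The following is a standalone sketch, not code from Emacs: it assumes that raw bytes 0x80..0xFF are represented as the character codes 0x3FFF80..0x3FFFFF and that `CHAR_TO_BYTE8` amounts to subtracting 0x3FFF00 for those codes. The function and constant names are hypothetical. Because the low 8 bits of that offset are zero, bit 6 of the character code and bit 6 of the raw byte should always agree.

```c
#include <assert.h>

/* Assumed values, mirroring the representation of raw bytes as
   character codes; the names here are illustrative only.  */
enum
{
  RAW_BYTE_OFFSET = 0x3FFF00,	/* character code = raw byte + offset */
  FIRST_RAW_CHAR = 0x3FFF80,	/* raw byte 0x80 */
  LAST_RAW_CHAR = 0x3FFFFF	/* raw byte 0xFF */
};

/* The expression as it stands before the proposed patch.  */
int
leading_code_current (int c)
{
  return 0xC0 | ((c >> 6) & 0x01);
}

/* The expression from the proposed patch, with the conversion to a
   raw byte written out as a plain subtraction (an assumption).  */
int
leading_code_patched (int c)
{
  int byte = c - RAW_BYTE_OFFSET;
  return 0xC0 | ((byte >> 6) & 0x01);
}

/* Count the raw-byte character codes for which the two differ.  */
int
count_leading_code_mismatches (void)
{
  int mismatches = 0;
  for (int c = FIRST_RAW_CHAR; c <= LAST_RAW_CHAR; c++)
    if (leading_code_current (c) != leading_code_patched (c))
      mismatches++;
  return mismatches;
}
```

If the assumptions hold, `count_leading_code_mismatches ()` returns 0, which would mean the patch is a no-op for every raw byte.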




* Re: possible bugfix for CHAR_LEADING_CODE() for multibyte raw bytes
  2024-04-28 14:20 ` Eli Zaretskii
@ 2024-07-27 12:27   ` Danny McClanahan
  0 siblings, 0 replies; 3+ messages in thread
From: Danny McClanahan @ 2024-07-27 12:27 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

> On Sunday, April 28th, 2024 at 10:20, Eli Zaretskii <eliz@gnu.org> wrote:
> 
> Does the change you propose actually affect the result? The
> "((c >> 6) & 0x01)" part takes just the LSB of (c >> 6), and
> that doesn't seem to be affected by running c through
> CHAR_TO_BYTE8, does it?
> 
> The leading byte of a multibyte sequence for raw bytes is either 0xC0
> or 0xC1. That's what that part is trying to compute. Am I missing
> something?

I think you're right! I was very narrowly focused on the text of the code and not the bits it was computing: it's definitely clear how that's computing just the LSB there. I'm still getting more familiar with bit-level manipulation, but I think that's clear enough now. I made this change in a branch and it hasn't changed any behavior AFAICT.

Sorry for missing this reply--thanks so much for your thorough follow-up here!


Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/emacs.git
