bug#53260: char-syntax differs in interpreter and bytecode

unofficial mirror of bug-gnu-emacs@gnu.org 
 help / color / mirror / code / Atom feed

* bug#53260: char-syntax differs in interpreter and bytecode
@ 2022-01-14 16:43 Mattias Engdegård
  2022-01-15  8:36 ` Lars Ingebrigtsen
  0 siblings, 1 reply; 9+ messages in thread
From: Mattias Engdegård @ 2022-01-14 16:43 UTC (permalink / raw)
  To: 53260

Fchar_syntax and the bytecode Bchar_syntax differ:

Fchar_syntax calls SETUP_BUFFER_SYNTAX_TABLE. Bchar_syntax does not.
Bchar_syntax converts arguments to multibyte. Fchar_syntax does not.

The last property can be used to get different behaviour:

(let ((cs (byte-compile (lambda (x) (char-syntax x)))))
  (with-temp-buffer
    (let ((st (make-syntax-table)))
      (set-buffer-multibyte nil)
      (modify-syntax-entry 128 "_" st)
      (set-syntax-table st)
      (list (funcall cs 128) (char-syntax 128)))))
-> (119 95)

Not sure how to expose the presence or absence of SETUP_BUFFER_SYNTAX_TABLE. Suggestions?

And, most importantly, what would be the correct code?

(I suppose char-syntax is rare enough that we could call Fchar_syntax from Bchar_syntax and thus avoid any future divergence.)

^ permalink raw reply	[flat|nested] 9+ messages in thread

* bug#53260: char-syntax differs in interpreter and bytecode
  2022-01-14 16:43 bug#53260: char-syntax differs in interpreter and bytecode Mattias Engdegård
@ 2022-01-15  8:36 ` Lars Ingebrigtsen
  2022-01-15 14:46   ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors
  0 siblings, 1 reply; 9+ messages in thread
From: Lars Ingebrigtsen @ 2022-01-15  8:36 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: 53260, Stefan Monnier

Mattias Engdegård <mattiase@acm.org> writes:

> Fchar_syntax and the bytecode Bchar_syntax differ:
>
> Fchar_syntax calls SETUP_BUFFER_SYNTAX_TABLE. Bchar_syntax does not.
> Bchar_syntax converts arguments to multibyte. Fchar_syntax does not.

[...]

> And, most importantly, what would be the correct code?

Hm.  Perhaps Stefan has an opinion; added to the CCs.

> (I suppose char-syntax is rare enough that we could call Fchar_syntax
> from Bchar_syntax and thus avoid any future divergence.)

Used 172 times in-core, which isn't that rare...

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no





^ permalink raw reply	[flat|nested] 9+ messages in thread

* bug#53260: char-syntax differs in interpreter and bytecode
  2022-01-15  8:36 ` Lars Ingebrigtsen
@ 2022-01-15 14:46   ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors
  2022-01-15 17:29     ` Mattias Engdegård
  0 siblings, 1 reply; 9+ messages in thread
From: Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors @ 2022-01-15 14:46 UTC (permalink / raw)
  To: Lars Ingebrigtsen; +Cc: Mattias Engdegård, 53260

Lars Ingebrigtsen [2022-01-15 09:36:16] wrote:
> Mattias Engdegård <mattiase@acm.org> writes:
>> Fchar_syntax and the bytecode Bchar_syntax differ:
>> Fchar_syntax calls SETUP_BUFFER_SYNTAX_TABLE. Bchar_syntax does not.
>> Bchar_syntax converts arguments to multibyte.  Fchar_syntax does not.
> [...]
>> And, most importantly, what would be the correct code?
> Hm.  Perhaps Stefan has an opinion; added to the CCs.

My past opinion is in its docstring:

    If you’re trying to determine the syntax of characters in the buffer,
    this is probably the wrong function to use, because it can’t take
    ‘syntax-table’ text properties into account.  Consider using
    ‘syntax-after’ instead.

The "can't" is because `char-syntax` doesn't know where the char comes from.

>> (I suppose char-syntax is rare enough that we could call Fchar_syntax
>> from Bchar_syntax and thus avoid any future divergence.)
> Used 172 times in-core, which isn't that rare...

I think he meant "rare" w.r.t dynamic count rather than static count.


        Stefan






^ permalink raw reply	[flat|nested] 9+ messages in thread

* bug#53260: char-syntax differs in interpreter and bytecode
  2022-01-15 14:46   ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors
@ 2022-01-15 17:29     ` Mattias Engdegård
  2022-01-15 17:57       ` Eli Zaretskii
  2022-01-15 22:51       ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors
  0 siblings, 2 replies; 9+ messages in thread
From: Mattias Engdegård @ 2022-01-15 17:29 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: Lars Ingebrigtsen, 53260

15 jan. 2022 kl. 15.46 skrev Stefan Monnier <monnier@iro.umontreal.ca>:

>    If you’re trying to determine the syntax of characters in the buffer,
>    this is probably the wrong function to use, because it can’t take
>    ‘syntax-table’ text properties into account.  Consider using
>    ‘syntax-after’ instead.
> 
> The "can't" is because `char-syntax` doesn't know where the char comes from.

This is true and it leaves a narrower use for `char-syntax` in mode-specific code -- ie, when syntax-table text properties do not need to be taken into account.

I propose we do the following:

1. Remove SETUP_BUFFER_SYNTAX_TABLE() from Fchar_syntax because as far as I can tell it has no effect at all.

2. Remove make_char_multibyte(c) from Bchar_syntax because it seems to be the wrong thing to do: in a unibyte buffer, wouldn't the syntax table be indexed by byte value (so that char 255 in the buffer corresponds to entry 255 in the syntax table rather than entry 0x3fffff)?

3. Now both implementations are identical. Replace the one in the byte-code interpreter with a call to Fchar_syntax.

> I think he meant "rare" w.r.t dynamic count rather than static count.

Yes, that's right.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* bug#53260: char-syntax differs in interpreter and bytecode
  2022-01-15 17:29     ` Mattias Engdegård
@ 2022-01-15 17:57       ` Eli Zaretskii
  2022-01-15 22:51       ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors
  1 sibling, 0 replies; 9+ messages in thread
From: Eli Zaretskii @ 2022-01-15 17:57 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: larsi, 53260, monnier

> From: Mattias Engdegård <mattiase@acm.org>
> Date: Sat, 15 Jan 2022 18:29:41 +0100
> Cc: Lars Ingebrigtsen <larsi@gnus.org>, 53260@debbugs.gnu.org
> 
> 2. Remove make_char_multibyte(c) from Bchar_syntax because it seems to be the wrong thing to do: in a unibyte buffer, wouldn't the syntax table be indexed by byte value (so that char 255 in the buffer corresponds to entry 255 in the syntax table rather than entry 0x3fffff)?

I don't think we want to support unibyte buffers which have some text
that is syntactically significant.  A unibyte buffer is just a stream
of raw bytes, they are not characters.





^ permalink raw reply	[flat|nested] 9+ messages in thread

* bug#53260: char-syntax differs in interpreter and bytecode
  2022-01-15 17:29     ` Mattias Engdegård
  2022-01-15 17:57       ` Eli Zaretskii
@ 2022-01-15 22:51       ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors
  2022-01-16 11:04         ` bug#53260: char-syntax differs in interpreter and bytecode [PATCH] Mattias Engdegård
  1 sibling, 1 reply; 9+ messages in thread
From: Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors @ 2022-01-15 22:51 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: Lars Ingebrigtsen, 53260

> 1. Remove SETUP_BUFFER_SYNTAX_TABLE() from Fchar_syntax because as far as
> I can tell it has no effect at all.

Sounds good.

> 2. Remove make_char_multibyte(c) from Bchar_syntax because it seems to be
> the wrong thing to do: in a unibyte buffer, wouldn't the syntax table be
> indexed by byte value (so that char 255 in the buffer corresponds to entry
> 255 in the syntax table rather than entry 0x3fffff)?

Doesn't sound right: char tables are indexed by chars (i.e. Unicode code
points) not by bytes, so we need to convert the byte into a char
before indexing.

> 3. Now both implementations are identical. Replace the one in the byte-code
> interpreter with a call to Fchar_syntax.

Sounds good.


        Stefan






^ permalink raw reply	[flat|nested] 9+ messages in thread

* bug#53260: char-syntax differs in interpreter and bytecode [PATCH]
  2022-01-15 22:51       ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors
@ 2022-01-16 11:04         ` Mattias Engdegård
  2022-01-20  9:30           ` Lars Ingebrigtsen
  0 siblings, 1 reply; 9+ messages in thread
From: Mattias Engdegård @ 2022-01-16 11:04 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: Lars Ingebrigtsen, 53260

[-- Attachment #1: Type: text/plain, Size: 465 bytes --]

15 jan. 2022 kl. 23.51 skrev Stefan Monnier <monnier@iro.umontreal.ca>:

> Doesn't sound right: char tables are indexed by chars (i.e. Unicode code
> points) not by bytes, so we need to convert the byte into a char
> before indexing.

Sure, I'm happy to do it either way. Chars retrieved from unibyte buffers or strings really should be converted to multibyte before used with char-syntax; unibyte buffers are not very common but strings slightly more so.


[-- Attachment #2: 0001-Fix-Fchar_syntax-for-non-ASCII-in-unibyte-buffers.patch --]
[-- Type: application/octet-stream, Size: 3163 bytes --]

From 2adb5c862232abf126f73d0aa514f5ae8b6babba Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Mattias=20Engdeg=C3=A5rd?= <mattiase@acm.org>
Date: Sun, 16 Jan 2022 11:58:00 +0100
Subject: [PATCH] Fix Fchar_syntax for non-ASCII in unibyte buffers

Fchar_syntax did not convert unibyte characters to multibyte when the
current buffer was unibyte, in contrast to `char-syntax` in
byte-compiled code (bug#53260).

* src/bytecode.c (exec_byte_code): Call out to Fchar_syntax;
the dynamic frequency is too low to justify inlining here, and it
did lead to implementations diverging.
* src/syntax.c (Fchar_syntax): Convert non-ASCII unibyte values to
multibyte.  Remove useless SETUP_BUFFER_SYNTAX_TABLE which has no
effect here.
* test/src/syntax-tests.el (syntax-char-syntax): New test.
---
 src/bytecode.c           |  8 +-------
 src/syntax.c             |  6 +++---
 test/src/syntax-tests.el | 15 +++++++++++++++
 3 files changed, 19 insertions(+), 10 deletions(-)

diff --git a/src/bytecode.c b/src/bytecode.c
index 472992be18..b7e65d05ae 100644
--- a/src/bytecode.c
+++ b/src/bytecode.c
@@ -1167,13 +1167,7 @@ #define DEFINE(name, value) LABEL (name) ,
 	  NEXT;
 
 	CASE (Bchar_syntax):
-	  {
-	    CHECK_CHARACTER (TOP);
-	    int c = XFIXNAT (TOP);
-	    if (NILP (BVAR (current_buffer, enable_multibyte_characters)))
-	      c = make_char_multibyte (c);
-	    XSETFASTINT (TOP, syntax_code_spec[SYNTAX (c)]);
-	  }
+	  TOP = Fchar_syntax (TOP);
 	  NEXT;
 
 	CASE (Bbuffer_substring):
diff --git a/src/syntax.c b/src/syntax.c
index 9df878b8ed..c1e81dfa47 100644
--- a/src/syntax.c
+++ b/src/syntax.c
@@ -1101,10 +1101,10 @@ DEFUN ("char-syntax", Fchar_syntax, Schar_syntax, 1, 1, 0,
 `syntax-after' instead.  */)
   (Lisp_Object character)
 {
-  int char_int;
   CHECK_CHARACTER (character);
-  char_int = XFIXNUM (character);
-  SETUP_BUFFER_SYNTAX_TABLE ();
+  int char_int = XFIXNAT (character);
+  if (NILP (BVAR (current_buffer, enable_multibyte_characters)))
+    char_int = make_char_multibyte (char_int);
   return make_fixnum (syntax_code_spec[SYNTAX (char_int)]);
 }
 
diff --git a/test/src/syntax-tests.el b/test/src/syntax-tests.el
index 3b9f21cde3..501b5e067f 100644
--- a/test/src/syntax-tests.el
+++ b/test/src/syntax-tests.el
@@ -506,4 +506,19 @@ test-from-to-parse-partial-sexp
     (should (parse-partial-sexp 1 1))
     (should-error (parse-partial-sexp 2 1))))
 
+(ert-deftest syntax-char-syntax ()
+  ;; Verify that char-syntax behaves identically in interpreted and
+  ;; byte-compiled code (bug#53260).
+  (let ((cs (byte-compile (lambda (x) (char-syntax x)))))
+    ;; Use a unibyte buffer with a syntax table using symbol syntax
+    ;; for raw byte 128.
+    (with-temp-buffer
+      (set-buffer-multibyte nil)
+      (let ((st (make-syntax-table)))
+        (modify-syntax-entry (unibyte-char-to-multibyte 128) "_" st)
+        (set-syntax-table st)
+        (should (equal (char-syntax 128) ?_))
+        (should (equal (funcall cs 128) ?_))))
+    (list (char-syntax 128) (funcall cs 128))))
+
 ;;; syntax-tests.el ends here
-- 
2.32.0 (Apple Git-132)


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* bug#53260: char-syntax differs in interpreter and bytecode [PATCH]
  2022-01-16 11:04         ` bug#53260: char-syntax differs in interpreter and bytecode [PATCH] Mattias Engdegård
@ 2022-01-20  9:30           ` Lars Ingebrigtsen
  2022-01-20 10:47             ` Mattias Engdegård
  0 siblings, 1 reply; 9+ messages in thread
From: Lars Ingebrigtsen @ 2022-01-20  9:30 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: 53260, Stefan Monnier

Mattias Engdegård <mattiase@acm.org> writes:

> Sure, I'm happy to do it either way. Chars retrieved from unibyte
> buffers or strings really should be converted to multibyte before used
> with char-syntax; unibyte buffers are not very common but strings
> slightly more so.

Makes sense to me.  Unless Stefan has any further comments, please go
ahead and push.

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no





^ permalink raw reply	[flat|nested] 9+ messages in thread

* bug#53260: char-syntax differs in interpreter and bytecode [PATCH]
  2022-01-20  9:30           ` Lars Ingebrigtsen
@ 2022-01-20 10:47             ` Mattias Engdegård
  0 siblings, 0 replies; 9+ messages in thread
From: Mattias Engdegård @ 2022-01-20 10:47 UTC (permalink / raw)
  To: Lars Ingebrigtsen; +Cc: Stefan Monnier, 53260-done

20 jan. 2022 kl. 10.30 skrev Lars Ingebrigtsen <larsi@gnus.org>:

> Makes sense to me.  Unless Stefan has any further comments, please go
> ahead and push.

Thank you, pushed with a necessary modification: SETUP_BUFFER_SYNTAX_TABLE() is indeed necessary in Fchar_syntax because syntax.c has its own local #define SYNTAX() and doesn't use the one in syntax.h. Lovely.






^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2022-01-20 10:47 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-01-14 16:43 bug#53260: char-syntax differs in interpreter and bytecode Mattias Engdegård
2022-01-15  8:36 ` Lars Ingebrigtsen
2022-01-15 14:46   ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors
2022-01-15 17:29     ` Mattias Engdegård
2022-01-15 17:57       ` Eli Zaretskii
2022-01-15 22:51       ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors
2022-01-16 11:04         ` bug#53260: char-syntax differs in interpreter and bytecode [PATCH] Mattias Engdegård
2022-01-20  9:30           ` Lars Ingebrigtsen
2022-01-20 10:47             ` Mattias Engdegård

Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/emacs.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).