* bug#70988: (read FUNCTION) uses Latin-1 [PATCH]
@ 2024-05-16 18:13 Mattias Engdegård
2024-05-16 18:47 ` Eli Zaretskii
0 siblings, 1 reply; 8+ messages in thread
From: Mattias Engdegård @ 2024-05-16 18:13 UTC (permalink / raw)
To: 70988; +Cc: Stefan Monnier
[-- Attachment #1: Type: text/plain, Size: 606 bytes --]
When `read` is called with a function as stream argument, the return values of that function are often interpreted as Latin-1 characters with only the 8 low bits used. Example:
(let* ((next '(?A #x12a nil))
(f (lambda (&rest args)
(if args
(push (car args) next)
(pop next)))))
(read f))
=> A* ; expected: AĪ
This is a result of `readchar` setting *multibyte to 0 on this code path.
The reader is not very consistent: inside string and character literals, the code seems to work as expected.
The fix is straightforward (attached).
[-- Attachment #2: read-from-function.diff --]
[-- Type: application/octet-stream, Size: 1404 bytes --]
diff --git a/src/lread.c b/src/lread.c
index c92b2ede932..2626272c4e2 100644
--- a/src/lread.c
+++ b/src/lread.c
@@ -422,6 +422,8 @@ readchar (Lisp_Object readcharfun, bool *multibyte)
goto read_multibyte;
}
+ if (multibyte)
+ *multibyte = 1;
tem = call0 (readcharfun);
if (NILP (tem))
diff --git a/test/src/lread-tests.el b/test/src/lread-tests.el
index cc17f7eb3fa..41c9256a9bf 100644
--- a/test/src/lread-tests.el
+++ b/test/src/lread-tests.el
@@ -387,4 +387,19 @@ lread-skip-to-eof
(goto-char (point-min))
(should-error (read (current-buffer)) :type 'end-of-file)))
+(ert-deftest lread-from-function ()
+ ;; Test reading from a stream defined by a function.
+ (let ((make-reader (lambda (chars)
+ (lambda (&rest args)
+ (if args
+ (push (car args) chars)
+ (pop chars))))))
+ (dolist (seq '((?A ?B) (?E ?ä ?ÿ) (?A ?Ω) (?* ?☃) (?a #o303 #o245 ?b)))
+ (let ((str (apply #'string seq)))
+ (should (eq (read (funcall make-reader seq)) (intern str)))
+ (let ((quoted-seq `(?\" ,@seq ?\")))
+ (should (equal (read (funcall make-reader quoted-seq)) str)))))
+ (dolist (c '(?A ?ä ?ÿ ?Ω ?☃))
+ (should (eq (read (funcall make-reader `(?? ,c))) c)))))
+
;;; lread-tests.el ends here
^ permalink raw reply related [flat|nested] 8+ messages in thread
* bug#70988: (read FUNCTION) uses Latin-1 [PATCH]
2024-05-16 18:13 bug#70988: (read FUNCTION) uses Latin-1 [PATCH] Mattias Engdegård
@ 2024-05-16 18:47 ` Eli Zaretskii
2024-05-16 19:45 ` Mattias Engdegård
0 siblings, 1 reply; 8+ messages in thread
From: Eli Zaretskii @ 2024-05-16 18:47 UTC (permalink / raw)
To: Mattias Engdegård; +Cc: 70988, monnier
> Cc: Stefan Monnier <monnier@iro.umontreal.ca>
> From: Mattias Engdegård <mattias.engdegard@gmail.com>
> Date: Thu, 16 May 2024 20:13:18 +0200
>
> When `read` is called with a function as stream argument, the return values of that function are often interpreted as Latin-1 characters with only the 8 low bits used. Example:
>
> (let* ((next '(?A #x12a nil))
> (f (lambda (&rest args)
> (if args
> (push (car args) next)
> (pop next)))))
> (read f))
> => A* ; expected: AĪ
>
> This is a result of `readchar` setting *multibyte to 0 on this code path.
When is this situation relevant? How many uses of
function-as-a-stream are there out there?
In general, I wouldn't touch these rare cases with a 3-mile pole. The
gain is generally very small (satisfaction from some abstract sense of
correctness aside), while the risk to break some code is usually high.
It is better to document this behavior and move on.
> The fix is straightforward (attached).
>
> diff --git a/src/lread.c b/src/lread.c
> index c92b2ede932..2626272c4e2 100644
> --- a/src/lread.c
> +++ b/src/lread.c
> @@ -422,6 +422,8 @@ readchar (Lisp_Object readcharfun, bool *multibyte)
> goto read_multibyte;
> }
>
> + if (multibyte)
> + *multibyte = 1;
> tem = call0 (readcharfun);
Is it an accident that the code does the same only _after_ the call to
readbyte?
^ permalink raw reply [flat|nested] 8+ messages in thread
* bug#70988: (read FUNCTION) uses Latin-1 [PATCH]
2024-05-16 18:47 ` Eli Zaretskii
@ 2024-05-16 19:45 ` Mattias Engdegård
2024-05-16 19:54 ` Eli Zaretskii
0 siblings, 1 reply; 8+ messages in thread
From: Mattias Engdegård @ 2024-05-16 19:45 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: 70988, monnier
16 maj 2024 kl. 20.47 skrev Eli Zaretskii <eliz@gnu.org>:
> When is this situation relevant? How many uses of
> function-as-a-stream are there out there?
Not many is my guess, which is perhaps why it wasn't found before.
I'm doing some performance work on the reader, and quirks in the code like these become obvious.
> Is it an accident that the code does the same only _after_ the call to
> readbyte?
Yes, I have no reason to believe otherwise.
^ permalink raw reply [flat|nested] 8+ messages in thread
* bug#70988: (read FUNCTION) uses Latin-1 [PATCH]
2024-05-16 19:45 ` Mattias Engdegård
@ 2024-05-16 19:54 ` Eli Zaretskii
2024-05-17 8:08 ` Mattias Engdegård
0 siblings, 1 reply; 8+ messages in thread
From: Eli Zaretskii @ 2024-05-16 19:54 UTC (permalink / raw)
To: Mattias Engdegård; +Cc: 70988, monnier
> From: Mattias Engdegård <mattias.engdegard@gmail.com>
> Date: Thu, 16 May 2024 21:45:56 +0200
> Cc: 70988@debbugs.gnu.org,
> monnier@iro.umontreal.ca
>
> > Is it an accident that the code does the same only _after_ the call to
> > readbyte?
>
> Yes, I have no reason to believe otherwise.
To me, it actually looks as done on purpose.
^ permalink raw reply [flat|nested] 8+ messages in thread
* bug#70988: (read FUNCTION) uses Latin-1 [PATCH]
2024-05-16 19:54 ` Eli Zaretskii
@ 2024-05-17 8:08 ` Mattias Engdegård
2024-05-17 10:45 ` Eli Zaretskii
0 siblings, 1 reply; 8+ messages in thread
From: Mattias Engdegård @ 2024-05-17 8:08 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: 70988, monnier
16 maj 2024 kl. 21.54 skrev Eli Zaretskii <eliz@gnu.org>:
>>> Is it an accident that the code does the same only _after_ the call to
>>> readbyte?
>>
>> Yes, I have no reason to believe otherwise.
>
> To me, it actually looks as done on purpose.
You could very well be right about that. What I meant is that the order doesn't matter at all.
^ permalink raw reply [flat|nested] 8+ messages in thread
* bug#70988: (read FUNCTION) uses Latin-1 [PATCH]
2024-05-17 8:08 ` Mattias Engdegård
@ 2024-05-17 10:45 ` Eli Zaretskii
2024-05-17 17:08 ` Mattias Engdegård
0 siblings, 1 reply; 8+ messages in thread
From: Eli Zaretskii @ 2024-05-17 10:45 UTC (permalink / raw)
To: Mattias Engdegård; +Cc: 70988, monnier
> From: Mattias Engdegård <mattias.engdegard@gmail.com>
> Date: Fri, 17 May 2024 10:08:58 +0200
> Cc: 70988@debbugs.gnu.org,
> monnier@iro.umontreal.ca
>
> 16 maj 2024 kl. 21.54 skrev Eli Zaretskii <eliz@gnu.org>:
>
> >>> Is it an accident that the code does the same only _after_ the call to
> >>> readbyte?
> >>
> >> Yes, I have no reason to believe otherwise.
> >
> > To me, it actually looks as done on purpose.
>
> You could very well be right about that. What I meant is that the order doesn't matter at all.
Doesn't it affect what the readbyte call does?
^ permalink raw reply [flat|nested] 8+ messages in thread
* bug#70988: (read FUNCTION) uses Latin-1 [PATCH]
2024-05-17 10:45 ` Eli Zaretskii
@ 2024-05-17 17:08 ` Mattias Engdegård
2024-05-30 15:43 ` Mattias Engdegård
0 siblings, 1 reply; 8+ messages in thread
From: Mattias Engdegård @ 2024-05-17 17:08 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: 70988, monnier
17 maj 2024 kl. 12.45 skrev Eli Zaretskii <eliz@gnu.org>:
>>>>> Is it an accident that the code does the same only _after_ the call to
>>>>> readbyte?
>>>>
>>>> Yes, I have no reason to believe otherwise.
>>>
>>> To me, it actually looks as done on purpose.
>>
>> You could very well be right about that. What I meant is that the order doesn't matter at all.
>
> Doesn't it affect what the readbyte call does?
No -- the `*multibyte = ...` assignment is just an extra return value, which indicates whether the returned values come from a unibyte or multibyte source. For any given source (READCHARFUN, in the terminology of lread.c), the characters will all be unibyte or multibyte, so this returned `multibyte` flag will typically only be used once by the caller and saved for future reference.
But you are right to question it because lread.c is a royal mess and many changes have not been made in a clean way. It is unclear whether it's worth returning the `multibyte` flag at all; it's only used in special cases.
^ permalink raw reply [flat|nested] 8+ messages in thread
* bug#70988: (read FUNCTION) uses Latin-1 [PATCH]
2024-05-17 17:08 ` Mattias Engdegård
@ 2024-05-30 15:43 ` Mattias Engdegård
0 siblings, 0 replies; 8+ messages in thread
From: Mattias Engdegård @ 2024-05-30 15:43 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: 70988, monnier
After looking further into the Lisp reader/printer I found two more silent Latin-1 assumptions. In all three cases, I firmly believe the following to be true:
* The behaviour is not intended but just code accidents.
* They should hardly affect any user code at all.
* They are nevertheless clear bugs which should be fixed.
Further on this will have to wait until after Emacs 30 has been branched to avoid delaying that more important task.
^ permalink raw reply [flat|nested] 8+ messages in thread
end of thread, other threads:[~2024-05-30 15:43 UTC | newest]
Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-05-16 18:13 bug#70988: (read FUNCTION) uses Latin-1 [PATCH] Mattias Engdegård
2024-05-16 18:47 ` Eli Zaretskii
2024-05-16 19:45 ` Mattias Engdegård
2024-05-16 19:54 ` Eli Zaretskii
2024-05-17 8:08 ` Mattias Engdegård
2024-05-17 10:45 ` Eli Zaretskii
2024-05-17 17:08 ` Mattias Engdegård
2024-05-30 15:43 ` Mattias Engdegård
Code repositories for project(s) associated with this public inbox
https://git.savannah.gnu.org/cgit/emacs.git
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).