From: Joseph Turner <joseph@ushin.org>
To: Eli Zaretskii <eliz@gnu.org>
Cc: emacs-devel@gnu.org, schwab@suse.de, adam@alphapapa.net
Subject: Re: How to get buffer byte length (not number of characters)?
Date: Thu, 22 Aug 2024 00:24:45 -0700 [thread overview]
Message-ID: <87ed6hdnpe.fsf@ushin.org> (raw)
In-Reply-To: <86plq1td4n.fsf@gnu.org> (Eli Zaretskii's message of "Thu, 22 Aug 2024 07:06:32 +0300")
[-- Attachment #1: Type: text/plain, Size: 4002 bytes --]
Eli Zaretskii <eliz@gnu.org> writes:
>> From: Joseph Turner <joseph@ushin.org>
>> Cc: emacs-devel@gnu.org, schwab@suse.de, adam@alphapapa.net
>> Date: Wed, 21 Aug 2024 16:52:39 -0700
>>
>> Eli Zaretskii <eliz@gnu.org> writes:
>>
>> >> Currently, plz.el always creates the curl subprocess like so:
>> >>
>> >> (make-process :coding 'binary ...)
>> >>
>> >> https://git.savannah.gnu.org/cgit/emacs/elpa.git/tree/plz.el?h=externals-release/plz#n519
>> >>
>> >> Does this DTRT?
>> >
>> > It could be TRT if plz.el encodes the buffer text "by hand" before
>> > sending the results to curl and decodes it when it receives text from
>> > curl. Which I think is what happens there.
>>
>> plz.el does not manually encode buffer text *within Emacs* when sending
>> requests to curl, but by default, plz.el sends data to curl with --data,
>> which tells curl to strip CR and newlines. With the :body-type 'binary
>> argument, plz.el instead uses --data-binary, which does no conversion.
>
> Newlines is a relatively minor issue (although it, too, needs to be
> considered). My main concern is with the text encoding. How can it
> be TRT to use 'binary when sending buffer text to curl? that would
> mean we are more-or-less always sending the internal representation of
> characters, which is superset of UTF-8. If the data was originally
> encoded in anything but UTF-8, reading it into Emacs and then sending
> it back will change the byte sequences from that other encoding to
> UTF-8. Moreover, 'binary does not guarantee that the result is valid
> UTF-8.
>
> So maybe I misunderstand how these plz.el facilities are used, but up
> front this sounds like a mistake.
It could be. Eli, Adam, what do you think about the default coding
systems for encoding the request body in the attached patch?
>> We don't want to strip newlines from hyperdrive files, so we always use
>> :body-type 'binary when sending buffer contents. Should hyperdrive.el
>> encode data with `buffer-file-coding-system' before passing to plz.el?
>
> I would think so, but maybe we should bring the plz.el developers on
> board of this discussion.
I've CC'd Adam.
>> When receiving text from curl, plz.el optionally decodes the text
>> according to the charset in the 'Content-Type' header, e.g., "text/html;
>> charset=utf-8" or utf-8 if no charset is found.
>
> By "optionally" you mean that it doesn't always happen, except if the
> caller requests that? If so, the caller of plz.el should decode the
> text manually before using it in user-facing features.
By default, `plz' decodes response body according to the 'Content-Type'
charset (or utf-8 as fallback). Passing `:decode nil' stops that.
>> Perhaps hyperdrive.el should check the 'Content-Type' header charset,
>> then fallback to guessing the coding system based on filename and file
>> contents with `set-auto-coding' (to avoid decoding images, etc.), and
>> then finally fallback to something else?
>
> Probably. But then I don't know anything about hyperdrive.el, either.
> If it copies text between files or URLs without showing it to the
> user, then the best strategy is indeed not to decode and encode stuff,
> but handle it as a stream of raw bytes. (In that case, my suggestion
> would be to use unibyte buffers and strings for temporarily storing
> and processing these raw bytes in Emacs.) But if the text is somehow
> shown to the user, it must be decoded to be displayed correctly by
> Emacs. And then it must be encoded back when writing it back to the
> external storage.
Thanks! Good to know about unibyte buffers and strings for that.
hyperdrive.el does show text to the user, so we'll likely do something
like what I described above. What fallback encoding should we use if
there's no 'Content-Type' charset and `set-auto-coding' returns nil?
IIUC, there's no foolproof way to guess the encoding of unknown bytes.
default-file-name-coding-system?
Thank you!! I feel more solid in my understanding of encodings now.
Joseph
[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 0001-Add-plz-BODY-CODING-argument-Add-default-encoding.patch --]
[-- Type: text/x-diff, Size: 3377 bytes --]
From a684ff680ab05f359b628623159b4d3392eb448e Mon Sep 17 00:00:00 2001
From: Joseph Turner <joseph@breatheoutbreathe.in>
Date: Thu, 22 Aug 2024 00:02:19 -0700
Subject: [PATCH] Add: (plz) BODY-CODING argument; Add default encoding
Previously, strings and buffers were sent to curl as the internal
Emacs representation. Now strings and buffers are encoded, and the
BODY-CODING argument can used to override the default coding systems.
---
plz.el | 29 ++++++++++++++++++++++++++---
1 file changed, 26 insertions(+), 3 deletions(-)
diff --git a/plz.el b/plz.el
index 903d71e..91d41d2 100644
--- a/plz.el
+++ b/plz.el
@@ -323,7 +323,7 @@ (defalias 'plz--generate-new-buffer
;;;;; Public
-(cl-defun plz (method url &rest rest &key headers body else filter finally noquery timeout
+(cl-defun plz (method url &rest rest &key headers body else filter finally noquery timeout body-coding
(as 'string) (then 'sync)
(body-type 'text) (decode t decode-s)
(connect-timeout plz-connect-timeout))
@@ -340,6 +340,13 @@ (cl-defun plz (method url &rest rest &key headers body else filter finally noque
BODY-TYPE may be `text' to send BODY as text, or `binary' to send
it as binary.
+BODY-CODING may a coding system used to encode BODY before
+passing it to curl. BODY-CODING has no effect when BODY is a
+list like `(file FILENAME)'. If nil and BODY is a string, the
+default process I/O output coding system is used. If nil and
+BODY is a buffer, the buffer-local value of
+`buffer-file-coding-system' is used.
+
AS selects the kind of result to pass to the callback function
THEN, or the kind of result to return for synchronous requests.
It may be:
@@ -416,6 +423,19 @@ (cl-defun plz (method url &rest rest &key headers body else filter finally noque
(declare (indent defun))
(setf decode (if (and decode-s (not decode))
nil decode))
+ (unless body-coding
+ (pcase-exhaustive body
+ (`(file ,filename)
+ ;; Don't set BODY-CODING; files are passed as-is to curl.
+ (setf body-coding nil))
+ ((pred stringp)
+ ;; Use default output coding for processes.
+ (setf body-coding (cdr default-process-coding-system)))
+ ((and (pred bufferp) buffer)
+ ;; Use buffer-local coding.
+ (setf body-coding
+ (buffer-local-value 'buffer-file-coding-system buffer)))))
+
;; NOTE: By default, for PUT requests and POST requests >1KB, curl sends an
;; "Expect:" header, which causes servers to send a "100 Continue" response, which
;; we don't want to have to deal with, so we disable it by setting the header to
@@ -553,8 +573,11 @@ (cl-defun plz (method url &rest rest &key headers body else filter finally noque
(process-send-string process curl-config)
(when body
(cl-typecase body
- (string (process-send-string process body))
- (buffer (with-current-buffer body
+ (string (process-send-string
+ process (encode-coding-string body body-coding t)))
+ (buffer (with-temp-buffer
+ (insert-buffer-substring-no-properties body)
+ (encode-coding-region (point-min) (point-max) body-coding)
(process-send-region process (point-min) (point-max))))))
(process-send-eof process)
(if sync-p
--
2.41.0
next prev parent reply other threads:[~2024-08-22 7:24 UTC|newest]
Thread overview: 40+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-08-20 7:10 How to get buffer byte length (not number of characters)? Joseph Turner
2024-08-20 7:51 ` Joseph Turner
2024-08-20 11:20 ` Eli Zaretskii
2024-08-20 11:15 ` Eli Zaretskii
2024-08-21 9:20 ` Joseph Turner
2024-08-21 17:47 ` Eli Zaretskii
2024-08-21 23:52 ` Joseph Turner
2024-08-22 4:06 ` Eli Zaretskii
2024-08-22 7:24 ` Joseph Turner [this message]
2024-08-22 11:04 ` Eli Zaretskii
2024-08-22 18:29 ` Joseph Turner
2024-08-22 18:44 ` Eli Zaretskii
2024-08-22 19:32 ` tomas
2024-08-23 3:56 ` Joseph Turner
2024-08-23 7:02 ` Eli Zaretskii
2024-08-23 7:37 ` Joseph Turner
2024-08-23 12:34 ` Eli Zaretskii
2024-08-23 7:43 ` Joseph Turner
2024-08-23 12:38 ` Eli Zaretskii
2024-08-23 16:59 ` Joseph Turner
2024-08-23 17:35 ` Eli Zaretskii
2024-08-23 20:37 ` Joseph Turner
2024-08-24 6:14 ` Joseph Turner
2024-08-22 12:26 ` Adam Porter
2024-08-22 12:47 ` tomas
2024-08-23 6:28 ` Adam Porter
2024-08-22 13:50 ` Eli Zaretskii
2024-08-23 6:31 ` Adam Porter
2024-08-23 6:51 ` Eli Zaretskii
2024-08-23 7:07 ` Joseph Turner
2024-08-23 7:58 ` Joseph Turner
2024-08-22 7:09 ` Andreas Schwab
2024-08-22 7:30 ` Joseph Turner
2024-08-22 11:05 ` Eli Zaretskii
2024-08-26 6:37 ` Joseph Turner
2024-08-26 6:49 ` Joseph Turner
2024-08-26 11:22 ` Eli Zaretskii
2024-08-27 4:48 ` Joseph Turner
2024-08-26 11:20 ` Eli Zaretskii
2024-08-20 11:24 ` Andreas Schwab
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=87ed6hdnpe.fsf@ushin.org \
--to=joseph@ushin.org \
--cc=adam@alphapapa.net \
--cc=eliz@gnu.org \
--cc=emacs-devel@gnu.org \
--cc=schwab@suse.de \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
Code repositories for project(s) associated with this external index
https://git.savannah.gnu.org/cgit/emacs.git
https://git.savannah.gnu.org/cgit/emacs/org-mode.git
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.