all messages for Emacs-related lists mirrored at yhetil.org
 help / color / mirror / code / Atom feed
From: Joseph Turner <joseph@ushin.org>
To: Eli Zaretskii <eliz@gnu.org>
Cc: emacs-devel@gnu.org,  schwab@suse.de,  adam@alphapapa.net
Subject: Re: How to get buffer byte length (not number of characters)?
Date: Thu, 22 Aug 2024 00:24:45 -0700	[thread overview]
Message-ID: <87ed6hdnpe.fsf@ushin.org> (raw)
In-Reply-To: <86plq1td4n.fsf@gnu.org> (Eli Zaretskii's message of "Thu, 22 Aug 2024 07:06:32 +0300")

[-- Attachment #1: Type: text/plain, Size: 4002 bytes --]

Eli Zaretskii <eliz@gnu.org> writes:

>> From: Joseph Turner <joseph@ushin.org>
>> Cc: emacs-devel@gnu.org,  schwab@suse.de,  adam@alphapapa.net
>> Date: Wed, 21 Aug 2024 16:52:39 -0700
>> 
>> Eli Zaretskii <eliz@gnu.org> writes:
>> 
>> >> Currently, plz.el always creates the curl subprocess like so:
>> >> 
>> >> (make-process :coding 'binary ...)
>> >> 
>> >> https://git.savannah.gnu.org/cgit/emacs/elpa.git/tree/plz.el?h=externals-release/plz#n519
>> >> 
>> >> Does this DTRT?
>> >
>> > It could be TRT if plz.el encodes the buffer text "by hand" before
>> > sending the results to curl and decodes it when it receives text from
>> > curl.  Which I think is what happens there.
>> 
>> plz.el does not manually encode buffer text *within Emacs* when sending
>> requests to curl, but by default, plz.el sends data to curl with --data,
>> which tells curl to strip CR and newlines.  With the :body-type 'binary
>> argument, plz.el instead uses --data-binary, which does no conversion.
>
> Newlines is a relatively minor issue (although it, too, needs to be
> considered).  My main concern is with the text encoding.  How can it
> be TRT to use 'binary when sending buffer text to curl? that would
> mean we are more-or-less always sending the internal representation of
> characters, which is superset of UTF-8.  If the data was originally
> encoded in anything but UTF-8, reading it into Emacs and then sending
> it back will change the byte sequences from that other encoding to
> UTF-8.  Moreover, 'binary does not guarantee that the result is valid
> UTF-8.
>
> So maybe I misunderstand how these plz.el facilities are used, but up
> front this sounds like a mistake.

It could be.  Eli, Adam, what do you think about the default coding
systems for encoding the request body in the attached patch?

>> We don't want to strip newlines from hyperdrive files, so we always use
>> :body-type 'binary when sending buffer contents.  Should hyperdrive.el
>> encode data with `buffer-file-coding-system' before passing to plz.el?
>
> I would think so, but maybe we should bring the plz.el developers on
> board of this discussion.

I've CC'd Adam.

>> When receiving text from curl, plz.el optionally decodes the text
>> according to the charset in the 'Content-Type' header, e.g., "text/html;
>> charset=utf-8" or utf-8 if no charset is found.
>
> By "optionally" you mean that it doesn't always happen, except if the
> caller requests that?  If so, the caller of plz.el should decode the
> text manually before using it in user-facing features.

By default, `plz' decodes response body according to the 'Content-Type'
charset (or utf-8 as fallback).  Passing `:decode nil' stops that.

>> Perhaps hyperdrive.el should check the 'Content-Type' header charset,
>> then fallback to guessing the coding system based on filename and file
>> contents with `set-auto-coding' (to avoid decoding images, etc.), and
>> then finally fallback to something else?
>
> Probably.  But then I don't know anything about hyperdrive.el, either.
> If it copies text between files or URLs without showing it to the
> user, then the best strategy is indeed not to decode and encode stuff,
> but handle it as a stream of raw bytes.  (In that case, my suggestion
> would be to use unibyte buffers and strings for temporarily storing
> and processing these raw bytes in Emacs.)  But if the text is somehow
> shown to the user, it must be decoded to be displayed correctly by
> Emacs.  And then it must be encoded back when writing it back to the
> external storage.

Thanks!  Good to know about unibyte buffers and strings for that.

hyperdrive.el does show text to the user, so we'll likely do something
like what I described above.  What fallback encoding should we use if
there's no 'Content-Type' charset and `set-auto-coding' returns nil?
IIUC, there's no foolproof way to guess the encoding of unknown bytes.

default-file-name-coding-system?

Thank you!!  I feel more solid in my understanding of encodings now.

Joseph


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 0001-Add-plz-BODY-CODING-argument-Add-default-encoding.patch --]
[-- Type: text/x-diff, Size: 3377 bytes --]

From a684ff680ab05f359b628623159b4d3392eb448e Mon Sep 17 00:00:00 2001
From: Joseph Turner <joseph@breatheoutbreathe.in>
Date: Thu, 22 Aug 2024 00:02:19 -0700
Subject: [PATCH] Add: (plz) BODY-CODING argument; Add default encoding

Previously, strings and buffers were sent to curl as the internal
Emacs representation.  Now strings and buffers are encoded, and the
BODY-CODING argument can used to override the default coding systems.
---
 plz.el | 29 ++++++++++++++++++++++++++---
 1 file changed, 26 insertions(+), 3 deletions(-)

diff --git a/plz.el b/plz.el
index 903d71e..91d41d2 100644
--- a/plz.el
+++ b/plz.el
@@ -323,7 +323,7 @@ (defalias 'plz--generate-new-buffer
 
 ;;;;; Public
 
-(cl-defun plz (method url &rest rest &key headers body else filter finally noquery timeout
+(cl-defun plz (method url &rest rest &key headers body else filter finally noquery timeout body-coding
                       (as 'string) (then 'sync)
                       (body-type 'text) (decode t decode-s)
                       (connect-timeout plz-connect-timeout))
@@ -340,6 +340,13 @@ (cl-defun plz (method url &rest rest &key headers body else filter finally noque
 BODY-TYPE may be `text' to send BODY as text, or `binary' to send
 it as binary.
 
+BODY-CODING may a coding system used to encode BODY before
+passing it to curl.  BODY-CODING has no effect when BODY is a
+list like `(file FILENAME)'.  If nil and BODY is a string, the
+default process I/O output coding system is used.  If nil and
+BODY is a buffer, the buffer-local value of
+`buffer-file-coding-system' is used.
+
 AS selects the kind of result to pass to the callback function
 THEN, or the kind of result to return for synchronous requests.
 It may be:
@@ -416,6 +423,19 @@ (cl-defun plz (method url &rest rest &key headers body else filter finally noque
   (declare (indent defun))
   (setf decode (if (and decode-s (not decode))
                    nil decode))
+  (unless body-coding
+    (pcase-exhaustive body
+      (`(file ,filename)
+       ;; Don't set BODY-CODING; files are passed as-is to curl.
+       (setf body-coding nil))
+      ((pred stringp)
+       ;; Use default output coding for processes.
+       (setf body-coding (cdr default-process-coding-system)))
+      ((and (pred bufferp) buffer)
+       ;; Use buffer-local coding.
+       (setf body-coding
+             (buffer-local-value 'buffer-file-coding-system buffer)))))
+
   ;; NOTE: By default, for PUT requests and POST requests >1KB, curl sends an
   ;; "Expect:" header, which causes servers to send a "100 Continue" response, which
   ;; we don't want to have to deal with, so we disable it by setting the header to
@@ -553,8 +573,11 @@ (cl-defun plz (method url &rest rest &key headers body else filter finally noque
     (process-send-string process curl-config)
     (when body
       (cl-typecase body
-        (string (process-send-string process body))
-        (buffer (with-current-buffer body
+        (string (process-send-string
+                 process (encode-coding-string body body-coding t)))
+        (buffer (with-temp-buffer
+                  (insert-buffer-substring-no-properties body)
+                  (encode-coding-region (point-min) (point-max) body-coding)
                   (process-send-region process (point-min) (point-max))))))
     (process-send-eof process)
     (if sync-p
-- 
2.41.0


  reply	other threads:[~2024-08-22  7:24 UTC|newest]

Thread overview: 40+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-08-20  7:10 How to get buffer byte length (not number of characters)? Joseph Turner
2024-08-20  7:51 ` Joseph Turner
2024-08-20 11:20   ` Eli Zaretskii
2024-08-20 11:15 ` Eli Zaretskii
2024-08-21  9:20   ` Joseph Turner
2024-08-21 17:47     ` Eli Zaretskii
2024-08-21 23:52       ` Joseph Turner
2024-08-22  4:06         ` Eli Zaretskii
2024-08-22  7:24           ` Joseph Turner [this message]
2024-08-22 11:04             ` Eli Zaretskii
2024-08-22 18:29               ` Joseph Turner
2024-08-22 18:44                 ` Eli Zaretskii
2024-08-22 19:32                   ` tomas
2024-08-23  3:56                   ` Joseph Turner
2024-08-23  7:02                     ` Eli Zaretskii
2024-08-23  7:37                       ` Joseph Turner
2024-08-23 12:34                         ` Eli Zaretskii
2024-08-23  7:43                       ` Joseph Turner
2024-08-23 12:38                         ` Eli Zaretskii
2024-08-23 16:59                           ` Joseph Turner
2024-08-23 17:35                             ` Eli Zaretskii
2024-08-23 20:37                               ` Joseph Turner
2024-08-24  6:14                     ` Joseph Turner
2024-08-22 12:26             ` Adam Porter
2024-08-22 12:47               ` tomas
2024-08-23  6:28                 ` Adam Porter
2024-08-22 13:50               ` Eli Zaretskii
2024-08-23  6:31                 ` Adam Porter
2024-08-23  6:51                   ` Eli Zaretskii
2024-08-23  7:07                   ` Joseph Turner
2024-08-23  7:58                     ` Joseph Turner
2024-08-22  7:09     ` Andreas Schwab
2024-08-22  7:30       ` Joseph Turner
2024-08-22 11:05         ` Eli Zaretskii
2024-08-26  6:37   ` Joseph Turner
2024-08-26  6:49     ` Joseph Turner
2024-08-26 11:22       ` Eli Zaretskii
2024-08-27  4:48         ` Joseph Turner
2024-08-26 11:20     ` Eli Zaretskii
2024-08-20 11:24 ` Andreas Schwab

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87ed6hdnpe.fsf@ushin.org \
    --to=joseph@ushin.org \
    --cc=adam@alphapapa.net \
    --cc=eliz@gnu.org \
    --cc=emacs-devel@gnu.org \
    --cc=schwab@suse.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this external index

	https://git.savannah.gnu.org/cgit/emacs.git
	https://git.savannah.gnu.org/cgit/emacs/org-mode.git

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.