How to get buffer byte length (not number of characters)?


* How to get buffer byte length (not number of characters)?
@ 2024-08-20  7:10 Joseph Turner
  2024-08-20  7:51 ` Joseph Turner
                   ` (2 more replies)
  0 siblings, 3 replies; 40+ messages in thread
From: Joseph Turner @ 2024-08-20  7:10 UTC (permalink / raw)
  To: Emacs Devel Mailing List

Hello!

`buffer-size' returns the number of characters in a buffer:

(with-temp-buffer
  (insert "你好")
  (buffer-size)) ;; 2

However, the buffer's byte length may be larger:

(let* ((filename (make-temp-file "buffer-size-test-"))
       (file (with-temp-file filename (insert "你好"))))
  (file-attribute-size (file-attributes filename))) ;; 6

How can I get a buffer's byte length without writing to a file?

Thank you!

Joseph




* Re: How to get buffer byte length (not number of characters)?
  2024-08-20  7:10 How to get buffer byte length (not number of characters)? Joseph Turner
@ 2024-08-20  7:51 ` Joseph Turner
  2024-08-20 11:20   ` Eli Zaretskii
  2024-08-20 11:15 ` Eli Zaretskii
  2024-08-20 11:24 ` Andreas Schwab
  2 siblings, 1 reply; 40+ messages in thread
From: Joseph Turner @ 2024-08-20  7:51 UTC (permalink / raw)
  To: Emacs Devel Mailing List

Joseph Turner <joseph@ushin.org> writes:

> How can I get a buffer's byte length without writing to a file?

This seems to work:

(with-temp-buffer
  (insert "你好")
  (set-buffer-multibyte nil)
  (buffer-size))  ;; 6

although, curiously, this does not:

(with-temp-buffer
  (set-buffer-multibyte nil)
  (insert "你好")
  (buffer-size))  ;; 2

Is the `set-buffer-multibyte' approach the best solution?

If I have a multibyte string and I want the byte length, do I need to
insert it into a buffer and perform the same dance as above?

Thank you!

Joseph




* Re: How to get buffer byte length (not number of characters)?
  2024-08-20  7:10 How to get buffer byte length (not number of characters)? Joseph Turner
  2024-08-20  7:51 ` Joseph Turner
@ 2024-08-20 11:15 ` Eli Zaretskii
  2024-08-21  9:20   ` Joseph Turner
  2024-08-26  6:37   ` Joseph Turner
  2024-08-20 11:24 ` Andreas Schwab
  2 siblings, 2 replies; 40+ messages in thread
From: Eli Zaretskii @ 2024-08-20 11:15 UTC (permalink / raw)
  To: Joseph Turner; +Cc: emacs-devel

> From: Joseph Turner <joseph@ushin.org>
> Date: Tue, 20 Aug 2024 00:10:50 -0700
> 
> Hello!
> 
> `buffer-size' returns the number of characters in a buffer:
> 
> (with-temp-buffer
>   (insert "你好")
>   (buffer-size)) ;; 2
> 
> However, the buffer's byte length may be larger:
> 
> (let* ((filename (make-temp-file "buffer-size-test-"))
>        (file (with-temp-file filename (insert "你好"))))
>   (file-attribute-size (file-attributes filename))) ;; 6
> 
> How can I get a buffer's byte length without writing to a file?

This depends on why you need the byte length of the buffer.

If I interpret your question literally, then this is the answer:

  (position-bytes (point-max))

perhaps preceded by a call to 'widen'.  But that returns the number of
bytes that the buffer's characters take when represented in the
internal Emacs representation of characters, which is not necessarily
useful to Lisp programs.  For example, if you need to know how many
bytes Emacs will write to a file if you save the buffer, or to a
network connection or a sub-process if you send the buffer there, then
you need to consider the encoding process: Emacs always encodes the
buffer text on output to the external world.  If this is what you
want, then you need to use bufferpos-to-filepos, and make sure you
pass the correct coding-system argument to it.

If you need this for something else, please tell the details.
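
A minimal sketch of the distinction drawn above; it is not taken from
the thread.  `my/buffer-byte-lengths' is a hypothetical helper name and
the coding systems are only assumed targets.  Note that
`position-bytes' returns a 1-based byte position, so the sketch
subtracts 1 to obtain a length.  For simplicity it encodes the text
explicitly with `encode-coding-region' rather than calling
`bufferpos-to-filepos'.

```elisp
(defun my/buffer-byte-lengths (coding-system)
  "Return (CHARS INTERNAL-BYTES ENCODED-BYTES) for the current buffer.
ENCODED-BYTES is the size of the buffer text after encoding it with
CODING-SYSTEM, i.e. what would go to a file or a subprocess.
Hypothetical helper, for illustration only."
  (save-restriction
    (widen)
    (list (buffer-size)
          ;; Byte positions are 1-based, hence the subtraction.
          (1- (position-bytes (point-max)))
          ;; With DESTINATION t the encoded text is returned as a
          ;; unibyte string, so its length is a byte count.
          (length (encode-coding-region (point-min) (point-max)
                                        coding-system t)))))

(with-temp-buffer
  (insert "你好")
  (my/buffer-byte-lengths 'utf-8-unix))      ;; (2 6 6)

(with-temp-buffer
  (insert "\u00e9")                          ; LATIN SMALL LETTER E WITH ACUTE
  (my/buffer-byte-lengths 'iso-latin-1-unix)) ;; (1 2 1)
```

The second call shows why the internal byte count alone can mislead:
the character takes 2 bytes inside Emacs but 1 byte once encoded as
ISO-8859-1.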


* Re: How to get buffer byte length (not number of characters)?
  2024-08-20  7:51 ` Joseph Turner
@ 2024-08-20 11:20   ` Eli Zaretskii
  0 siblings, 0 replies; 40+ messages in thread
From: Eli Zaretskii @ 2024-08-20 11:20 UTC (permalink / raw)
  To: Joseph Turner; +Cc: emacs-devel

> From: Joseph Turner <joseph@ushin.org>
> Date: Tue, 20 Aug 2024 00:51:18 -0700
> 
> Joseph Turner <joseph@ushin.org> writes:
> 
> > How can I get a buffer's byte length without writing to a file?
> 
> This seems to work:
> 
> (with-temp-buffer
>   (insert "你好")
>   (set-buffer-multibyte nil)
>   (buffer-size))  ;; 6
> 
> although, curiously, this does not:
> 
> (with-temp-buffer
>   (set-buffer-multibyte nil)
>   (insert "你好")
>   (buffer-size))  ;; 2
> 
> Is the `set-buffer-multibyte' approach the best solution?

No, as you already discovered.  Unibyte buffers and strings are messy
and full of surprises, so my suggestion is to stay away from them as
much as you can.

> If I have a multibyte string and I want the byte length, do I need to
> insert it into a buffer and perform the same dance as above?

No, you can use string-bytes instead.  But again: whether the result
is useful for the needs that triggered these questions is uncertain,
and my crystal ball says that this is not what you want.
For example, raw bytes sometimes take 2 bytes in the internal Emacs
representation, something that will get in the way of most uses of
these results.

So please tell more about the background and the context of these
questions.
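
A small illustration of the points above, offered as a sketch rather
than as part of the original reply.  `my/string-byte-facts' is a
hypothetical name.  It contrasts `string-bytes' (bytes in the internal
representation) with the length of an explicitly encoded string, and
shows a raw byte growing to 2 bytes once it lives in a multibyte string.

```elisp
(defun my/string-byte-facts ()
  "Return a list of byte counts illustrating `string-bytes'.
Hypothetical helper, for illustration only."
  (list
   ;; Internal representation of two CJK characters.
   (string-bytes "你好")
   ;; The same text after explicit encoding; the result is unibyte,
   ;; so `length' counts bytes.
   (length (encode-coding-string "你好" 'utf-8))
   ;; A unibyte string holding the single byte #xFF.
   (string-bytes "\377")
   ;; The same raw byte inside a multibyte string.
   (string-bytes (string-to-multibyte "\377"))))

(my/string-byte-facts) ;; (6 6 1 2)
```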


* Re: How to get buffer byte length (not number of characters)?
  2024-08-20  7:10 How to get buffer byte length (not number of characters)? Joseph Turner
  2024-08-20  7:51 ` Joseph Turner
  2024-08-20 11:15 ` Eli Zaretskii
@ 2024-08-20 11:24 ` Andreas Schwab
  2 siblings, 0 replies; 40+ messages in thread
From: Andreas Schwab @ 2024-08-20 11:24 UTC (permalink / raw)
  To: Joseph Turner; +Cc: Emacs Devel Mailing List

On Aug 20 2024, Joseph Turner wrote:

> How can I get a buffer's byte length without writing to a file?

That looks like an XY problem.  Why do you need to know that?

-- 
Andreas Schwab, SUSE Labs, schwab@suse.de
GPG Key fingerprint = 0196 BAD8 1CE9 1970 F4BE  1748 E4D4 88E3 0EEA B9D7
"And now for something completely different."




* Re: How to get buffer byte length (not number of characters)?
  2024-08-20 11:15 ` Eli Zaretskii
@ 2024-08-21  9:20   ` Joseph Turner
  2024-08-21 17:47     ` Eli Zaretskii
  2024-08-22  7:09     ` Andreas Schwab
  2024-08-26  6:37   ` Joseph Turner
  1 sibling, 2 replies; 40+ messages in thread
From: Joseph Turner @ 2024-08-21  9:20 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel, Andreas Schwab, Adam Porter

Eli Zaretskii <eliz@gnu.org> writes:

>> From: Joseph Turner <joseph@ushin.org>
>> Date: Tue, 20 Aug 2024 00:10:50 -0700
>>
>> How can I get a buffer's byte length without writing to a file?
>
> This depends on why you need the byte length of the buffer.
>
> If I interpret your question literally, then this is the answer:
>
>   (position-bytes (point-max))
>
> perhaps preceded by a call to 'widen'.  But that returns the number of
> bytes that the buffer's characters take when represented in the
> internal Emacs representation of characters, which is not necessarily
> useful to Lisp programs.  For example, if you need to know how many
> bytes Emacs will write to a file if you save the buffer, or to a
> network connection or a sub-process if you send the buffer there, then
> you need to consider the encoding process: Emacs always encodes the
> buffer text on output to the external world.  If this is what you
> want, then you need to use bufferpos-to-filepos, and make sure you
> pass the correct coding-system argument to it.
>
> If you need this for something else, please tell the details.

Thank you, Eli, Andreas!

Eli's crystal ball is correct: I'd like to know how many bytes Emacs
will send when passing buffer contents (or a string) to a subprocess,
and first I need to figure out which coding system is appropriate.

The hyperdrive.el package provides a UI for creating and accessing
shared virtual filesystems.  hyperdrive.el uses plz.el as an Elisp API
for curl in order to communicate with a local HTTP server.

We want to be able to create hyperdrive "files" in an Emacs buffer and
then upload them with the correct encoding.  We also want to know how
large they will be before uploading them.  A couple of examples:

Let's say I create a textual hyperdrive file using hyperdrive.el, and
then I upload it by sending its contents via curl to the local HTTP
server.  What coding system should be used when the file is uploaded?

Let's say I have a `iso-latin-1'-encoded file "foo.txt" on my local
filesystem.  I upload this encoded file to my hyperdrive by passing the
filename to curl, which uploads the bytes with no conversion.  Then I
open the "foo.txt" hyperdrive file using hyperdrive.el, which receives
the contents via curl from the local HTTP server.  In the hyperdrive
file buffer, buffer-file-coding-system should be `iso-latin-1' (right?).
Then, I edit the buffer and save it to the hyperdrive again with
hyperdrive.el, which this time sends the modified contents over the wire
to curl.  The uploaded file should be `iso-latin-1'-encoded (right?).

Currently, plz.el always creates the curl subprocess like so:

(make-process :coding 'binary ...)

https://git.savannah.gnu.org/cgit/emacs/elpa.git/tree/plz.el?h=externals-release/plz#n519

Does this DTRT?  Should we use buffer-file-coding-system not 'binary?

Thank you for helping me understand encodings in Emacs.

Joseph


* Re: How to get buffer byte length (not number of characters)?
  2024-08-21  9:20   ` Joseph Turner
@ 2024-08-21 17:47     ` Eli Zaretskii
  2024-08-21 23:52       ` Joseph Turner
  2024-08-22  7:09     ` Andreas Schwab
  1 sibling, 1 reply; 40+ messages in thread
From: Eli Zaretskii @ 2024-08-21 17:47 UTC (permalink / raw)
  To: Joseph Turner; +Cc: emacs-devel, schwab, adam

> From: Joseph Turner <joseph@ushin.org>
> Cc: emacs-devel@gnu.org, Andreas Schwab <schwab@suse.de>, Adam Porter
>  <adam@alphapapa.net>
> Date: Wed, 21 Aug 2024 02:20:09 -0700
> 
> Let's say I create a textual hyperdrive file using hyperdrive.el, and
> then I upload it by sending its contents via curl to the local HTTP
> server.  What coding system should be used when the file is uploaded?
> 
> Let's say I have a `iso-latin-1'-encoded file "foo.txt" on my local
> filesystem.  I upload this encoded file to my hyperdrive by passing the
> filename to curl, which uploads the bytes with no conversion.  Then I
> open the "foo.txt" hyperdrive file using hyperdrive.el, which receives
> the contents via curl from the local HTTP server.  In the hyperdrive
> file buffer, buffer-file-coding-system should be `iso-latin-1' (right?).

It's what I would expect, yes.  But you can try it yourself, of course
and make sure it is indeed what happens.

> Then, I edit the buffer and save it to the hyperdrive again with
> hyperdrive.el, which this time sends the modified contents over the wire
> to curl.  The uploaded file should be `iso-latin-1'-encoded (right?).

Again, that'd be my expectation.  But it's better to test this
assumption.

> Currently, plz.el always creates the curl subprocess like so:
> 
> (make-process :coding 'binary ...)
> 
> https://git.savannah.gnu.org/cgit/emacs/elpa.git/tree/plz.el?h=externals-release/plz#n519
> 
> Does this DTRT?

It could be TRT if plz.el encodes the buffer text "by hand" before
sending the results to curl and decodes it when it receives text from
curl.  Which I think is what happens there.




* Re: How to get buffer byte length (not number of characters)?
  2024-08-21 17:47     ` Eli Zaretskii
@ 2024-08-21 23:52       ` Joseph Turner
  2024-08-22  4:06         ` Eli Zaretskii
  0 siblings, 1 reply; 40+ messages in thread
From: Joseph Turner @ 2024-08-21 23:52 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel, schwab, adam

Eli Zaretskii <eliz@gnu.org> writes:

>> From: Joseph Turner <joseph@ushin.org>
>> Cc: emacs-devel@gnu.org, Andreas Schwab <schwab@suse.de>, Adam Porter
>>  <adam@alphapapa.net>
>> Date: Wed, 21 Aug 2024 02:20:09 -0700
>> 
>> Let's say I create a textual hyperdrive file using hyperdrive.el, and
>> then I upload it by sending its contents via curl to the local HTTP
>> server.  What coding system should be used when the file is uploaded?
>> 
>> Let's say I have a `iso-latin-1'-encoded file "foo.txt" on my local
>> filesystem.  I upload this encoded file to my hyperdrive by passing the
>> filename to curl, which uploads the bytes with no conversion.  Then I
>> open the "foo.txt" hyperdrive file using hyperdrive.el, which receives
>> the contents via curl from the local HTTP server.  In the hyperdrive
>> file buffer, buffer-file-coding-system should be `iso-latin-1' (right?).
>
> It's what I would expect, yes.  But you can try it yourself, of course
> and make sure it is indeed what happens.
>
>> Then, I edit the buffer and save it to the hyperdrive again with
>> hyperdrive.el, which this time sends the modified contents over the wire
>> to curl.  The uploaded file should be `iso-latin-1'-encoded (right?).
>
> Again, that'd be my expectation.  But it's better to test this
> assumption.
>
>> Currently, plz.el always creates the curl subprocess like so:
>> 
>> (make-process :coding 'binary ...)
>> 
>> https://git.savannah.gnu.org/cgit/emacs/elpa.git/tree/plz.el?h=externals-release/plz#n519
>> 
>> Does this DTRT?
>
> It could be TRT if plz.el encodes the buffer text "by hand" before
> sending the results to curl and decodes it when it receives text from
> curl.  Which I think is what happens there.

plz.el does not manually encode buffer text *within Emacs* when sending
requests to curl, but by default, plz.el sends data to curl with --data,
which tells curl to strip CR and newlines.  With the :body-type 'binary
argument, plz.el instead uses --data-binary, which does no conversion.

We don't want to strip newlines from hyperdrive files, so we always use
:body-type 'binary when sending buffer contents.  Should hyperdrive.el
encode data with `buffer-file-coding-system' before passing to plz.el?

When receiving text from curl, plz.el optionally decodes the text
according to the charset in the 'Content-Type' header, e.g., "text/html;
charset=utf-8" or utf-8 if no charset is found.

Perhaps hyperdrive.el should check the 'Content-Type' header charset,
then fallback to guessing the coding system based on filename and file
contents with `set-auto-coding' (to avoid decoding images, etc.), and
then finally fallback to something else?
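
The first step of that fallback chain could look roughly like the
sketch below.  It is an illustration under stated assumptions, not
plz.el's or hyperdrive.el's actual code: `my/content-type-coding-system'
is a hypothetical name, the header value is assumed to be available as
a plain string, and charset names are assumed to map to Emacs coding
systems by simple downcasing, which will not hold for every charset.

```elisp
(defun my/content-type-coding-system (content-type &optional fallback)
  "Return a coding system for the charset named in CONTENT-TYPE.
CONTENT-TYPE is a header value such as \"text/html; charset=utf-8\",
or nil.  If no usable charset is found, return FALLBACK, or `utf-8'
when FALLBACK is nil.  Hypothetical helper, for illustration only."
  (let ((coding (and content-type
                     (string-match "charset=\"?\\([^\";[:space:]]+\\)"
                                   content-type)
                     (intern (downcase (match-string 1 content-type))))))
    ;; Only trust the name if Emacs actually knows such a coding system.
    (if (and coding (coding-system-p coding))
        coding
      (or fallback 'utf-8))))

(my/content-type-coding-system "text/html; charset=utf-8")       ;; utf-8
(my/content-type-coding-system "text/plain; charset=ISO-8859-1") ;; iso-8859-1
(my/content-type-coding-system nil)                              ;; utf-8
```

A caller that must not decode binary payloads (images, etc.) would
check the media type before applying the fallback.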

Thank you!

Joseph


* Re: How to get buffer byte length (not number of characters)?
  2024-08-21 23:52       ` Joseph Turner
@ 2024-08-22  4:06         ` Eli Zaretskii
  2024-08-22  7:24           ` Joseph Turner
  0 siblings, 1 reply; 40+ messages in thread
From: Eli Zaretskii @ 2024-08-22  4:06 UTC (permalink / raw)
  To: Joseph Turner; +Cc: emacs-devel, schwab, adam

> From: Joseph Turner <joseph@ushin.org>
> Cc: emacs-devel@gnu.org,  schwab@suse.de,  adam@alphapapa.net
> Date: Wed, 21 Aug 2024 16:52:39 -0700
> 
> Eli Zaretskii <eliz@gnu.org> writes:
> 
> >> Currently, plz.el always creates the curl subprocess like so:
> >> 
> >> (make-process :coding 'binary ...)
> >> 
> >> https://git.savannah.gnu.org/cgit/emacs/elpa.git/tree/plz.el?h=externals-release/plz#n519
> >> 
> >> Does this DTRT?
> >
> > It could be TRT if plz.el encodes the buffer text "by hand" before
> > sending the results to curl and decodes it when it receives text from
> > curl.  Which I think is what happens there.
> 
> plz.el does not manually encode buffer text *within Emacs* when sending
> requests to curl, but by default, plz.el sends data to curl with --data,
> which tells curl to strip CR and newlines.  With the :body-type 'binary
> argument, plz.el instead uses --data-binary, which does no conversion.

Newlines are a relatively minor issue (although they, too, need to be
considered).  My main concern is with the text encoding.  How can it
be TRT to use 'binary when sending buffer text to curl?  That would
mean we are more-or-less always sending the internal representation of
characters, which is a superset of UTF-8.  If the data was originally
encoded in anything but UTF-8, reading it into Emacs and then sending
it back will change the byte sequences from that other encoding to
UTF-8.  Moreover, 'binary does not guarantee that the result is valid
UTF-8.

So maybe I misunderstand how these plz.el facilities are used, but up
front this sounds like a mistake.

> We don't want to strip newlines from hyperdrive files, so we always use
> :body-type 'binary when sending buffer contents.  Should hyperdrive.el
> encode data with `buffer-file-coding-system' before passing to plz.el?

I would think so, but maybe we should bring the plz.el developers on
board of this discussion.

> When receiving text from curl, plz.el optionally decodes the text
> according to the charset in the 'Content-Type' header, e.g., "text/html;
> charset=utf-8" or utf-8 if no charset is found.

By "optionally" you mean that it doesn't always happen, except if the
caller requests that?  If so, the caller of plz.el should decode the
text manually before using it in user-facing features.

> Perhaps hyperdrive.el should check the 'Content-Type' header charset,
> then fallback to guessing the coding system based on filename and file
> contents with `set-auto-coding' (to avoid decoding images, etc.), and
> then finally fallback to something else?

Probably.  But then I don't know anything about hyperdrive.el, either.
If it copies text between files or URLs without showing it to the
user, then the best strategy is indeed not to decode and encode stuff,
but handle it as a stream of raw bytes.  (In that case, my suggestion
would be to use unibyte buffers and strings for temporarily storing
and processing these raw bytes in Emacs.)  But if the text is somehow
shown to the user, it must be decoded to be displayed correctly by
Emacs.  And then it must be encoded back when writing it back to the
external storage.
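
As a rough sketch of that decode, edit, encode cycle (an illustration,
not code from any of the packages discussed): `my/edit-external-text'
is a hypothetical name, and the bytes are assumed to arrive as a
unibyte string whose coding system is already known.

```elisp
(defun my/edit-external-text (raw coding-system edit-fn)
  "Decode unibyte RAW with CODING-SYSTEM, apply EDIT-FN, and re-encode.
EDIT-FN receives and returns a decoded (multibyte) string.  The
result is a unibyte string ready to be written back out.
Hypothetical helper, for illustration only."
  (let ((text (decode-coding-string raw coding-system)))
    (encode-coding-string (funcall edit-fn text) coding-system)))

;; #xE9 is "é" in ISO-8859-1.  Appending "!" and encoding again keeps
;; the original byte and adds #x21; nothing is converted to UTF-8.
(my/edit-external-text (unibyte-string #xE9)
                       'iso-latin-1
                       (lambda (s) (concat s "!")))
```

If the same text were instead sent with no encoding step, the "é"
would go out in Emacs's internal form, as 2 bytes rather than the
original 1.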


* Re: How to get buffer byte length (not number of characters)?
  2024-08-21  9:20   ` Joseph Turner
  2024-08-21 17:47     ` Eli Zaretskii
@ 2024-08-22  7:09     ` Andreas Schwab
  2024-08-22  7:30       ` Joseph Turner
  1 sibling, 1 reply; 40+ messages in thread
From: Andreas Schwab @ 2024-08-22  7:09 UTC (permalink / raw)
  To: Joseph Turner; +Cc: Eli Zaretskii, emacs-devel, Adam Porter

On Aug 21 2024, Joseph Turner wrote:

> Let's say I have a `iso-latin-1'-encoded file "foo.txt" on my local
> filesystem.  I upload this encoded file to my hyperdrive by passing the
> filename to curl, which uploads the bytes with no conversion.  Then I
> open the "foo.txt" hyperdrive file using hyperdrive.el, which receives
> the contents via curl from the local HTTP server.  In the hyperdrive
> file buffer, buffer-file-coding-system should be `iso-latin-1' (right?).

That depends on the coding system priorities.  Since latin-1 cannot be
identified unambiguously, only the priority can distinguish it from
other 8-bit coding systems.  Also, if the file contains only ASCII
characters, buffer-file-coding-system will be set to undecided.

-- 
Andreas Schwab, SUSE Labs, schwab@suse.de
GPG Key fingerprint = 0196 BAD8 1CE9 1970 F4BE  1748 E4D4 88E3 0EEA B9D7
"And now for something completely different."




* Re: How to get buffer byte length (not number of characters)?
  2024-08-22  4:06         ` Eli Zaretskii
@ 2024-08-22  7:24           ` Joseph Turner
  2024-08-22 11:04             ` Eli Zaretskii
  2024-08-22 12:26             ` Adam Porter
  0 siblings, 2 replies; 40+ messages in thread
From: Joseph Turner @ 2024-08-22  7:24 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel, schwab, adam

[-- Attachment #1: Type: text/plain, Size: 4002 bytes --]

Eli Zaretskii <eliz@gnu.org> writes:

>> From: Joseph Turner <joseph@ushin.org>
>> Cc: emacs-devel@gnu.org,  schwab@suse.de,  adam@alphapapa.net
>> Date: Wed, 21 Aug 2024 16:52:39 -0700
>> 
>> Eli Zaretskii <eliz@gnu.org> writes:
>> 
>> >> Currently, plz.el always creates the curl subprocess like so:
>> >> 
>> >> (make-process :coding 'binary ...)
>> >> 
>> >> https://git.savannah.gnu.org/cgit/emacs/elpa.git/tree/plz.el?h=externals-release/plz#n519
>> >> 
>> >> Does this DTRT?
>> >
>> > It could be TRT if plz.el encodes the buffer text "by hand" before
>> > sending the results to curl and decodes it when it receives text from
>> > curl.  Which I think is what happens there.
>> 
>> plz.el does not manually encode buffer text *within Emacs* when sending
>> requests to curl, but by default, plz.el sends data to curl with --data,
>> which tells curl to strip CR and newlines.  With the :body-type 'binary
>> argument, plz.el instead uses --data-binary, which does no conversion.
>
> Newlines are a relatively minor issue (although they, too, need to be
> considered).  My main concern is with the text encoding.  How can it
> be TRT to use 'binary when sending buffer text to curl?  That would
> mean we are more-or-less always sending the internal representation of
> characters, which is a superset of UTF-8.  If the data was originally
> encoded in anything but UTF-8, reading it into Emacs and then sending
> it back will change the byte sequences from that other encoding to
> UTF-8.  Moreover, 'binary does not guarantee that the result is valid
> UTF-8.
>
> So maybe I misunderstand how these plz.el facilities are used, but up
> front this sounds like a mistake.

It could be.  Eli, Adam, what do you think about the default coding
systems for encoding the request body in the attached patch?

>> We don't want to strip newlines from hyperdrive files, so we always use
>> :body-type 'binary when sending buffer contents.  Should hyperdrive.el
>> encode data with `buffer-file-coding-system' before passing to plz.el?
>
> I would think so, but maybe we should bring the plz.el developers on
> board of this discussion.

I've CC'd Adam.

>> When receiving text from curl, plz.el optionally decodes the text
>> according to the charset in the 'Content-Type' header, e.g., "text/html;
>> charset=utf-8" or utf-8 if no charset is found.
>
> By "optionally" you mean that it doesn't always happen, except if the
> caller requests that?  If so, the caller of plz.el should decode the
> text manually before using it in user-facing features.

By default, `plz' decodes response body according to the 'Content-Type'
charset (or utf-8 as fallback).  Passing `:decode nil' stops that.

>> Perhaps hyperdrive.el should check the 'Content-Type' header charset,
>> then fallback to guessing the coding system based on filename and file
>> contents with `set-auto-coding' (to avoid decoding images, etc.), and
>> then finally fallback to something else?
>
> Probably.  But then I don't know anything about hyperdrive.el, either.
> If it copies text between files or URLs without showing it to the
> user, then the best strategy is indeed not to decode and encode stuff,
> but handle it as a stream of raw bytes.  (In that case, my suggestion
> would be to use unibyte buffers and strings for temporarily storing
> and processing these raw bytes in Emacs.)  But if the text is somehow
> shown to the user, it must be decoded to be displayed correctly by
> Emacs.  And then it must be encoded back when writing it back to the
> external storage.

Thanks!  Good to know about unibyte buffers and strings for that.

hyperdrive.el does show text to the user, so we'll likely do something
like what I described above.  What fallback encoding should we use if
there's no 'Content-Type' charset and `set-auto-coding' returns nil?
IIUC, there's no foolproof way to guess the encoding of unknown bytes.

default-file-name-coding-system?

Thank you!!  I feel more solid in my understanding of encodings now.

Joseph


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 0001-Add-plz-BODY-CODING-argument-Add-default-encoding.patch --]
[-- Type: text/x-diff, Size: 3377 bytes --]

From a684ff680ab05f359b628623159b4d3392eb448e Mon Sep 17 00:00:00 2001
From: Joseph Turner <joseph@breatheoutbreathe.in>
Date: Thu, 22 Aug 2024 00:02:19 -0700
Subject: [PATCH] Add: (plz) BODY-CODING argument; Add default encoding

Previously, strings and buffers were sent to curl as the internal
Emacs representation.  Now strings and buffers are encoded, and the
BODY-CODING argument can be used to override the default coding systems.
---
 plz.el | 29 ++++++++++++++++++++++++++---
 1 file changed, 26 insertions(+), 3 deletions(-)

diff --git a/plz.el b/plz.el
index 903d71e..91d41d2 100644
--- a/plz.el
+++ b/plz.el
@@ -323,7 +323,7 @@ (defalias 'plz--generate-new-buffer
 
 ;;;;; Public
 
-(cl-defun plz (method url &rest rest &key headers body else filter finally noquery timeout
+(cl-defun plz (method url &rest rest &key headers body else filter finally noquery timeout body-coding
                       (as 'string) (then 'sync)
                       (body-type 'text) (decode t decode-s)
                       (connect-timeout plz-connect-timeout))
@@ -340,6 +340,13 @@ (cl-defun plz (method url &rest rest &key headers body else filter finally noque
 BODY-TYPE may be `text' to send BODY as text, or `binary' to send
 it as binary.
 
+BODY-CODING may be a coding system used to encode BODY before
+passing it to curl.  BODY-CODING has no effect when BODY is a
+list like `(file FILENAME)'.  If nil and BODY is a string, the
+default process I/O output coding system is used.  If nil and
+BODY is a buffer, the buffer-local value of
+`buffer-file-coding-system' is used.
+
 AS selects the kind of result to pass to the callback function
 THEN, or the kind of result to return for synchronous requests.
 It may be:
@@ -416,6 +423,19 @@ (cl-defun plz (method url &rest rest &key headers body else filter finally noque
   (declare (indent defun))
   (setf decode (if (and decode-s (not decode))
                    nil decode))
+  (unless body-coding
+    (pcase-exhaustive body
+      (`(file ,filename)
+       ;; Don't set BODY-CODING; files are passed as-is to curl.
+       (setf body-coding nil))
+      ((pred stringp)
+       ;; Use default output coding for processes.
+       (setf body-coding (cdr default-process-coding-system)))
+      ((and (pred bufferp) buffer)
+       ;; Use buffer-local coding.
+       (setf body-coding
+             (buffer-local-value 'buffer-file-coding-system buffer)))))
+
   ;; NOTE: By default, for PUT requests and POST requests >1KB, curl sends an
   ;; "Expect:" header, which causes servers to send a "100 Continue" response, which
   ;; we don't want to have to deal with, so we disable it by setting the header to
@@ -553,8 +573,11 @@ (cl-defun plz (method url &rest rest &key headers body else filter finally noque
     (process-send-string process curl-config)
     (when body
       (cl-typecase body
-        (string (process-send-string process body))
-        (buffer (with-current-buffer body
+        (string (process-send-string
+                 process (encode-coding-string body body-coding t)))
+        (buffer (with-temp-buffer
+                  (insert-buffer-substring-no-properties body)
+                  (encode-coding-region (point-min) (point-max) body-coding)
                   (process-send-region process (point-min) (point-max))))))
     (process-send-eof process)
     (if sync-p
-- 
2.41.0



* Re: How to get buffer byte length (not number of characters)?
  2024-08-22  7:09     ` Andreas Schwab
@ 2024-08-22  7:30       ` Joseph Turner
  2024-08-22 11:05         ` Eli Zaretskii
  0 siblings, 1 reply; 40+ messages in thread
From: Joseph Turner @ 2024-08-22  7:30 UTC (permalink / raw)
  To: Andreas Schwab; +Cc: Eli Zaretskii, emacs-devel, Adam Porter

Andreas Schwab <schwab@suse.de> writes:

> On Aug 21 2024, Joseph Turner wrote:
>
>> Let's say I have a `iso-latin-1'-encoded file "foo.txt" on my local
>> filesystem.  I upload this encoded file to my hyperdrive by passing the
>> filename to curl, which uploads the bytes with no conversion.  Then I
>> open the "foo.txt" hyperdrive file using hyperdrive.el, which receives
>> the contents via curl from the local HTTP server.  In the hyperdrive
>> file buffer, buffer-file-coding-system should be `iso-latin-1' (right?).
>
> That depends on the coding system priorities.  Since latin-1 cannot be
> identified unambiguously, only the priority can distinguish it from
> other 8-bit coding systems.  Also, if the file contains only ASCII
> characters, buffer-file-coding-system will be set to undecided.

Thank you!  I think you just answered my question to Eli:

> What fallback encoding should we use if there's no 'Content-Type'
> charset and `set-auto-coding' returns nil?  IIUC, there's no foolproof
> way to guess the encoding of unknown bytes.

Should we use (coding-system-priority-list t) as the final fallback?

Joseph





* Re: How to get buffer byte length (not number of characters)?
  2024-08-22  7:24           ` Joseph Turner
@ 2024-08-22 11:04             ` Eli Zaretskii
  2024-08-22 18:29               ` Joseph Turner
  2024-08-22 12:26             ` Adam Porter
  1 sibling, 1 reply; 40+ messages in thread
From: Eli Zaretskii @ 2024-08-22 11:04 UTC (permalink / raw)
  To: Joseph Turner; +Cc: emacs-devel, schwab, adam

> From: Joseph Turner <joseph@ushin.org>
> Cc: emacs-devel@gnu.org,  schwab@suse.de,  adam@alphapapa.net
> Date: Thu, 22 Aug 2024 00:24:45 -0700
> 
> > So maybe I misunderstand how these plz.el facilities are used, but up
> > front this sounds like a mistake.
> 
> It could be.  Eli, Adam, what do you think about the default coding
> systems for encoding the request body in the attached patch?

I think it is better to use detect-coding-region instead, if
buffer-file-coding-system is undecided.

> > By "optionally" you mean that it doesn't always happen, except if the
> > caller requests that?  If so, the caller of plz.el should decode the
> > text manually before using it in user-facing features.
> 
> By default, `plz' decodes response body according to the 'Content-Type'
> charset (or utf-8 as fallback).  Passing `:decode nil' stops that.

Sounds correct.

> default-file-name-coding-system?

That's for file names, so it is not what you want here.
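
The suggestion about `detect-coding-region' might be sketched as
follows.  This is only an illustration: `my/body-coding-for-buffer' is
a hypothetical name, not part of plz.el or of the attached patch, and
what the detection returns depends on the buffer text and on the
coding-system priorities in effect.

```elisp
(defun my/body-coding-for-buffer (buffer)
  "Return a coding system to use when encoding BUFFER's text.
Use BUFFER's `buffer-file-coding-system' unless it is nil or
undecided, in which case ask `detect-coding-region' for its
highest-priority guess.  Hypothetical helper, for illustration only."
  (with-current-buffer buffer
    (let ((coding buffer-file-coding-system))
      (if (or (null coding)
              (eq (coding-system-type coding) 'undecided))
          (save-restriction
            (widen)
            ;; A non-nil third argument asks for a single coding
            ;; system instead of a list of candidates.
            (detect-coding-region (point-min) (point-max) t))
        coding))))
```

For text that is pure ASCII the guess may itself still be an undecided
variant, so a caller would probably want one more explicit default
after this step.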




* Re: How to get buffer byte length (not number of characters)?
  2024-08-22  7:30       ` Joseph Turner
@ 2024-08-22 11:05         ` Eli Zaretskii
  0 siblings, 0 replies; 40+ messages in thread
From: Eli Zaretskii @ 2024-08-22 11:05 UTC (permalink / raw)
  To: Joseph Turner; +Cc: schwab, emacs-devel, adam

> From: Joseph Turner <joseph@ushin.org>
> Cc: Eli Zaretskii <eliz@gnu.org>,  emacs-devel@gnu.org,  Adam Porter
>  <adam@alphapapa.net>
> Date: Thu, 22 Aug 2024 00:30:51 -0700
> 
> Should we use (coding-system-priority-list t) as the final fallback?

detect-coding-region already does that.




* Re: How to get buffer byte length (not number of characters)?
  2024-08-22  7:24           ` Joseph Turner
  2024-08-22 11:04             ` Eli Zaretskii
@ 2024-08-22 12:26             ` Adam Porter
  2024-08-22 12:47               ` tomas
  2024-08-22 13:50               ` Eli Zaretskii
  1 sibling, 2 replies; 40+ messages in thread
From: Adam Porter @ 2024-08-22 12:26 UTC (permalink / raw)
  To: Joseph Turner, Eli Zaretskii; +Cc: emacs-devel, schwab

Hi Joseph, et al,

On 8/22/24 02:24, Joseph Turner wrote:

>>> plz.el does not manually encode buffer text *within Emacs* when sending
>>> requests to curl, but by default, plz.el sends data to curl with --data,
>>> which tells curl to strip CR and newlines.  With the :body-type 'binary
>>> argument, plz.el instead uses --data-binary, which does no conversion.
>>
>> Newlines is a relatively minor issue (although it, too, needs to be
>> considered).  My main concern is with the text encoding.  How can it
>> be TRT to use 'binary when sending buffer text to curl? that would
>> mean we are more-or-less always sending the internal representation of
>> characters, which is superset of UTF-8.  If the data was originally
>> encoded in anything but UTF-8, reading it into Emacs and then sending
>> it back will change the byte sequences from that other encoding to
>> UTF-8.  Moreover, 'binary does not guarantee that the result is valid
>> UTF-8.
>>
>> So maybe I misunderstand how these plz.el facilities are used, but up
>> front this sounds like a mistake.
> 
> It could be.  Eli, Adam, what do you think about the default coding
> systems for encoding the request body in the attached patch?

From an API perspective, I'm not sure.  My idea for plz.el is to 
provide a simple, somewhat idiomatic Elisp API for making HTTP requests 
(and, of course, to make "correct" requests, in compliance with 
specifications and expectations).  Given the relatively few clients of 
plz thus far, some issues are yet to be fully explored and developed, 
and encoding/decoding may be one of those rougher edges.  For the use 
cases I'm aware of, it seems to work well and correctly, but there are 
undoubtedly improvements to be made.

Encoding/decoding is not exactly a simple matter, especially with regard 
to API design.  Ultimately, no library can abstract it away from users' 
need to understand it.  And I want plz's API to not have to change any 
more than necessary over time, so I'd want to be very deliberate with 
any changes to it.  So it's appealing to do as little as possible in 
this regard, leaving as much as possible to the upstream user to handle 
outside of plz.

One way to do that is to do what hyperdrive.el is basically doing now, 
to tell plz to tell curl to handle the data as binary, i.e. to pass it 
through unchanged.  But it seems that we haven't covered all of the 
bases with regard to these issues; rather, we have tested a subset of 
them that seem to work as expected.

Also, where it's possible to make plz DTRT automatically, integrating 
naturally with Elisp APIs and data structures, I'm certainly in favor of 
that.  So, e.g. automatically using a buffer's expected encoding when 
passing its data to curl seems like the right thing to do, which plz 
doesn't do yet (and perhaps we could do the same thing when returning a 
buffer of data).

Of course, AFAIK we can't do such a thing when passing a string, so I 
guess the most we can do there is document recommended patterns for the 
user; IOW I'm tempted to leave encoding of strings to the user rather 
than add another argument for that, but we can talk about it.

Thanks,
Adam

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: How to get buffer byte length (not number of characters)?
  2024-08-22 12:26             ` Adam Porter
@ 2024-08-22 12:47               ` tomas
  2024-08-23  6:28                 ` Adam Porter
  2024-08-22 13:50               ` Eli Zaretskii
  1 sibling, 1 reply; 40+ messages in thread
From: tomas @ 2024-08-22 12:47 UTC (permalink / raw)
  To: Adam Porter; +Cc: Joseph Turner, Eli Zaretskii, emacs-devel, schwab

[-- Attachment #1: Type: text/plain, Size: 1280 bytes --]

On Thu, Aug 22, 2024 at 07:26:58AM -0500, Adam Porter wrote:

[...]

> From an API perspective, I'm not sure.  My idea for plz.el is to provide a
> simple, somewhat idiomatic Elisp API for making HTTP requests (and, of
> course, to make "correct" requests, in compliance with specifications and
> expectations).  Given the relatively few clients of plz thus far, some
> issues are yet to be fully explored and developed, and encoding/decoding may
> be one of those rougher edges.  For the use cases I'm aware of, it seems to
> work well and correctly, but there are undoubtedly improvements to be made.

Another point I haven't seen in this discussion is that HTTP also may
carry metadata about what it thinks the content encoding is. This may
involve the server configuration too.

You may choose to ignore it, but then you need to convince all the
moving parts to agree on that (i.e. an Apache on the other side might
happily tell your client that it is sending "text/plain; charset=UTF-8"
or something similarly funny (note: UTF-8 isn't a charset :-) depending
on the web server config, the content of some mimetype database and
so on).

You'll have to make sure curl and Emacs do the right thing with that
(which might be "nothing").

Cheers
-- 
t

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: How to get buffer byte length (not number of characters)?
  2024-08-22 12:26             ` Adam Porter
  2024-08-22 12:47               ` tomas
@ 2024-08-22 13:50               ` Eli Zaretskii
  2024-08-23  6:31                 ` Adam Porter
  1 sibling, 1 reply; 40+ messages in thread
From: Eli Zaretskii @ 2024-08-22 13:50 UTC (permalink / raw)
  To: Adam Porter; +Cc: joseph, emacs-devel, schwab

> Date: Thu, 22 Aug 2024 07:26:58 -0500
> Cc: emacs-devel@gnu.org, schwab@suse.de
> From: Adam Porter <adam@alphapapa.net>
> 
> > It could be.  Eli, Adam, what do you think about the default coding
> > systems for encoding the request body in the attached patch?
> 
> From an API perspective, I'm not sure.  My idea for plz.el is to 
> provide a simple, somewhat idiomatic Elisp API for making HTTP requests 
> (and, of course, to make "correct" requests, in compliance with 
> specifications and expectations).  Given the relatively few clients of 
> plz thus far, some issues are yet to be fully explored and developed, 
> and encoding/decoding may be one of those rougher edges.  For the use 
> cases I'm aware of, it seems to work well and correctly, but there are 
> undoubtedly improvements to be made.
> 
> Encoding/decoding is not exactly a simple matter, especially with regard 
> to API design.  Ultimately, no library can abstract it away from users' 
> need to understand it.  And I want plz's API to not have to change any 
> more than necessary over time, so I'd want to be very deliberate with 
> any changes to it.  So it's appealing to do as little as possible in 
> this regard, leaving as much as possible to the upstream user to handle 
> outside of plz.

But AFAICT, plz.el does decode the stuff it gets from curl, which
doesn't seem to be consistent with what you say above.  If plz.el
would accept unibyte text and return unibyte text, that would be
consistent: it would mean that any callers of plz.el need to do
encoding and decoding themselves.  But that doesn't seem to be the
case now.

Am I missing something?



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: How to get buffer byte length (not number of characters)?
  2024-08-22 11:04             ` Eli Zaretskii
@ 2024-08-22 18:29               ` Joseph Turner
  2024-08-22 18:44                 ` Eli Zaretskii
  0 siblings, 1 reply; 40+ messages in thread
From: Joseph Turner @ 2024-08-22 18:29 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel, schwab, adam

Eli Zaretskii <eliz@gnu.org> writes:

>> From: Joseph Turner <joseph@ushin.org>
>> Cc: emacs-devel@gnu.org,  schwab@suse.de,  adam@alphapapa.net
>> Date: Thu, 22 Aug 2024 00:24:45 -0700
>> 
>> > So maybe I misunderstand how these plz.el facilities are used, but up
>> > front this sounds like a mistake.
>> 
>> It could be.  Eli, Adam, what do you think about the default coding
>> systems for encoding the request body in the attached patch?
>
> I think it is better to use detect-coding-region instead, if
> buffer-file-coding-system is undecided.

detect-coding-region is only useful when decoding text, right?

For encoding text, should we encode with buffer-file-coding-system?

>> > By "optionally" you mean that it doesn't always happen, except if the
>> > caller requests that?  If so, the caller of plz.el should decode the
>> > text manually before using it in user-facing features.
>> 
>> By default, `plz' decodes response body according to the 'Content-Type'
>> charset (or utf-8 as fallback).  Passing `:decode nil' stops that.
>
> Sounds correct.

When decoding, should plz fall back to detect-coding-region instead of utf-8?

>> default-file-name-coding-system?
>
> That's for file names, so it is not what you want here.

Thanks!  So when decoding text in hyperdrive.el, we can use (1)
Content-Type charset, or (2) use `detect-coding-region' as a fallback.

IIUC, there's no need to use `set-auto-coding', since
`detect-coding-region' DTRT.

Joseph



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: How to get buffer byte length (not number of characters)?
  2024-08-22 18:29               ` Joseph Turner
@ 2024-08-22 18:44                 ` Eli Zaretskii
  2024-08-22 19:32                   ` tomas
  2024-08-23  3:56                   ` Joseph Turner
  0 siblings, 2 replies; 40+ messages in thread
From: Eli Zaretskii @ 2024-08-22 18:44 UTC (permalink / raw)
  To: Joseph Turner; +Cc: emacs-devel, schwab, adam

> From: Joseph Turner <joseph@ushin.org>
> Cc: emacs-devel@gnu.org,  schwab@suse.de,  adam@alphapapa.net
> Date: Thu, 22 Aug 2024 11:29:48 -0700
> 
> Eli Zaretskii <eliz@gnu.org> writes:
> 
> > I think it is better to use detect-coding-region instead, if
> > buffer-file-coding-system is undecided.
> 
> detect-coding-region is only useful when decoding text, right?

Yes, sorry.  I should have said find-coding-systems-region.

> For encoding text, should we encode with buffer-file-coding-system?

If you are sure it will do, yes.  But what if the buffer started as
all-ASCII and then the user or some Lisp program added some non-ASCII
characters before saving?  Then buffer-file-coding-system is no longer
pertinent.

> >> > By "optionally" you mean that it doesn't always happen, except if the
> >> > caller requests that?  If so, the caller of plz.el should decode the
> >> > text manually before using it in user-facing features.
> >> 
> >> By default, `plz' decodes response body according to the 'Content-Type'
> >> charset (or utf-8 as fallback).  Passing `:decode nil' stops that.
> >
> > Sounds correct.
> 
> When decoding, should plz fall back to detect-coding-region instead of utf-8?

If this is HTML, then I think it is okay to trust the headers about
the charset and default to UTF-8.  The problem with
detect-coding-region is that some of it is based on guesswork, which
is one reason why it could take a UTF-8 encoded text to be Latin-1.
So if a more reliable source of information is available, we had
better use it.

> Thanks!  So when decoding text in hyperdrive.el, we can use (1)
> Content-Type charset, or (2) use `detect-coding-region' as a fallback.

That's also possible.



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: How to get buffer byte length (not number of characters)?
  2024-08-22 18:44                 ` Eli Zaretskii
@ 2024-08-22 19:32                   ` tomas
  2024-08-23  3:56                   ` Joseph Turner
  1 sibling, 0 replies; 40+ messages in thread
From: tomas @ 2024-08-22 19:32 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Joseph Turner, emacs-devel, schwab, adam

[-- Attachment #1: Type: text/plain, Size: 936 bytes --]

On Thu, Aug 22, 2024 at 09:44:04PM +0300, Eli Zaretskii wrote:
> > From: Joseph Turner <joseph@ushin.org>

[...]

> > When decoding, should plz fall back to detect-coding-region instead of utf-8?
> 
> If this is HTML, then I think it is okay to trust the headers about
> the charset and default to UTF-8.  The problem with
> detect-coding-region is that some of it is based on guesswork [...]

Yes, and it's incredibly crude guesswork at times. Talk to the server
admin.

With HTML and friends, you get one or two layers of fun, because they
can declare the encoding /within/ the stream (HTML in two different
ways, at least). If the "outer layer" decides to helpfully recode,
then the inner declarations are lying (I actually had this with HTML
mails: the MIME layer recoded Latin-1 to UTF-8, so the tag
<meta charset="iso-8859-1"> in there was a lie).

Needless to say, html2text made mojibake :-)

Cheers
-- 
t

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: How to get buffer byte length (not number of characters)?
  2024-08-22 18:44                 ` Eli Zaretskii
  2024-08-22 19:32                   ` tomas
@ 2024-08-23  3:56                   ` Joseph Turner
  2024-08-23  7:02                     ` Eli Zaretskii
  2024-08-24  6:14                     ` Joseph Turner
  1 sibling, 2 replies; 40+ messages in thread
From: Joseph Turner @ 2024-08-23  3:56 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel, schwab, adam

Eli Zaretskii <eliz@gnu.org> writes:

>> From: Joseph Turner <joseph@ushin.org>
>> Cc: emacs-devel@gnu.org,  schwab@suse.de,  adam@alphapapa.net
>> Date: Thu, 22 Aug 2024 11:29:48 -0700
>> 
>> Eli Zaretskii <eliz@gnu.org> writes:
>> 
>> > I think it is better to use detect-coding-region instead, if
>> > buffer-file-coding-system is undecided.
>> 
>> detect-coding-region is only useful when decoding text, right?
>
> Yes, sorry.  I should have said find-coding-systems-region.
>
>> For encoding text, should we encode with buffer-file-coding-system?
>
> If you are sure it will do, yes.  But what if the buffer started as
> all-ASCII and then the user or some Lisp program added some non-ASCII
> characters before saving?  Then buffer-file-coding-system is no longer
> pertinent.

I understand.  Thank you!

How do we encode if find-coding-systems-region returns '(undecided)?

>> >> > By "optionally" you mean that it doesn't always happen, except if the
>> >> > caller requests that?  If so, the caller of plz.el should decode the
>> >> > text manually before using it in user-facing features.
>> >> 
>> >> By default, `plz' decodes response body according to the 'Content-Type'
>> >> charset (or utf-8 as fallback).  Passing `:decode nil' stops that.
>> >
>> > Sounds correct.
>> 
>> When decoding, should plz fall back to detect-coding-region instead of utf-8?
>
> If this is HTML, then I think it is okay to trust the headers about
> the charset and default to UTF-8.  The problem with
> detect-coding-region is that some of it is based on guesswork, which
> is one reason why it could take a UTF-8 encoded text to be Latin-1.
> So if a more reliable source of information is available, we had
> better use it.

Tomas says:

> Yes, and it's incredibly crude guesswork at times. Talk to the server
> admin.

With hyperdrive p2p file sharing, there is no server admin.  😉

Ideally, when users PUT a file into a hyperdrive, hyperdrive.el would
encode the buffer with:

(car (find-coding-systems-region (point-min) (point-max)))

and then send the coding system along with the file in the PUT request.
The coding system would be stored with the hyperdrive file metadata, for
other users to load along with the file contents.  On the other end of
the network, hyperdrive.el would use

(decode-coding-region (point-min) (point-max) CODING-FROM-HYPERDRIVE-METADATA)

However AFAIK, there's no specified or de facto standard for storing
coding metadata in a hyperdrive, so this approach requires deliberation
first.  I've made an issue on the `hypercore-fetch` repository:

https://github.com/RangerMauve/hypercore-fetch/issues/100

For now, we'll rely on detect-coding-region for decoding.
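
The encode, store, decode round trip described above could look roughly
like the following.  This is only a sketch: the `my/' helper names are
hypothetical, and the coding system is hard-coded to utf-8 where a real
implementation would read it from the (not yet specified) hyperdrive
file metadata.  It also shows one way to get the byte length of the
encoded text without writing a file.

```elisp
;; Hypothetical helpers for illustration; not part of hyperdrive.el or plz.el.
(defun my/encode-buffer-text (coding)
  "Return the current buffer's text as a unibyte string encoded with CODING."
  (encode-coding-string
   (buffer-substring-no-properties (point-min) (point-max))
   coding))

(defun my/decode-bytes (bytes coding)
  "Decode the unibyte string BYTES using CODING and return the text."
  (decode-coding-string bytes coding))

(with-temp-buffer
  (insert "你好")
  (let* ((coding 'utf-8)        ; assumption: would come from stored metadata
         (bytes (my/encode-buffer-text coding)))
    (list (string-bytes bytes)  ; byte length of the encoded text
          (my/decode-bytes bytes coding))))
;; => (6 "你好")
```

The sender and the receiver only agree on the text if both use the same
CODING, which is why the coding system would have to travel with the file.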

Thanks!

Joseph



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: How to get buffer byte length (not number of characters)?
  2024-08-22 12:47               ` tomas
@ 2024-08-23  6:28                 ` Adam Porter
  0 siblings, 0 replies; 40+ messages in thread
From: Adam Porter @ 2024-08-23  6:28 UTC (permalink / raw)
  To: tomas; +Cc: Joseph Turner, Eli Zaretskii, emacs-devel, schwab

On 8/22/24 07:47, tomas@tuxteam.de wrote:

> Another point I haven't seen in this discussion is that HTTP also may
> carry metadata about what it thinks the content encoding is. This may
> involve the server configuration too.
> 
> You may choose to ignore it, but then you need to convince all the
> moving parts to agree on that (i.e. an Apache on the other side might
> happily tell your client that it is sending "text/plain; charset=UTF-8"
> or something similarly funny (note: UTF-8 isn't a charset :-) depending
> on the web server config, the content of some mimetype database and
> so on).
> 
> You'll have to make sure curl and Emacs do the right thing with that
> (which might be "nothing").

plz optionally decodes response bodies according to the content-type 
header, depending on the :decode argument (although Joseph found a bug 
in the implementation of that, and I'll merge his fix soon).  plz does 
not do anything else, e.g. setting a buffer's coding system variables. 
If :decode is nil, plz does no decoding and leaves it to the user.

For HTML, with its potential for having a META tag that may also specify 
encoding, that would be up to the user; plz offers no special support 
for various content types.

If a user wants to handle these issues manually, the ":as 'response" 
argument to plz should be used, which allows the user to process the 
response headers and body directly.
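
For a caller who does take the manual route, decoding by the
Content-Type charset with a utf-8 fallback might be sketched as below.
This is not plz's own implementation; the `my/' function names are
hypothetical, and the sketch assumes the caller already has the raw
(unibyte) body and the Content-Type header value as strings.

```elisp
;; Hypothetical helpers for illustration; not plz.el's actual code.
(defun my/content-type-charset (content-type)
  "Return the charset parameter of CONTENT-TYPE as a string, or nil."
  (let ((case-fold-search t))
    (when (and content-type
               (string-match "charset=\"?\\([^\"; \t]+\\)" content-type))
      (match-string 1 content-type))))

(defun my/decode-response-body (bytes content-type)
  "Decode unibyte BYTES per CONTENT-TYPE's charset, falling back to utf-8."
  (let* ((charset (my/content-type-charset content-type))
         ;; `intern-soft' avoids creating symbols from untrusted header text.
         (coding (and charset (intern-soft (downcase charset)))))
    (decode-coding-string
     bytes
     (if (and coding (coding-system-p coding)) coding 'utf-8))))

;; Usage:
;; (my/decode-response-body "\351" "text/plain; charset=ISO-8859-1")
```

An HTML META declaration inside the body is deliberately ignored here;
handling that layer would be a separate step on top of this one.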

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: How to get buffer byte length (not number of characters)?
  2024-08-22 13:50               ` Eli Zaretskii
@ 2024-08-23  6:31                 ` Adam Porter
  2024-08-23  6:51                   ` Eli Zaretskii
  2024-08-23  7:07                   ` Joseph Turner
  0 siblings, 2 replies; 40+ messages in thread
From: Adam Porter @ 2024-08-23  6:31 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: joseph, emacs-devel, schwab

On 8/22/24 08:50, Eli Zaretskii wrote:

> But AFAICT, plz.el does decode the stuff it gets from curl, which
> doesn't seem to be consistent with what you say above.  If plz.el
> would accept unibyte text and return unibyte text, that would be
> consistent: it would mean that any callers of plz.el need to do
> encoding and decoding themselves.  But that doesn't seem to be the
> case now.
> 
> Am I missing something?

Yes, the :decode argument to plz.  If :decode is nil (or if ":as 
'binary" is specified, which sets :decode to nil), plz does not decode 
the response body.



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: How to get buffer byte length (not number of characters)?
  2024-08-23  6:31                 ` Adam Porter
@ 2024-08-23  6:51                   ` Eli Zaretskii
  2024-08-23  7:07                   ` Joseph Turner
  1 sibling, 0 replies; 40+ messages in thread
From: Eli Zaretskii @ 2024-08-23  6:51 UTC (permalink / raw)
  To: Adam Porter; +Cc: joseph, emacs-devel, schwab

> Date: Fri, 23 Aug 2024 01:31:16 -0500
> Cc: joseph@ushin.org, emacs-devel@gnu.org, schwab@suse.de
> From: Adam Porter <adam@alphapapa.net>
> 
> On 8/22/24 08:50, Eli Zaretskii wrote:
> 
> > But AFAICT, plz.el does decode the stuff it gets from curl, which
> > doesn't seem to be consistent with what you say above.  If plz.el
> > would accept unibyte text and return unibyte text, that would be
> > consistent: it would mean that any callers of plz.el need to do
> > encoding and decoding themselves.  But that doesn't seem to be the
> > case now.
> > 
> > Am I missing something?
> 
> Yes, the :decode argument to plz.  If :decode is nil (or if ":as 
> 'binary" is specified, which sets :decode to nil), plz does not decode 
> the response body.

That is obviously NOT the part I was missing...



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: How to get buffer byte length (not number of characters)?
  2024-08-23  3:56                   ` Joseph Turner
@ 2024-08-23  7:02                     ` Eli Zaretskii
  2024-08-23  7:37                       ` Joseph Turner
  2024-08-23  7:43                       ` Joseph Turner
  2024-08-24  6:14                     ` Joseph Turner
  1 sibling, 2 replies; 40+ messages in thread
From: Eli Zaretskii @ 2024-08-23  7:02 UTC (permalink / raw)
  To: Joseph Turner; +Cc: emacs-devel, schwab, adam

> From: Joseph Turner <joseph@ushin.org>
> Cc: emacs-devel@gnu.org,  schwab@suse.de,  adam@alphapapa.net
> Date: Thu, 22 Aug 2024 20:56:19 -0700
> 
> Eli Zaretskii <eliz@gnu.org> writes:
> 
> >> For encoding text, should we encode with buffer-file-coding-system?
> >
> > If you are sure it will do, yes.  But what if the buffer started as
> > all-ASCII and then the user or some Lisp program added some non-ASCII
> > characters before saving?  Then buffer-file-coding-system is no longer
> > pertinent.
> 
> I understand.  Thank you!
> 
> How do we encode if find-coding-systems-region returns '(undecided)?

Use buffer-file-coding-system.

If this is an interactive command, you could also use
select-safe-coding-system, which calls find-coding-systems-region
internally, and also has complex logic for finding suitable callbacks
and asking the user to select an encoding if it fails to find
something suitable.  But this is not appropriate in non-interactive
code.
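
For non-interactive code, the order described above could be sketched as
follows.  The function name is hypothetical, and the final utf-8
fallback is an assumption added for the case where
buffer-file-coding-system is itself nil or undecided.

```elisp
;; Hypothetical helper for illustration only.
(defun my/encoding-for-region (start end)
  "Return a coding system for encoding the text between START and END.
Prefer the first result of `find-coding-systems-region'; if that is
`undecided' (ASCII-only text), use `buffer-file-coding-system'; if
that is unset or undecided too, use utf-8 (an assumption)."
  (let ((safe (car (find-coding-systems-region start end)))
        (bfcs buffer-file-coding-system))
    (cond
     ((and safe (not (eq safe 'undecided))) safe)
     ((and bfcs (not (eq (coding-system-base bfcs) 'undecided))) bfcs)
     (t 'utf-8))))

;; Usage:
;; (encode-coding-region (point-min) (point-max)
;;                       (my/encoding-for-region (point-min) (point-max)))
```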



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: How to get buffer byte length (not number of characters)?
  2024-08-23  6:31                 ` Adam Porter
  2024-08-23  6:51                   ` Eli Zaretskii
@ 2024-08-23  7:07                   ` Joseph Turner
  2024-08-23  7:58                     ` Joseph Turner
  1 sibling, 1 reply; 40+ messages in thread
From: Joseph Turner @ 2024-08-23  7:07 UTC (permalink / raw)
  To: Adam Porter; +Cc: Eli Zaretskii, emacs-devel, schwab

[-- Attachment #1: Type: text/plain, Size: 1140 bytes --]

Adam Porter <adam@alphapapa.net> writes:

> On 8/22/24 08:50, Eli Zaretskii wrote:
>
>> But AFAICT, plz.el does decode the stuff it gets from curl, which
>> doesn't seem to be consistent with what you say above.  If plz.el
>> would accept unibyte text and return unibyte text, that would be
>> consistent: it would mean that any callers of plz.el need to do
>> encoding and decoding themselves.  But that doesn't seem to be the
>> case now.
>> Am I missing something?
>
> Yes, the :decode argument to plz.  If :decode is nil (or if ":as
> 'binary" is specified, which sets :decode to nil), plz does not decode
> the response body.

Currently, GET decodes by default while PUT does no encoding by default.
IIUC, the suggestion is that GET and PUT requests either both handle
coding by default or neither does by default.

Currently, PUT requests which pass an unencoded buffer with multibyte
characters send the internal Emacs multibyte representation.
I'd be in favor of adding some automatic encoding handling for PUT
requests so that most users don't have to think about it.

Please see patch!  (not tested yet)

Best,

Joseph


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 0001-Add-plz-ENCODE-argument-Add-default-encoding.patch --]
[-- Type: text/x-diff, Size: 3932 bytes --]

From 6a8f13fa799f4ba8b64effe229379dc54ef19c91 Mon Sep 17 00:00:00 2001
From: Joseph Turner <joseph@breatheoutbreathe.in>
Date: Thu, 22 Aug 2024 00:02:19 -0700
Subject: [PATCH] Add: (plz) ENCODE argument; Add default encoding

Previously, PUT requests which pass an unencoded string or buffer with
multibyte characters sent the internal Emacs multibyte representation.

Now strings and buffers are encoded by default, and the ENCODE nil
argument (or :BODY-TYPE 'binary) can be used when the user wants to
handle encoding.

WIP
---
 plz.el | 28 ++++++++++++++++++++++++----
 1 file changed, 24 insertions(+), 4 deletions(-)

diff --git a/plz.el b/plz.el
index 903d71e..ffbbe0b 100644
--- a/plz.el
+++ b/plz.el
@@ -325,7 +325,8 @@ (defalias 'plz--generate-new-buffer
 
 (cl-defun plz (method url &rest rest &key headers body else filter finally noquery timeout
                       (as 'string) (then 'sync)
-                      (body-type 'text) (decode t decode-s)
+                      (body-type 'text) (encode t encode-s)
+                      (decode t decode-s)
                       (connect-timeout plz-connect-timeout))
   "Request METHOD from URL with curl.
 Return the curl process object or, for a synchronous request, the
@@ -340,6 +341,11 @@ (cl-defun plz (method url &rest rest &key headers body else filter finally noque
 BODY-TYPE may be `text' to send BODY as text, or `binary' to send
 it as binary.
 
+If ENCODE is non-nil, BODY is encoded automatically.  For binary
+content, it should be nil.  When BODY-TYPE is `binary', ENCODE is
+automatically set to nil.  ENCODE has no effect when BODY is a
+list like `(file FILENAME)'.
+
 AS selects the kind of result to pass to the callback function
 THEN, or the kind of result to return for synchronous requests.
 It may be:
@@ -416,6 +422,8 @@ (cl-defun plz (method url &rest rest &key headers body else filter finally noque
   (declare (indent defun))
   (setf decode (if (and decode-s (not decode))
                    nil decode))
+  (setf encode (if (and encode-s (not encode))
+                   nil encode))
   ;; NOTE: By default, for PUT requests and POST requests >1KB, curl sends an
   ;; "Expect:" header, which causes servers to send a "100 Continue" response, which
   ;; we don't want to have to deal with, so we disable it by setting the header to
@@ -473,6 +481,9 @@ (cl-defun plz (method url &rest rest &key headers body else filter finally noque
          (decode (pcase as
                    ('binary nil)
                    (_ decode)))
+         (encode (pcase body-type
+                   ('binary nil)
+                   (_ encode)))
          (default-directory
           ;; Avoid making process in a nonexistent directory (in case the current
           ;; default-directory has since been removed).  It's unclear what the best
@@ -553,9 +564,18 @@ (cl-defun plz (method url &rest rest &key headers body else filter finally noque
     (process-send-string process curl-config)
     (when body
       (cl-typecase body
-        (string (process-send-string process body))
-        (buffer (with-current-buffer body
-                  (process-send-region process (point-min) (point-max))))))
+        (string (process-send-string
+                 process (if encode
+                             (encode-coding-string
+                              body (cdr default-process-coding-system))
+                           body)))
+        (buffer (if encode
+                    (with-temp-buffer
+                      (insert-buffer-substring-no-properties body)
+                      (encode-coding-region (point-min) (point-max) body-coding)
+                      (process-send-region process (point-min) (point-max)))
+                  (with-current-buffer body
+                    (process-send-region process (point-min) (point-max)))))))
     (process-send-eof process)
     (if sync-p
         (unwind-protect
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: How to get buffer byte length (not number of characters)?
  2024-08-23  7:02                     ` Eli Zaretskii
@ 2024-08-23  7:37                       ` Joseph Turner
  2024-08-23 12:34                         ` Eli Zaretskii
  2024-08-23  7:43                       ` Joseph Turner
  1 sibling, 1 reply; 40+ messages in thread
From: Joseph Turner @ 2024-08-23  7:37 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel, schwab, adam

Eli Zaretskii <eliz@gnu.org> writes:

>> From: Joseph Turner <joseph@ushin.org>
>> Cc: emacs-devel@gnu.org,  schwab@suse.de,  adam@alphapapa.net
>> Date: Thu, 22 Aug 2024 20:56:19 -0700
>> 
>> Eli Zaretskii <eliz@gnu.org> writes:
>> 
>> >> For encoding text, should we encode with buffer-file-coding-system?
>> >
>> > If you are sure it will do, yes.  But what if the buffer started as
>> > all-ASCII and then the user or some Lisp program added some non-ASCII
>> > characters before saving?  Then buffer-file-coding-system is no longer
>> > pertinent.
>> 
>> I understand.  Thank you!
>> 
>> How do we encode if find-coding-systems-region returns '(undecided)?
>
> Use buffer-file-coding-system.
>
> If this is an interactive command, you could also use
> select-safe-coding-system, which calls find-coding-systems-region
> internally, and also has complex logic for finding suitable callbacks
> and asking the user to select an encoding if it fails to find
> something suitable.  But this is not appropriate in non-interactive
> code.

Thank you!  If both find-coding-systems-region and
buffer-file-coding-system are undecided, then is it safe to fall back to
utf-8?

I feel grateful for your thorough attention to this topic.

Joseph



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: How to get buffer byte length (not number of characters)?
  2024-08-23  7:02                     ` Eli Zaretskii
  2024-08-23  7:37                       ` Joseph Turner
@ 2024-08-23  7:43                       ` Joseph Turner
  2024-08-23 12:38                         ` Eli Zaretskii
  1 sibling, 1 reply; 40+ messages in thread
From: Joseph Turner @ 2024-08-23  7:43 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel, schwab, adam

Eli Zaretskii <eliz@gnu.org> writes:

>> From: Joseph Turner <joseph@ushin.org>
>> Cc: emacs-devel@gnu.org,  schwab@suse.de,  adam@alphapapa.net
>> Date: Thu, 22 Aug 2024 20:56:19 -0700
>> 
>> Eli Zaretskii <eliz@gnu.org> writes:
>> 
>> >> For encoding text, should we encode with buffer-file-coding-system?
>> >
>> > If you are sure it will do, yes.  But what if the buffer started as
>> > all-ASCII and then the user or some Lisp program added some non-ASCII
>> > characters before saving?  Then buffer-file-coding-system is no longer
>> > pertinent.
>> 
>> I understand.  Thank you!
>> 
>> How do we encode if find-coding-systems-region returns '(undecided)?
>
> Use buffer-file-coding-system.
>
> If this is an interactive command, you could also use
> select-safe-coding-system, which calls find-coding-systems-region
> internally, and also has complex logic for finding suitable callbacks
> and asking the user to select an encoding if it fails to find
> something suitable.  But this is not appropriate in non-interactive
> code.

I'm surprised that

(with-temp-buffer
  (insert "你好")
  (set-buffer-file-coding-system 'chinese-big5)
  (car (find-coding-systems-region (point-min) (point-max))))

returns 'utf-8 and not 'chinese-big5.  Are the codings intended to be
ordered by priority?  If so, should buffer-file-coding-system be at the
front of the list if it's safe?
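
Whatever the intended ordering is, a caller who wants the buffer's own
coding system to win whenever it is safe can check that explicitly.  A
minimal sketch, with a hypothetical function name, using
`unencodable-char-position' to test whether buffer-file-coding-system
can encode the region:

```elisp
;; Hypothetical helper for illustration only.
(defun my/prefer-buffer-coding (start end)
  "Return `buffer-file-coding-system' if it can encode START..END.
Otherwise return the first result of `find-coding-systems-region'."
  (let ((bfcs buffer-file-coding-system))
    (if (and bfcs
             (not (eq (coding-system-base bfcs) 'undecided))
             ;; nil means every character in the region is encodable.
             (null (unencodable-char-position start end bfcs)))
        bfcs
      (car (find-coding-systems-region start end)))))

;; Usage:
;; (with-temp-buffer
;;   (insert "你好")
;;   (set-buffer-file-coding-system 'chinese-big5)
;;   (my/prefer-buffer-coding (point-min) (point-max)))
```

For ASCII-only text with no usable buffer-file-coding-system this still
returns `undecided', so a caller would need a further fallback there.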



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: How to get buffer byte length (not number of characters)?
  2024-08-23  7:07                   ` Joseph Turner
@ 2024-08-23  7:58                     ` Joseph Turner
  0 siblings, 0 replies; 40+ messages in thread
From: Joseph Turner @ 2024-08-23  7:58 UTC (permalink / raw)
  To: Adam Porter; +Cc: Eli Zaretskii, emacs-devel, schwab

[-- Attachment #1: Type: text/plain, Size: 134 bytes --]

Joseph Turner <joseph@ushin.org> writes:

> Please see patch!  (not tested yet)

I made a typo in the last patch.  Here's a new one.


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 0001-Add-plz-ENCODE-argument-Add-default-encoding.patch --]
[-- Type: text/x-diff, Size: 3981 bytes --]

From 9ff971c6bbf00ebfe33a6e8993a006a168b4c6cb Mon Sep 17 00:00:00 2001
From: Joseph Turner <joseph@breatheoutbreathe.in>
Date: Thu, 22 Aug 2024 00:02:19 -0700
Subject: [PATCH] Add: (plz) ENCODE argument; Add default encoding

Previously, PUT requests which pass an unencoded string or buffer with
multibyte characters sent the internal Emacs multibyte representation.

Now strings and buffers are encoded by default, and the ENCODE nil
argument (or :BODY-TYPE 'binary) can be used when the user wants to
handle encoding.

WIP
---
 plz.el | 29 +++++++++++++++++++++++++----
 1 file changed, 25 insertions(+), 4 deletions(-)

diff --git a/plz.el b/plz.el
index 903d71e..2a2077d 100644
--- a/plz.el
+++ b/plz.el
@@ -325,7 +325,8 @@ (defalias 'plz--generate-new-buffer
 
 (cl-defun plz (method url &rest rest &key headers body else filter finally noquery timeout
                       (as 'string) (then 'sync)
-                      (body-type 'text) (decode t decode-s)
+                      (body-type 'text) (encode t encode-s)
+                      (decode t decode-s)
                       (connect-timeout plz-connect-timeout))
   "Request METHOD from URL with curl.
 Return the curl process object or, for a synchronous request, the
@@ -340,6 +341,11 @@ (cl-defun plz (method url &rest rest &key headers body else filter finally noque
 BODY-TYPE may be `text' to send BODY as text, or `binary' to send
 it as binary.
 
+If ENCODE is non-nil, BODY is encoded automatically.  For binary
+content, it should be nil.  When BODY-TYPE is `binary', ENCODE is
+automatically set to nil.  ENCODE has no effect when BODY is a
+list like `(file FILENAME)'.
+
 AS selects the kind of result to pass to the callback function
 THEN, or the kind of result to return for synchronous requests.
 It may be:
@@ -416,6 +422,8 @@ (cl-defun plz (method url &rest rest &key headers body else filter finally noque
   (declare (indent defun))
   (setf decode (if (and decode-s (not decode))
                    nil decode))
+  (setf encode (if (and encode-s (not encode))
+                   nil encode))
   ;; NOTE: By default, for PUT requests and POST requests >1KB, curl sends an
   ;; "Expect:" header, which causes servers to send a "100 Continue" response, which
   ;; we don't want to have to deal with, so we disable it by setting the header to
@@ -473,6 +481,9 @@ (cl-defun plz (method url &rest rest &key headers body else filter finally noque
          (decode (pcase as
                    ('binary nil)
                    (_ decode)))
+         (encode (pcase body-type
+                   ('binary nil)
+                   (_ encode)))
          (default-directory
           ;; Avoid making process in a nonexistent directory (in case the current
           ;; default-directory has since been removed).  It's unclear what the best
@@ -553,9 +564,19 @@ (cl-defun plz (method url &rest rest &key headers body else filter finally noque
     (process-send-string process curl-config)
     (when body
       (cl-typecase body
-        (string (process-send-string process body))
-        (buffer (with-current-buffer body
-                  (process-send-region process (point-min) (point-max))))))
+        (string (process-send-string
+                 process (if encode
+                             (encode-coding-string
+                              body (cdr default-process-coding-system))
+                           body)))
+        (buffer (if encode
+                    (with-temp-buffer
+                      (insert-buffer-substring-no-properties body)
+                      (encode-coding-region
+                       (point-min) (point-max) (cdr default-process-coding-system))
+                      (process-send-region process (point-min) (point-max)))
+                  (with-current-buffer body
+                    (process-send-region process (point-min) (point-max)))))))
     (process-send-eof process)
     (if sync-p
         (unwind-protect
-- 
2.41.0
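
The string branch of the patch above relies on encode-coding-string
turning a multibyte string into a unibyte one, whose length is then a
byte count.  A minimal sketch of that idea follows; the helper name
my/string-byte-length is hypothetical and is not part of plz.el, and
the coding system is passed explicitly here rather than taken from
default-process-coding-system as the patch does.

```elisp
(defun my/string-byte-length (string coding-system)
  "Return the number of bytes STRING occupies when encoded with CODING-SYSTEM.
Hypothetical helper, for illustration only."
  ;; `encode-coding-string' returns a unibyte string, so `length'
  ;; counts bytes rather than characters.
  (length (encode-coding-string string coding-system)))

(length "你好")                        ; ⇒ 2 (characters)
(my/string-byte-length "你好" 'utf-8)  ; ⇒ 6 (bytes)
```

Other coding systems will generally give a different byte count for
the same text, which is why the choice of coding system matters.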



* Re: How to get buffer byte length (not number of characters)?
  2024-08-23  7:37                       ` Joseph Turner
@ 2024-08-23 12:34                         ` Eli Zaretskii
  0 siblings, 0 replies; 40+ messages in thread
From: Eli Zaretskii @ 2024-08-23 12:34 UTC (permalink / raw)
  To: Joseph Turner; +Cc: emacs-devel, schwab, adam

> From: Joseph Turner <joseph@ushin.org>
> Cc: emacs-devel@gnu.org,  schwab@suse.de,  adam@alphapapa.net
> Date: Fri, 23 Aug 2024 00:37:33 -0700
> 
> Eli Zaretskii <eliz@gnu.org> writes:
> 
> >> How do we encode if find-coding-systems-region returns '(undecided)?
> >
> > Use buffer-file-coding-system.
> >
> > If this is an interactive command, you could also use
> > select-safe-coding-system, which calls find-coding-systems-region
> > internally, and also has complex logic for finding suitable callbacks
> > and asking the user to select an encoding if it fails to find
> > something suitable.  But this is not appropriate in non-interactive
> > code.
> 
> Thank you!  If both find-coding-systems-region and
> buffer-file-coding-system are undecided, then is it safe to fall back to
> utf-8?

Yes.




* Re: How to get buffer byte length (not number of characters)?
  2024-08-23  7:43                       ` Joseph Turner
@ 2024-08-23 12:38                         ` Eli Zaretskii
  2024-08-23 16:59                           ` Joseph Turner
  0 siblings, 1 reply; 40+ messages in thread
From: Eli Zaretskii @ 2024-08-23 12:38 UTC (permalink / raw)
  To: Joseph Turner; +Cc: emacs-devel, schwab, adam

> From: Joseph Turner <joseph@ushin.org>
> Cc: emacs-devel@gnu.org,  schwab@suse.de,  adam@alphapapa.net
> Date: Fri, 23 Aug 2024 00:43:52 -0700
> 
> I'm surprised that
> 
> (with-temp-buffer
>   (insert "你好")
>   (set-buffer-file-coding-system 'chinese-big5)
>   (car (find-coding-systems-region (point-min) (point-max))))
> 
> returns 'utf-8 and not 'chinese-big5.

What does coding-system-priority-list return in your case?

> Are the codings intended to be
> ordered by priority?

Yes.

> If so, should buffer-file-coding-system be at the front of the list
> if it's safe?

How do you know it's safe?

If your application needs to prefer buffer-file-coding-system, then
you should see if buffer-file-coding-system is a member of the list
returned by find-coding-systems-region, and if so, use that.
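
A minimal sketch of that check for non-interactive code.  The function
name my/preferred-coding-system is hypothetical.  Comparing through
coding-system-base is an assumption of the sketch, made because
find-coding-systems-region appears to return base coding systems
(e.g. utf-8) while buffer-file-coding-system usually carries an EOL
variant (e.g. utf-8-unix).  The utf-8 fallback follows the answer given
earlier in this thread.

```elisp
(defun my/preferred-coding-system ()
  "Return a coding system for encoding the current buffer's text.
Prefer `buffer-file-coding-system' when it is safe for the text;
otherwise use the highest-priority safe coding system.
Hypothetical helper, for illustration only."
  (let* ((safe (find-coding-systems-region (point-min) (point-max)))
         (bfcs buffer-file-coding-system)
         (base (and bfcs (coding-system-base bfcs))))
    (cond
     ;; For ASCII-only text the safe list is (undecided): any
     ;; ASCII-compatible coding system will do.
     ((equal safe '(undecided))
      (if (and base (not (eq base 'undecided))) bfcs 'utf-8))
     ;; buffer-file-coding-system can encode the text: prefer it.
     ((and base (memq base safe)) bfcs)
     ;; Otherwise take the first (highest-priority) safe coding system.
     (t (or (car safe) 'utf-8)))))
```

Unlike select-safe-coding-system, this never prompts the user, so it
can be called from process filters or batch code.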




* Re: How to get buffer byte length (not number of characters)?
  2024-08-23 12:38                         ` Eli Zaretskii
@ 2024-08-23 16:59                           ` Joseph Turner
  2024-08-23 17:35                             ` Eli Zaretskii
  0 siblings, 1 reply; 40+ messages in thread
From: Joseph Turner @ 2024-08-23 16:59 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel, schwab, adam

Eli Zaretskii <eliz@gnu.org> writes:

>> From: Joseph Turner <joseph@ushin.org>
>> Cc: emacs-devel@gnu.org,  schwab@suse.de,  adam@alphapapa.net
>> Date: Fri, 23 Aug 2024 00:43:52 -0700
>> 
>> I'm surprised that
>> 
>> (with-temp-buffer
>>   (insert "你好")
>>   (set-buffer-file-coding-system 'chinese-big5)
>>   (car (find-coding-systems-region (point-min) (point-max))))
>> 
>> returns 'utf-8 and not 'chinese-big5.
>
> What does coding-system-priority-list return in your case?

'utf-8

>> Are the codings intended to be
>> ordered by priority?
>
> Yes.
>
>> If so, should buffer-file-coding-system be at the front of the list
>> if it's safe?
>
> How do you know it's safe?
>
> If your application needs to prefer buffer-file-coding-system, then
> you should see if buffer-file-coding-system is a member of the list
> returned by find-coding-systems-region, and if so, use that.

I'd have thought that most applications would want to prefer
buffer-file-coding-system if it's a member of the list returned by
find-coding-systems-region, but perhaps not.

I now have a clear path forward for hyperdrive.el.

Thank you for your time, Eli!!

Joseph





* Re: How to get buffer byte length (not number of characters)?
  2024-08-23 16:59                           ` Joseph Turner
@ 2024-08-23 17:35                             ` Eli Zaretskii
  2024-08-23 20:37                               ` Joseph Turner
  0 siblings, 1 reply; 40+ messages in thread
From: Eli Zaretskii @ 2024-08-23 17:35 UTC (permalink / raw)
  To: Joseph Turner; +Cc: emacs-devel, schwab, adam

> From: Joseph Turner <joseph@ushin.org>
> Cc: emacs-devel@gnu.org,  schwab@suse.de,  adam@alphapapa.net
> Date: Fri, 23 Aug 2024 09:59:22 -0700
> 
> Eli Zaretskii <eliz@gnu.org> writes:
> 
> >> From: Joseph Turner <joseph@ushin.org>
> >> Cc: emacs-devel@gnu.org,  schwab@suse.de,  adam@alphapapa.net
> >> Date: Fri, 23 Aug 2024 00:43:52 -0700
> >> 
> >> I'm surprised that
> >> 
> >> (with-temp-buffer
> >>   (insert "你好")
> >>   (set-buffer-file-coding-system 'chinese-big5)
> >>   (car (find-coding-systems-region (point-min) (point-max))))
> >> 
> >> returns 'utf-8 and not 'chinese-big5.
> >
> > What does coding-system-priority-list return in your case?
> 
> 'utf-8

That explains what you see, then.

> >> Are the codings intended to be
> >> ordered by priority?
> >
> > Yes.
> >
> >> If so, should buffer-file-coding-system be at the front of the list
> >> if it's safe?
> >
> > How do you know it's safe?
> >
> > If your application needs to prefer buffer-file-coding-system, then
> > you should see if buffer-file-coding-system is a member of the list
> > returned by find-coding-systems-region, and if so, use that.
> 
> I'd have thought that most applications would want to prefer
> buffer-file-coding-system if it's a member of the list returned by
> find-coding-systems-region, but perhaps not.

Most applications use select-safe-coding-system, which AFAIR already
does all that.




* Re: How to get buffer byte length (not number of characters)?
  2024-08-23 17:35                             ` Eli Zaretskii
@ 2024-08-23 20:37                               ` Joseph Turner
  0 siblings, 0 replies; 40+ messages in thread
From: Joseph Turner @ 2024-08-23 20:37 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel, schwab, adam

Eli Zaretskii <eliz@gnu.org> writes:

>> From: Joseph Turner <joseph@ushin.org>
>> Cc: emacs-devel@gnu.org,  schwab@suse.de,  adam@alphapapa.net
>> Date: Fri, 23 Aug 2024 09:59:22 -0700
>> 
>> Eli Zaretskii <eliz@gnu.org> writes:
>> 
>> >> From: Joseph Turner <joseph@ushin.org>
>> >> Cc: emacs-devel@gnu.org,  schwab@suse.de,  adam@alphapapa.net
>> >> Date: Fri, 23 Aug 2024 00:43:52 -0700
>> >> 
>> >> I'm surprised that
>> >> 
>> >> (with-temp-buffer
>> >>   (insert "你好")
>> >>   (set-buffer-file-coding-system 'chinese-big5)
>> >>   (car (find-coding-systems-region (point-min) (point-max))))
>> >> 
>> >> returns 'utf-8 and not 'chinese-big5.
>> >
>> > What does coding-system-priority-list return in your case?
>> 
>> 'utf-8
>
> That explains what you see, then.
>
>> >> Are the codings intended to be
>> >> ordered by priority?
>> >
>> > Yes.
>> >
>> >> If so, should buffer-file-coding-system be at the front of the list
>> >> if it's safe?
>> >
>> > How do you know it's safe?
>> >
>> > If your application needs to prefer buffer-file-coding-system, then
>> > you should see if buffer-file-coding-system is a member of the list
>> > returned by find-coding-systems-region, and if so, use that.
>> 
>> I'd have thought that most applications would want to prefer
>> buffer-file-coding-system if it's a member of the list returned by
>> find-coding-systems-region, but perhaps not.
>
> Most applications use select-safe-coding-system, which AFAIR already
> does all that.

Fantastic!  Yes, it appears that select-safe-coding-system does account
for buffer-file-coding-system.  So, hyperdrive.el can just encode with

(select-safe-coding-system (point-min) (point-max) nil nil FILENAME)

Simple.

Thank you!!

Joseph
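
A rough sketch of how the call above could feed the actual encoding
step.  The helper name my/buffer-encoded-bytes is hypothetical and is
not taken from hyperdrive.el.  As noted earlier in the thread,
select-safe-coding-system may ask the user to pick an encoding when it
finds nothing suitable, so this is only appropriate in interactive
commands.

```elisp
(defun my/buffer-encoded-bytes (&optional filename)
  "Return the current buffer's text encoded as a unibyte string.
The coding system is chosen by `select-safe-coding-system', which
may prompt the user.  FILENAME is passed through as its FILE argument.
Hypothetical helper, for illustration only."
  (let ((coding (select-safe-coding-system
                 (point-min) (point-max) nil nil filename)))
    ;; With DESTINATION t, `encode-coding-region' returns the encoded
    ;; text as a string and leaves the buffer text unchanged.
    (encode-coding-region (point-min) (point-max) coding t)))
```

The length of the returned string is the number of bytes that would be
sent or written with that coding system.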




* Re: How to get buffer byte length (not number of characters)?
  2024-08-23  3:56                   ` Joseph Turner
  2024-08-23  7:02                     ` Eli Zaretskii
@ 2024-08-24  6:14                     ` Joseph Turner
  1 sibling, 0 replies; 40+ messages in thread
From: Joseph Turner @ 2024-08-24  6:14 UTC (permalink / raw)
  To: emacs-devel; +Cc: schwab, adam

Joseph Turner <joseph@ushin.org> writes:
>
> However AFAIK, there's no specified or de facto standard for storing
> coding metadata in a hyperdrive, so this approach requires deliberation
> first.  I've made an issue on the `hypercore-fetch` repository:
>
> https://github.com/RangerMauve/hypercore-fetch/issues/100

If you're interested, here's a similar issue on the holepunch hyperdrive
tracker about a standard for storing hyperdrive file encoding metadata:

https://github.com/holepunchto/hyperdrive/issues/372

Thanks!

Joseph




* Re: How to get buffer byte length (not number of characters)?
  2024-08-20 11:15 ` Eli Zaretskii
  2024-08-21  9:20   ` Joseph Turner
@ 2024-08-26  6:37   ` Joseph Turner
  2024-08-26  6:49     ` Joseph Turner
  2024-08-26 11:20     ` Eli Zaretskii
  1 sibling, 2 replies; 40+ messages in thread
From: Joseph Turner @ 2024-08-26  6:37 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

Eli Zaretskii <eliz@gnu.org> writes:

> you need to consider the encoding process: Emacs always encodes the
> buffer text on output to the external world.  If this is what you
> want, then you need to use bufferpos-to-filepos, and make sure you
> pass the correct coding-system argument to it.

Will the following code ever signal an error?

(bufferpos-to-filepos
 (point-max) 'exact
 (select-safe-coding-system (point-min) (point-max)))

The `bufferpos-to-filepos' docstring says, "It is an error to request
the ‘exact’ method when the buffer’s EOL format is not yet decided."
    
IOW, does `select-safe-coding-system' always return an encoding which
specifies EOL conversion?

Thank you!!

Joseph




* Re: How to get buffer byte length (not number of characters)?
  2024-08-26  6:37   ` Joseph Turner
@ 2024-08-26  6:49     ` Joseph Turner
  2024-08-26 11:22       ` Eli Zaretskii
  2024-08-26 11:20     ` Eli Zaretskii
  1 sibling, 1 reply; 40+ messages in thread
From: Joseph Turner @ 2024-08-26  6:49 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

Joseph Turner <joseph@ushin.org> writes:

> Eli Zaretskii <eliz@gnu.org> writes:
>
>> you need to consider the encoding process: Emacs always encodes the
>> buffer text on output to the external world.  If this is what you
>> want, then you need to use bufferpos-to-filepos, and make sure you
>> pass the correct coding-system argument to it.
>
> Will the following code ever signal an error?
>
> (bufferpos-to-filepos
>  (point-max) 'exact
>  (select-safe-coding-system (point-min) (point-max)))
>
> The `bufferpos-to-filepos' docstring says, "It is an error to request
> the ‘exact’ method when the buffer’s EOL format is not yet decided."
>     
> IOW, does `select-safe-coding-system' always return an encoding which
> specifies EOL conversion?

Let me rephrase: I would like to get the size of a buffer's text encoded
with the return value of select-safe-coding-system, which may return an
encoding which does not specify EOL conversion.  Is there any way to
calculate the `exact' buffer text size using bufferpos-to-filepos?

Or is `approximate' the only viable argument in this case?

Thanks!

Joseph




* Re: How to get buffer byte length (not number of characters)?
  2024-08-26  6:37   ` Joseph Turner
  2024-08-26  6:49     ` Joseph Turner
@ 2024-08-26 11:20     ` Eli Zaretskii
  1 sibling, 0 replies; 40+ messages in thread
From: Eli Zaretskii @ 2024-08-26 11:20 UTC (permalink / raw)
  To: Joseph Turner; +Cc: emacs-devel

> From: Joseph Turner <joseph@ushin.org>
> Cc: emacs-devel@gnu.org
> Date: Sun, 25 Aug 2024 23:37:49 -0700
> 
> IOW, does `select-safe-coding-system' always return an encoding which
> specifies EOL conversion?

Yes, provided that either buffer-file-coding-system or its default
value specify a particular EOL conversion.




* Re: How to get buffer byte length (not number of characters)?
  2024-08-26  6:49     ` Joseph Turner
@ 2024-08-26 11:22       ` Eli Zaretskii
  2024-08-27  4:48         ` Joseph Turner
  0 siblings, 1 reply; 40+ messages in thread
From: Eli Zaretskii @ 2024-08-26 11:22 UTC (permalink / raw)
  To: Joseph Turner; +Cc: emacs-devel

> From: Joseph Turner <joseph@ushin.org>
> Cc: emacs-devel@gnu.org
> Date: Sun, 25 Aug 2024 23:49:52 -0700
> 
> Joseph Turner <joseph@ushin.org> writes:
> 
> > Eli Zaretskii <eliz@gnu.org> writes:
> >
> >> you need to consider the encoding process: Emacs always encodes the
> >> buffer text on output to the external world.  If this is what you
> >> want, then you need to use bufferpos-to-filepos, and make sure you
> >> pass the correct coding-system argument to it.
> >
> > Will the following code ever signal an error?
> >
> > (bufferpos-to-filepos
> >  (point-max) 'exact
> >  (select-safe-coding-system (point-min) (point-max)))
> >
> > The `bufferpos-to-filepos' docstring says, "It is an error to request
> > the ‘exact’ method when the buffer’s EOL format is not yet decided."
> >     
> > IOW, does `select-safe-coding-system' always return an encoding which
> > specifies EOL conversion?
> 
> Let me rephrase: I would like to get the size of a buffer's text encoded
> with the return value of select-safe-coding-system, which may return an
> encoding which does not specify EOL conversion.  Is there any way to
> calculate the `exact' buffer text size using bufferpos-to-filepos?
> 
> Or is `approximate' the only viable argument in this case?

Unless you must deal with exotic encodings (like iso-2022 and its
derivatives), I suggest always using 'approximate'.
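
A minimal sketch of that suggestion.  The helper name
my/approximate-encoded-size is hypothetical.  Pinning an undecided EOL
conversion to unix before calling bufferpos-to-filepos is an assumption
of this sketch, not something stated in this thread; callers that expect
CRLF output would pass a -dos coding system instead.

```elisp
(defun my/approximate-encoded-size (coding-system)
  "Return the approximate byte size of the current buffer's text.
The size is what the text would occupy when encoded with CODING-SYSTEM.
Hypothetical helper, for illustration only."
  (let ((cs (if (vectorp (coding-system-eol-type coding-system))
                ;; EOL conversion not decided yet: assume unix (LF).
                (coding-system-change-eol-conversion coding-system 'unix)
              coding-system)))
    ;; The value is a 0-based byte offset, so asking for `point-max'
    ;; yields the size of the whole accessible text.
    (bufferpos-to-filepos (point-max) 'approximate cs)))

(with-temp-buffer
  (insert "你好")
  (my/approximate-encoded-size 'utf-8)) ; ⇒ 6
```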




* Re: How to get buffer byte length (not number of characters)?
  2024-08-26 11:22       ` Eli Zaretskii
@ 2024-08-27  4:48         ` Joseph Turner
  0 siblings, 0 replies; 40+ messages in thread
From: Joseph Turner @ 2024-08-27  4:48 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

Eli Zaretskii <eliz@gnu.org> writes:

>> From: Joseph Turner <joseph@ushin.org>
>> Cc: emacs-devel@gnu.org
>> Date: Sun, 25 Aug 2024 23:49:52 -0700
>> 
>> Joseph Turner <joseph@ushin.org> writes:
>> 
>> > Eli Zaretskii <eliz@gnu.org> writes:
>> >
>> >> you need to consider the encoding process: Emacs always encodes the
>> >> buffer text on output to the external world.  If this is what you
>> >> want, then you need to use bufferpos-to-filepos, and make sure you
>> >> pass the correct coding-system argument to it.
>> >
>> > Will the following code ever signal an error?
>> >
>> > (bufferpos-to-filepos
>> >  (point-max) 'exact
>> >  (select-safe-coding-system (point-min) (point-max)))
>> >
>> > The `bufferpos-to-filepos' docstring says, "It is an error to request
>> > the ‘exact’ method when the buffer’s EOL format is not yet decided."
>> >     
>> > IOW, does `select-safe-coding-system' always return an encoding which
>> > specifies EOL conversion?
>> 
>> Let me rephrase: I would like to get the size of a buffer's text encoded
>> with the return value of select-safe-coding-system, which may return an
>> encoding which does not specify EOL conversion.  Is there any way to
>> calculate the `exact' buffer text size using bufferpos-to-filepos?
>> 
>> Or is `approximate' the only viable argument in this case?
>
> Unless you must deal with exotic encodings (like iso-2022 and its
> derivatives), I suggest always using 'approximate'.

Thank you!  I will do that.




