* How to get buffer byte length (not number of characters)?
@ 2024-08-20  7:10 Joseph Turner
  2024-08-20  7:51 ` Joseph Turner
                   ` (2 more replies)
  0 siblings, 3 replies; 40+ messages in thread
From: Joseph Turner @ 2024-08-20  7:10 UTC (permalink / raw)
  To: Emacs Devel Mailing List

Hello!

`buffer-size' returns the number of characters in a buffer:

(with-temp-buffer
  (insert "你好")
  (buffer-size)) ;; 2

However, the buffer's byte length may be larger:

(let* ((filename (make-temp-file "buffer-size-test-"))
       (file (with-temp-file filename (insert "你好"))))
  (file-attribute-size (file-attributes filename))) ;; 6

How can I get a buffer's byte length without writing to a file?

Thank you!

Joseph

^ permalink raw reply	[flat|nested] 40+ messages in thread
* Re: How to get buffer byte length (not number of characters)?
  2024-08-20  7:10 How to get buffer byte length (not number of characters)? Joseph Turner
@ 2024-08-20  7:51 ` Joseph Turner
  2024-08-20 11:20   ` Eli Zaretskii
  2024-08-20 11:15 ` Eli Zaretskii
  2024-08-20 11:24 ` Andreas Schwab
  2 siblings, 1 reply; 40+ messages in thread
From: Joseph Turner @ 2024-08-20  7:51 UTC (permalink / raw)
  To: Emacs Devel Mailing List

Joseph Turner <joseph@ushin.org> writes:

> How can I get a buffer's byte length without writing to a file?

This seems to work:

(with-temp-buffer
  (insert "你好")
  (set-buffer-multibyte nil)
  (buffer-size)) ;; 6

although, curiously, this does not:

(with-temp-buffer
  (set-buffer-multibyte nil)
  (insert "你好")
  (buffer-size)) ;; 2

Is the `set-buffer-multibyte' approach the best solution?

If I have a multibyte string and I want the byte length, do I need to
insert it into a buffer and perform the same dance as above?

Thank you!

Joseph
* Re: How to get buffer byte length (not number of characters)?
  2024-08-20  7:51 ` Joseph Turner
@ 2024-08-20 11:20   ` Eli Zaretskii
  0 siblings, 0 replies; 40+ messages in thread
From: Eli Zaretskii @ 2024-08-20 11:20 UTC (permalink / raw)
  To: Joseph Turner; +Cc: emacs-devel

> From: Joseph Turner <joseph@ushin.org>
> Date: Tue, 20 Aug 2024 00:51:18 -0700
>
> Joseph Turner <joseph@ushin.org> writes:
>
> > How can I get a buffer's byte length without writing to a file?
>
> This seems to work:
>
> (with-temp-buffer
>   (insert "你好")
>   (set-buffer-multibyte nil)
>   (buffer-size)) ;; 6
>
> although, curiously, this does not:
>
> (with-temp-buffer
>   (set-buffer-multibyte nil)
>   (insert "你好")
>   (buffer-size)) ;; 2
>
> Is the `set-buffer-multibyte' approach the best solution?

No, as you already discovered.  Unibyte buffers and strings are messy
and full of surprises, so my suggestion is to stay away from them as
much as you can.

> If I have a multibyte string and I want the byte length, do I need to
> insert it into a buffer and perform the same dance as above?

No, you can use string-bytes instead.  But again: whether the result
is useful for the needs which triggered these questions is uncertain,
and my crystal ball says that this is not what you want.  For example,
raw bytes sometimes take 2 bytes in the internal Emacs representation,
something that will get in the way of most uses of these results.

So please tell more about the background and the context of these
questions.
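The difference between the internal byte count and an encoded byte
count can be sketched as follows.  This is only an illustration:
`my/string-encoded-length' is a hypothetical helper name, not something
from the thread or from Emacs itself.

```elisp
;; Sketch: byte length of a string, internal vs. encoded.
;; `my/string-encoded-length' is a hypothetical helper for illustration.
(defun my/string-encoded-length (string coding-system)
  "Return the number of bytes STRING occupies once encoded with CODING-SYSTEM."
  ;; `encode-coding-string' returns a unibyte string, so `length'
  ;; counts bytes rather than characters.
  (length (encode-coding-string string coding-system)))

(string-bytes "你好")                        ; => 6, internal representation
(my/string-encoded-length "你好" 'utf-8)     ; => 6, bytes as UTF-8
(my/string-encoded-length "你好" 'utf-16le)  ; => 4, UTF-16 without a BOM

;; A raw byte kept in a multibyte string takes 2 bytes internally,
;; which is why `string-bytes' alone can mislead:
(string-bytes (string-to-multibyte (unibyte-string #xff))) ; => 2
```

The internal count happens to match UTF-8 for ordinary text, but not
for raw bytes and not for any other target encoding.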
* Re: How to get buffer byte length (not number of characters)?
  2024-08-20  7:10 How to get buffer byte length (not number of characters)? Joseph Turner
  2024-08-20  7:51 ` Joseph Turner
@ 2024-08-20 11:15 ` Eli Zaretskii
  2024-08-21  9:20   ` Joseph Turner
  2024-08-26  6:37   ` Joseph Turner
  2024-08-20 11:24 ` Andreas Schwab
  2 siblings, 2 replies; 40+ messages in thread
From: Eli Zaretskii @ 2024-08-20 11:15 UTC (permalink / raw)
  To: Joseph Turner; +Cc: emacs-devel

> From: Joseph Turner <joseph@ushin.org>
> Date: Tue, 20 Aug 2024 00:10:50 -0700
>
> Hello!
>
> `buffer-size' returns the number of characters in a buffer:
>
> (with-temp-buffer
>   (insert "你好")
>   (buffer-size)) ;; 2
>
> However, the buffer's byte length may be larger:
>
> (let* ((filename (make-temp-file "buffer-size-test-"))
>        (file (with-temp-file filename (insert "你好"))))
>   (file-attribute-size (file-attributes filename))) ;; 6
>
> How can I get a buffer's byte length without writing to a file?

This depends on why you need the byte length of the buffer.

If I interpret your question literally, then this is the answer:

  (position-bytes (point-max))

perhaps preceded by a call to 'widen'.  But that returns the number of
bytes that the buffer's characters take when represented in the
internal Emacs representation of characters, which is not necessarily
useful to Lisp programs.  For example, if you need to know how many
bytes Emacs will write to a file if you save the buffer, or to a
network connection or a sub-process if you send the buffer there, then
you need to consider the encoding process: Emacs always encodes the
buffer text on output to the external world.  If this is what you
want, then you need to use bufferpos-to-filepos, and make sure you
pass the correct coding-system argument to it.

If you need this for something else, please tell the details.
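A minimal sketch of the two measurements described above, assuming the
text should be measured as UTF-8 with Unix line endings.  Both function
names are hypothetical.  Note that byte positions, like character
positions, start at 1, so the sketch subtracts the starting position
rather than using `(position-bytes (point-max))' directly.

```elisp
;; Hypothetical helpers; the names are not from the thread.
(defun my/buffer-internal-byte-length (&optional buffer)
  "Return the size in bytes of BUFFER's text in Emacs's internal representation."
  (with-current-buffer (or buffer (current-buffer))
    (save-restriction
      (widen)
      ;; Byte positions are 1-based, so take the difference.
      (- (position-bytes (point-max)) (position-bytes (point-min))))))

(defun my/buffer-encoded-byte-length (coding-system &optional buffer)
  "Return the byte offset of BUFFER's end when encoded with CODING-SYSTEM.
`bufferpos-to-filepos' returns a 0-based byte offset, so the offset of
`point-max' is the encoded length."
  (with-current-buffer (or buffer (current-buffer))
    (save-restriction
      (widen)
      (bufferpos-to-filepos (point-max) 'exact coding-system))))

(with-temp-buffer
  (insert "你好")
  (list (buffer-size)                                  ; characters
        (my/buffer-internal-byte-length)               ; internal bytes
        (my/buffer-encoded-byte-length 'utf-8-unix)))  ; encoded bytes
;; => (2 6 6)
```

Passing an explicit coding system with a fixed end-of-line convention
(here `utf-8-unix') avoids depending on whatever
`buffer-file-coding-system' happens to be in the buffer.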
* Re: How to get buffer byte length (not number of characters)? 2024-08-20 11:15 ` Eli Zaretskii @ 2024-08-21 9:20 ` Joseph Turner 2024-08-21 17:47 ` Eli Zaretskii 2024-08-22 7:09 ` Andreas Schwab 2024-08-26 6:37 ` Joseph Turner 1 sibling, 2 replies; 40+ messages in thread From: Joseph Turner @ 2024-08-21 9:20 UTC (permalink / raw) To: Eli Zaretskii; +Cc: emacs-devel, Andreas Schwab, Adam Porter Eli Zaretskii <eliz@gnu.org> writes: >> From: Joseph Turner <joseph@ushin.org> >> Date: Tue, 20 Aug 2024 00:10:50 -0700 >> >> How can I get a buffer's byte length without writing to a file? > > This depends on why do you need the byte length of the buffer. > > If I interpret your question literally, then this is the answer: > > (position-bytes (point-max)) > > perhaps preceded by a call to 'widen'. But that returns the number of > bytes that the buffer's characters take when represented in the > internal Emacs representation of characters, which is not necessarily > useful to Lisp programs. For example, if you need to know how many > bytes will Emacs write to a file if you save the buffer, or to a > network connection or a sub-process if you send the buffer there, then > you need to consider the encoding process: Emacs always encodes the > buffer text on output to the external world. If this is what you > want, then you need to use bufferpos-to-filepos, and make sure you > pass the correct coding-system argument to it. > > If you need this for something else, please tell the details. Thank you, Eli, Andreas! Eli's crystal ball is correct: I'd like to know how many bytes Emacs will send when passing buffer contents (or a string) to a subprocess, and first I need to figure out which coding system is appropriate. The hyperdrive.el package provides a UI for creating and accessing shared virtual filesystems. hyperdrive.el uses plz.el as an Elisp API for curl in order to communicate with a local HTTP server. 
We want to be able to create hyperdrive "files" in an Emacs buffer and then upload them with the correct encoding. We also want to know how large they will be before uploading them. A couple of examples: Let's say I create a textual hyperdrive file using hyperdrive.el, and then I upload it by sending its contents via curl to the local HTTP server. What coding system should be used when the file is uploaded? Let's say I have a `iso-latin-1'-encoded file "foo.txt" on my local filesystem. I upload this encoded file to my hyperdrive by passing the filename to curl, which uploads the bytes with no conversion. Then I open the "foo.txt" hyperdrive file using hyperdrive.el, which receives the contents via curl from the local HTTP server. In the hyperdrive file buffer, buffer-file-coding-system should be `iso-latin-1' (right?). Then, I edit the buffer and save it to the hyperdrive again with hyperdrive.el, which this time sends the modified contents over the wire to curl. The uploaded file should be `iso-latin-1'-encoded (right?). Currently, plz.el always creates the curl subprocess like so: (make-process :coding 'binary ...) https://git.savannah.gnu.org/cgit/emacs/elpa.git/tree/plz.el?h=externals-release/plz#n519 Does this DTRT? Should we use buffer-file-coding-system not 'binary? Thank you for helping me understand encodings in Emacs. Joseph ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: How to get buffer byte length (not number of characters)? 2024-08-21 9:20 ` Joseph Turner @ 2024-08-21 17:47 ` Eli Zaretskii 2024-08-21 23:52 ` Joseph Turner 2024-08-22 7:09 ` Andreas Schwab 1 sibling, 1 reply; 40+ messages in thread From: Eli Zaretskii @ 2024-08-21 17:47 UTC (permalink / raw) To: Joseph Turner; +Cc: emacs-devel, schwab, adam > From: Joseph Turner <joseph@ushin.org> > Cc: emacs-devel@gnu.org, Andreas Schwab <schwab@suse.de>, Adam Porter > <adam@alphapapa.net> > Date: Wed, 21 Aug 2024 02:20:09 -0700 > > Let's say I create a textual hyperdrive file using hyperdrive.el, and > then I upload it by sending its contents via curl to the local HTTP > server. What coding system should be used when the file is uploaded? > > Let's say I have a `iso-latin-1'-encoded file "foo.txt" on my local > filesystem. I upload this encoded file to my hyperdrive by passing the > filename to curl, which uploads the bytes with no conversion. Then I > open the "foo.txt" hyperdrive file using hyperdrive.el, which receives > the contents via curl from the local HTTP server. In the hyperdrive > file buffer, buffer-file-coding-system should be `iso-latin-1' (right?). It's what I would expect, yes. But you can try it yourself, of course and make sure it is indeed what happens. > Then, I edit the buffer and save it to the hyperdrive again with > hyperdrive.el, which this time sends the modified contents over the wire > to curl. The uploaded file should be `iso-latin-1'-encoded (right?). Again, that'd be my expectation. But it's better to test this assumption. > Currently, plz.el always creates the curl subprocess like so: > > (make-process :coding 'binary ...) > > https://git.savannah.gnu.org/cgit/emacs/elpa.git/tree/plz.el?h=externals-release/plz#n519 > > Does this DTRT? It could be TRT if plz.el encodes the buffer text "by hand" before sending the results to curl and decodes it when it receives text from curl. Which I think is what happens there. 
* Re: How to get buffer byte length (not number of characters)? 2024-08-21 17:47 ` Eli Zaretskii @ 2024-08-21 23:52 ` Joseph Turner 2024-08-22 4:06 ` Eli Zaretskii 0 siblings, 1 reply; 40+ messages in thread From: Joseph Turner @ 2024-08-21 23:52 UTC (permalink / raw) To: Eli Zaretskii; +Cc: emacs-devel, schwab, adam Eli Zaretskii <eliz@gnu.org> writes: >> From: Joseph Turner <joseph@ushin.org> >> Cc: emacs-devel@gnu.org, Andreas Schwab <schwab@suse.de>, Adam Porter >> <adam@alphapapa.net> >> Date: Wed, 21 Aug 2024 02:20:09 -0700 >> >> Let's say I create a textual hyperdrive file using hyperdrive.el, and >> then I upload it by sending its contents via curl to the local HTTP >> server. What coding system should be used when the file is uploaded? >> >> Let's say I have a `iso-latin-1'-encoded file "foo.txt" on my local >> filesystem. I upload this encoded file to my hyperdrive by passing the >> filename to curl, which uploads the bytes with no conversion. Then I >> open the "foo.txt" hyperdrive file using hyperdrive.el, which receives >> the contents via curl from the local HTTP server. In the hyperdrive >> file buffer, buffer-file-coding-system should be `iso-latin-1' (right?). > > It's what I would expect, yes. But you can try it yourself, of course > and make sure it is indeed what happens. > >> Then, I edit the buffer and save it to the hyperdrive again with >> hyperdrive.el, which this time sends the modified contents over the wire >> to curl. The uploaded file should be `iso-latin-1'-encoded (right?). > > Again, that'd be my expectation. But it's better to test this > assumption. > >> Currently, plz.el always creates the curl subprocess like so: >> >> (make-process :coding 'binary ...) >> >> https://git.savannah.gnu.org/cgit/emacs/elpa.git/tree/plz.el?h=externals-release/plz#n519 >> >> Does this DTRT? > > It could be TRT if plz.el encodes the buffer text "by hand" before > sending the results to curl and decodes it when it receives text from > curl. 
Which I think is what happens there. plz.el does not manually encode buffer text *within Emacs* when sending requests to curl, but by default, plz.el sends data to curl with --data, which tells curl to strip CR and newlines. With the :body-type 'binary argument, plz.el instead uses --data-binary, which does no conversion. We don't want to strip newlines from hyperdrive files, so we always use :body-type 'binary when sending buffer contents. Should hyperdrive.el encode data with `buffer-file-coding-system' before passing to plz.el? When receiving text from curl, plz.el optionally decodes the text according to the charset in the 'Content-Type' header, e.g., "text/html; charset=utf-8" or utf-8 if no charset is found. Perhaps hyperdrive.el should check the 'Content-Type' header charset, then fallback to guessing the coding system based on filename and file contents with `set-auto-coding' (to avoid decoding images, etc.), and then finally fallback to something else? Thank you! Joseph ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: How to get buffer byte length (not number of characters)? 2024-08-21 23:52 ` Joseph Turner @ 2024-08-22 4:06 ` Eli Zaretskii 2024-08-22 7:24 ` Joseph Turner 0 siblings, 1 reply; 40+ messages in thread From: Eli Zaretskii @ 2024-08-22 4:06 UTC (permalink / raw) To: Joseph Turner; +Cc: emacs-devel, schwab, adam > From: Joseph Turner <joseph@ushin.org> > Cc: emacs-devel@gnu.org, schwab@suse.de, adam@alphapapa.net > Date: Wed, 21 Aug 2024 16:52:39 -0700 > > Eli Zaretskii <eliz@gnu.org> writes: > > >> Currently, plz.el always creates the curl subprocess like so: > >> > >> (make-process :coding 'binary ...) > >> > >> https://git.savannah.gnu.org/cgit/emacs/elpa.git/tree/plz.el?h=externals-release/plz#n519 > >> > >> Does this DTRT? > > > > It could be TRT if plz.el encodes the buffer text "by hand" before > > sending the results to curl and decodes it when it receives text from > > curl. Which I think is what happens there. > > plz.el does not manually encode buffer text *within Emacs* when sending > requests to curl, but by default, plz.el sends data to curl with --data, > which tells curl to strip CR and newlines. With the :body-type 'binary > argument, plz.el instead uses --data-binary, which does no conversion. Newlines is a relatively minor issue (although it, too, needs to be considered). My main concern is with the text encoding. How can it be TRT to use 'binary when sending buffer text to curl? that would mean we are more-or-less always sending the internal representation of characters, which is superset of UTF-8. If the data was originally encoded in anything but UTF-8, reading it into Emacs and then sending it back will change the byte sequences from that other encoding to UTF-8. Moreover, 'binary does not guarantee that the result is valid UTF-8. So maybe I misunderstand how these plz.el facilities are used, but up front this sounds like a mistake. 
> We don't want to strip newlines from hyperdrive files, so we always use > :body-type 'binary when sending buffer contents. Should hyperdrive.el > encode data with `buffer-file-coding-system' before passing to plz.el? I would think so, but maybe we should bring the plz.el developers on board of this discussion. > When receiving text from curl, plz.el optionally decodes the text > according to the charset in the 'Content-Type' header, e.g., "text/html; > charset=utf-8" or utf-8 if no charset is found. By "optionally" you mean that it doesn't always happen, except if the caller requests that? If so, the caller of plz.el should decode the text manually before using it in user-facing features. > Perhaps hyperdrive.el should check the 'Content-Type' header charset, > then fallback to guessing the coding system based on filename and file > contents with `set-auto-coding' (to avoid decoding images, etc.), and > then finally fallback to something else? Probably. But then I don't know anything about hyperdrive.el, either. If it copies text between files or URLs without showing it to the user, then the best strategy is indeed not to decode and encode stuff, but handle it as a stream of raw bytes. (In that case, my suggestion would be to use unibyte buffers and strings for temporarily storing and processing these raw bytes in Emacs.) But if the text is somehow shown to the user, it must be decoded to be displayed correctly by Emacs. And then it must be encoded back when writing it back to the external storage. ^ permalink raw reply [flat|nested] 40+ messages in thread
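The decode-for-display, encode-for-output cycle described above can be
sketched like this.  It is an illustration under stated assumptions:
`my/edit-roundtrip' is a hypothetical helper, and the sample bytes
stand in for an `iso-latin-1' file received from curl.

```elisp
;; Hypothetical helper illustrating decode -> edit -> encode.
(defun my/edit-roundtrip (raw-bytes coding-system edit-fn)
  "Decode unibyte RAW-BYTES with CODING-SYSTEM, apply EDIT-FN, and re-encode.
The result is a unibyte string suitable for sending back out."
  (let* ((text (decode-coding-string raw-bytes coding-system))
         (edited (funcall edit-fn text)))
    (encode-coding-string edited coding-system)))

;; "café" as four iso-latin-1 bytes.
(let ((latin-1-bytes (unibyte-string #x63 #x61 #x66 #xe9)))
  (list
   ;; Encoding with the original coding system gives the same bytes back.
   (equal (my/edit-roundtrip latin-1-bytes 'iso-latin-1 #'identity)
          latin-1-bytes)
   ;; Encoding the decoded text as UTF-8 instead changes the byte
   ;; sequence: "é" becomes two bytes.
   (length (encode-coding-string
            (decode-coding-string latin-1-bytes 'iso-latin-1)
            'utf-8))))
;; => (t 5)
```

This is the effect warned about earlier in the thread: text read from
another encoding and sent back without an explicit encoding step does
not reproduce the original bytes.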
* Re: How to get buffer byte length (not number of characters)? 2024-08-22 4:06 ` Eli Zaretskii @ 2024-08-22 7:24 ` Joseph Turner 2024-08-22 11:04 ` Eli Zaretskii 2024-08-22 12:26 ` Adam Porter 0 siblings, 2 replies; 40+ messages in thread From: Joseph Turner @ 2024-08-22 7:24 UTC (permalink / raw) To: Eli Zaretskii; +Cc: emacs-devel, schwab, adam [-- Attachment #1: Type: text/plain, Size: 4002 bytes --] Eli Zaretskii <eliz@gnu.org> writes: >> From: Joseph Turner <joseph@ushin.org> >> Cc: emacs-devel@gnu.org, schwab@suse.de, adam@alphapapa.net >> Date: Wed, 21 Aug 2024 16:52:39 -0700 >> >> Eli Zaretskii <eliz@gnu.org> writes: >> >> >> Currently, plz.el always creates the curl subprocess like so: >> >> >> >> (make-process :coding 'binary ...) >> >> >> >> https://git.savannah.gnu.org/cgit/emacs/elpa.git/tree/plz.el?h=externals-release/plz#n519 >> >> >> >> Does this DTRT? >> > >> > It could be TRT if plz.el encodes the buffer text "by hand" before >> > sending the results to curl and decodes it when it receives text from >> > curl. Which I think is what happens there. >> >> plz.el does not manually encode buffer text *within Emacs* when sending >> requests to curl, but by default, plz.el sends data to curl with --data, >> which tells curl to strip CR and newlines. With the :body-type 'binary >> argument, plz.el instead uses --data-binary, which does no conversion. > > Newlines is a relatively minor issue (although it, too, needs to be > considered). My main concern is with the text encoding. How can it > be TRT to use 'binary when sending buffer text to curl? that would > mean we are more-or-less always sending the internal representation of > characters, which is superset of UTF-8. If the data was originally > encoded in anything but UTF-8, reading it into Emacs and then sending > it back will change the byte sequences from that other encoding to > UTF-8. Moreover, 'binary does not guarantee that the result is valid > UTF-8. 
> > So maybe I misunderstand how these plz.el facilities are used, but up > front this sounds like a mistake. It could be. Eli, Adam, what do you think about the default coding systems for encoding the request body in the attached patch? >> We don't want to strip newlines from hyperdrive files, so we always use >> :body-type 'binary when sending buffer contents. Should hyperdrive.el >> encode data with `buffer-file-coding-system' before passing to plz.el? > > I would think so, but maybe we should bring the plz.el developers on > board of this discussion. I've CC'd Adam. >> When receiving text from curl, plz.el optionally decodes the text >> according to the charset in the 'Content-Type' header, e.g., "text/html; >> charset=utf-8" or utf-8 if no charset is found. > > By "optionally" you mean that it doesn't always happen, except if the > caller requests that? If so, the caller of plz.el should decode the > text manually before using it in user-facing features. By default, `plz' decodes response body according to the 'Content-Type' charset (or utf-8 as fallback). Passing `:decode nil' stops that. >> Perhaps hyperdrive.el should check the 'Content-Type' header charset, >> then fallback to guessing the coding system based on filename and file >> contents with `set-auto-coding' (to avoid decoding images, etc.), and >> then finally fallback to something else? > > Probably. But then I don't know anything about hyperdrive.el, either. > If it copies text between files or URLs without showing it to the > user, then the best strategy is indeed not to decode and encode stuff, > but handle it as a stream of raw bytes. (In that case, my suggestion > would be to use unibyte buffers and strings for temporarily storing > and processing these raw bytes in Emacs.) But if the text is somehow > shown to the user, it must be decoded to be displayed correctly by > Emacs. And then it must be encoded back when writing it back to the > external storage. Thanks! 
Good to know about unibyte buffers and strings for that. hyperdrive.el does show text to the user, so we'll likely do something like what I described above. What fallback encoding should we use if there's no 'Content-Type' charset and `set-auto-coding' returns nil? IIUC, there's no foolproof way to guess the encoding of unknown bytes. default-file-name-coding-system? Thank you!! I feel more solid in my understanding of encodings now. Joseph [-- Warning: decoded text below may be mangled, UTF-8 assumed --] [-- Attachment #2: 0001-Add-plz-BODY-CODING-argument-Add-default-encoding.patch --] [-- Type: text/x-diff, Size: 3377 bytes --] From a684ff680ab05f359b628623159b4d3392eb448e Mon Sep 17 00:00:00 2001 From: Joseph Turner <joseph@breatheoutbreathe.in> Date: Thu, 22 Aug 2024 00:02:19 -0700 Subject: [PATCH] Add: (plz) BODY-CODING argument; Add default encoding Previously, strings and buffers were sent to curl as the internal Emacs representation. Now strings and buffers are encoded, and the BODY-CODING argument can used to override the default coding systems. --- plz.el | 29 ++++++++++++++++++++++++++--- 1 file changed, 26 insertions(+), 3 deletions(-) diff --git a/plz.el b/plz.el index 903d71e..91d41d2 100644 --- a/plz.el +++ b/plz.el @@ -323,7 +323,7 @@ (defalias 'plz--generate-new-buffer ;;;;; Public -(cl-defun plz (method url &rest rest &key headers body else filter finally noquery timeout +(cl-defun plz (method url &rest rest &key headers body else filter finally noquery timeout body-coding (as 'string) (then 'sync) (body-type 'text) (decode t decode-s) (connect-timeout plz-connect-timeout)) @@ -340,6 +340,13 @@ (cl-defun plz (method url &rest rest &key headers body else filter finally noque BODY-TYPE may be `text' to send BODY as text, or `binary' to send it as binary. +BODY-CODING may a coding system used to encode BODY before +passing it to curl. BODY-CODING has no effect when BODY is a +list like `(file FILENAME)'. 
If nil and BODY is a string, the +default process I/O output coding system is used. If nil and +BODY is a buffer, the buffer-local value of +`buffer-file-coding-system' is used. + AS selects the kind of result to pass to the callback function THEN, or the kind of result to return for synchronous requests. It may be: @@ -416,6 +423,19 @@ (cl-defun plz (method url &rest rest &key headers body else filter finally noque (declare (indent defun)) (setf decode (if (and decode-s (not decode)) nil decode)) + (unless body-coding + (pcase-exhaustive body + (`(file ,filename) + ;; Don't set BODY-CODING; files are passed as-is to curl. + (setf body-coding nil)) + ((pred stringp) + ;; Use default output coding for processes. + (setf body-coding (cdr default-process-coding-system))) + ((and (pred bufferp) buffer) + ;; Use buffer-local coding. + (setf body-coding + (buffer-local-value 'buffer-file-coding-system buffer))))) + ;; NOTE: By default, for PUT requests and POST requests >1KB, curl sends an ;; "Expect:" header, which causes servers to send a "100 Continue" response, which ;; we don't want to have to deal with, so we disable it by setting the header to @@ -553,8 +573,11 @@ (cl-defun plz (method url &rest rest &key headers body else filter finally noque (process-send-string process curl-config) (when body (cl-typecase body - (string (process-send-string process body)) - (buffer (with-current-buffer body + (string (process-send-string + process (encode-coding-string body body-coding t))) + (buffer (with-temp-buffer + (insert-buffer-substring-no-properties body) + (encode-coding-region (point-min) (point-max) body-coding) (process-send-region process (point-min) (point-max)))))) (process-send-eof process) (if sync-p -- 2.41.0 ^ permalink raw reply related [flat|nested] 40+ messages in thread
* Re: How to get buffer byte length (not number of characters)?
  2024-08-22  7:24 ` Joseph Turner
@ 2024-08-22 11:04   ` Eli Zaretskii
  2024-08-22 18:29     ` Joseph Turner
  2024-08-22 12:26   ` Adam Porter
  1 sibling, 1 reply; 40+ messages in thread
From: Eli Zaretskii @ 2024-08-22 11:04 UTC (permalink / raw)
  To: Joseph Turner; +Cc: emacs-devel, schwab, adam

> From: Joseph Turner <joseph@ushin.org>
> Cc: emacs-devel@gnu.org, schwab@suse.de, adam@alphapapa.net
> Date: Thu, 22 Aug 2024 00:24:45 -0700
>
> > So maybe I misunderstand how these plz.el facilities are used, but up
> > front this sounds like a mistake.
>
> It could be.  Eli, Adam, what do you think about the default coding
> systems for encoding the request body in the attached patch?

I think it is better to use detect-coding-region instead, if
buffer-file-coding-system is undecided.

> > By "optionally" you mean that it doesn't always happen, except if the
> > caller requests that?  If so, the caller of plz.el should decode the
> > text manually before using it in user-facing features.
>
> By default, `plz' decodes response body according to the 'Content-Type'
> charset (or utf-8 as fallback).  Passing `:decode nil' stops that.

Sounds correct.

> default-file-name-coding-system?

That's for file names, so it is not what you want here.
* Re: How to get buffer byte length (not number of characters)? 2024-08-22 11:04 ` Eli Zaretskii @ 2024-08-22 18:29 ` Joseph Turner 2024-08-22 18:44 ` Eli Zaretskii 0 siblings, 1 reply; 40+ messages in thread From: Joseph Turner @ 2024-08-22 18:29 UTC (permalink / raw) To: Eli Zaretskii; +Cc: emacs-devel, schwab, adam Eli Zaretskii <eliz@gnu.org> writes: >> From: Joseph Turner <joseph@ushin.org> >> Cc: emacs-devel@gnu.org, schwab@suse.de, adam@alphapapa.net >> Date: Thu, 22 Aug 2024 00:24:45 -0700 >> >> > So maybe I misunderstand how these plz.el facilities are used, but up >> > front this sounds like a mistake. >> >> It could be. Eli, Adam, what do you think about the default coding >> systems for encoding the request body in the attached patch? > > I think it is better to use detect-coding-region instead, if > buffer-file-coding-system is undecided. detect-coding-region is only useful when decoding text, right? For encoding text, should we encode with buffer-file-coding-system? >> > By "optionally" you mean that it doesn't always happen, except if the >> > caller requests that? If so, the caller of plz.el should decode the >> > text manually before using it in user-facing features. >> >> By default, `plz' decodes response body according to the 'Content-Type' >> charset (or utf-8 as fallback). Passing `:decode nil' stops that. > > Sounds correct. When decoding, should plz fallback to detect-coding-region instead of utf-8? >> default-file-name-coding-system? > > That's for file names, so it is not what you want here. Thanks! So when decoding text in hyperdrive.el, we can use (1) Content-Type charset, or (2) use `detect-coding-region' as a fallback. IIUC, there's no need to use `set-auto-coding', since `detect-coding-region' DTRT. Joseph ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: How to get buffer byte length (not number of characters)?
  2024-08-22 18:29 ` Joseph Turner
@ 2024-08-22 18:44   ` Eli Zaretskii
  2024-08-22 19:32     ` tomas
  2024-08-23  3:56     ` Joseph Turner
  0 siblings, 2 replies; 40+ messages in thread
From: Eli Zaretskii @ 2024-08-22 18:44 UTC (permalink / raw)
  To: Joseph Turner; +Cc: emacs-devel, schwab, adam

> From: Joseph Turner <joseph@ushin.org>
> Cc: emacs-devel@gnu.org, schwab@suse.de, adam@alphapapa.net
> Date: Thu, 22 Aug 2024 11:29:48 -0700
>
> Eli Zaretskii <eliz@gnu.org> writes:
>
> > I think it is better to use detect-coding-region instead, if
> > buffer-file-coding-system is undecided.
>
> detect-coding-region is only useful when decoding text, right?

Yes, sorry.  I should have said find-coding-systems-region.

> For encoding text, should we encode with buffer-file-coding-system?

If you are sure it will do, yes.  But what if the buffer started as
all-ASCII and then the user or some Lisp program added some non-ASCII
characters before saving?  Then buffer-file-coding-system is no longer
pertinent.

> >> > By "optionally" you mean that it doesn't always happen, except if the
> >> > caller requests that?  If so, the caller of plz.el should decode the
> >> > text manually before using it in user-facing features.
> >>
> >> By default, `plz' decodes response body according to the 'Content-Type'
> >> charset (or utf-8 as fallback).  Passing `:decode nil' stops that.
> >
> > Sounds correct.
>
> When decoding, should plz fallback to detect-coding-region instead of utf-8?

If this is HTML, then I think it is okay to trust the headers about
the charset and default to UTF-8.  The problem with
detect-coding-region is that some of it is based on guesswork, which
is one reason why it could take a UTF-8 encoded text to be Latin-1.
So if a more reliable source of information is available, we had
better use it.

> Thanks!  So when decoding text in hyperdrive.el, we can use (1)
> Content-Type charset, or (2) use `detect-coding-region' as a fallback.

That's also possible.
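The two-step decoding strategy discussed above (declared charset
first, detection as a fallback) might look roughly like this.  It is a
sketch only: `my/decode-body' is a hypothetical name, CHARSET is
assumed to be the charset string already extracted from a Content-Type
header, and neither plz.el nor hyperdrive.el is claimed to work this
way.

```elisp
;; Hypothetical helper; not plz.el's or hyperdrive.el's actual code.
(defun my/decode-body (bytes &optional charset)
  "Decode unibyte BYTES into text.
CHARSET is an optional string such as \"utf-8\".  When it names a known
coding system, trust it; otherwise fall back to guessing with
`detect-coding-string'."
  (let* ((declared (and charset (intern (downcase charset))))
         (coding (if (and declared (coding-system-p declared))
                     declared
                   ;; Guesswork: with a non-nil second argument, only
                   ;; the most preferred candidate is returned.
                   (detect-coding-string bytes t))))
    (decode-coding-string bytes coding)))

;; Declared charset: decoded exactly as stated.
(my/decode-body (encode-coding-string "你好" 'utf-8) "UTF-8")
;; No charset: the result depends on what detection guesses.
(my/decode-body "plain ascii")
```

The `declared' check is needed because `coding-system-p' also accepts
nil, which would otherwise skip the detection fallback.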
* Re: How to get buffer byte length (not number of characters)?
  2024-08-22 18:44 ` Eli Zaretskii
@ 2024-08-22 19:32   ` tomas
  2024-08-23  3:56   ` Joseph Turner
  1 sibling, 0 replies; 40+ messages in thread
From: tomas @ 2024-08-22 19:32 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Joseph Turner, emacs-devel, schwab, adam

On Thu, Aug 22, 2024 at 09:44:04PM +0300, Eli Zaretskii wrote:
> > From: Joseph Turner <joseph@ushin.org>

[...]

> > When decoding, should plz fallback to detect-coding-region instead of utf-8?
>
> If this is HTML, then I think it is okay to trust the headers about
> the charset and default to UTF-8.  The problem with
> detect-coding-region is that some of it is based on guesswork [...]

Yes, and it's incredibly crude guesswork at times.  Talk to the server
admin.

With HTML and friends, you get one or two layers of fun, because they
can declare the encoding /within/ the stream (HTML in two different
ways, at least).  If the "outer layer" decides to helpfully recode,
then the inner declarations are lying (I actually had this with HTML
mails: the MIME layer recoded Latin-1 to UTF-8, the tag
<meta charset="iso-8859-1"> in there was a lie.  Needless to say,
html2text made mojibake :-)

Cheers
-- 
t
* Re: How to get buffer byte length (not number of characters)?
  2024-08-22 18:44 ` Eli Zaretskii
  2024-08-22 19:32 ` tomas
@ 2024-08-23  3:56 ` Joseph Turner
  2024-08-23  7:02 ` Eli Zaretskii
  2024-08-24  6:14 ` Joseph Turner
  1 sibling, 2 replies; 40+ messages in thread
From: Joseph Turner @ 2024-08-23 3:56 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel, schwab, adam

Eli Zaretskii <eliz@gnu.org> writes:

>> From: Joseph Turner <joseph@ushin.org>
>> Cc: emacs-devel@gnu.org, schwab@suse.de, adam@alphapapa.net
>> Date: Thu, 22 Aug 2024 11:29:48 -0700
>>
>> Eli Zaretskii <eliz@gnu.org> writes:
>>
>> > I think it is better to use detect-coding-region instead, if
>> > buffer-file-coding-system is undecided.
>>
>> detect-coding-region is only useful when decoding text, right?
>
> Yes, sorry.  I should have said find-coding-systems-region.
>
>> For encoding text, should we encode with buffer-file-coding-system?
>
> If you are sure it will do, yes.  But what if the buffer started as
> all-ASCII and then the user or some Lisp program added some non-ASCII
> characters before saving?  Then buffer-file-coding-system is no longer
> pertinent.

I understand.  Thank you!

How do we encode if find-coding-systems-region returns '(undecided)?

>> >> > By "optionally" you mean that it doesn't always happen, except if the
>> >> > caller requests that?  If so, the caller of plz.el should decode the
>> >> > text manually before using it in user-facing features.
>> >>
>> >> By default, `plz' decodes response body according to the 'Content-Type'
>> >> charset (or utf-8 as fallback).  Passing `:decode nil' stops that.
>> >
>> > Sounds correct.
>>
>> When decoding, should plz fallback to detect-coding-region instead of utf-8?
>
> If this is HTML, then I think it is okay to trust the headers about
> the charset and default to UTF-8.  The problem with
> detect-coding-region is that some of it is based on guesswork, which
> is one reason why it could take a UTF-8 encoded text to be Latin-1.
> So if a more reliable source of information is available, we had
> better use it.

tomas says:

> Yes, and it's incredibly crude guesswork at times. Talk to the server
> admin.

With hyperdrive p2p file sharing, there is no server admin. 😉

Ideally, when users PUT a file into a hyperdrive, hyperdrive.el would
encode the buffer with:

  (car (find-coding-systems-region (point-min) (point-max)))

and then send the coding system along with the file in the PUT request.
The coding system would be stored with the hyperdrive file metadata,
for other users to load along with the file contents.

On the other end of the network, hyperdrive.el would use

  (decode-coding-region (point-min) (point-max) CODING-FROM-HYPERDRIVE-METADATA)

However AFAIK, there's no specified or de facto standard for storing
coding metadata in a hyperdrive, so this approach requires deliberation
first.  I've made an issue on the `hypercore-fetch` repository:

https://github.com/RangerMauve/hypercore-fetch/issues/100

For now, we'll rely on detect-coding-region for decoding.

Thanks!

Joseph

^ permalink raw reply [flat|nested] 40+ messages in thread
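The PUT/GET round trip described in the message above could be sketched roughly as follows. This is only an illustration: `my-hyperdrive-encode-buffer' and `my-hyperdrive-decode-bytes' are hypothetical names, not hyperdrive.el or plz.el functions, and how the coding system would actually travel with the file is exactly the open metadata question raised above.

```elisp
;; Hypothetical helpers; not part of hyperdrive.el or plz.el.

(defun my-hyperdrive-encode-buffer (buffer)
  "Return (CODING . BYTES) for the text of BUFFER.
CODING is the first coding system reported as safe for the text;
BYTES is the encoded text as a unibyte string.  BUFFER itself is
not modified."
  (with-current-buffer buffer
    (let ((coding (car (find-coding-systems-region (point-min) (point-max)))))
      ;; ASCII-only text yields `(undecided)'; assume UTF-8 in that case.
      (when (memq coding '(nil undecided))
        (setq coding 'utf-8))
      (cons coding
            ;; A DESTINATION of t returns the encoded text as a string
            ;; instead of replacing the region.
            (encode-coding-region (point-min) (point-max) coding t)))))

(defun my-hyperdrive-decode-bytes (bytes coding)
  "Decode the unibyte string BYTES with CODING, as a receiver would."
  (decode-coding-string bytes coding))
```

The sender would transmit both elements of the returned pair; the receiver would pass them back to `my-hyperdrive-decode-bytes' instead of guessing with detect-coding-region.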
* Re: How to get buffer byte length (not number of characters)? 2024-08-23 3:56 ` Joseph Turner @ 2024-08-23 7:02 ` Eli Zaretskii 2024-08-23 7:37 ` Joseph Turner 2024-08-23 7:43 ` Joseph Turner 2024-08-24 6:14 ` Joseph Turner 1 sibling, 2 replies; 40+ messages in thread From: Eli Zaretskii @ 2024-08-23 7:02 UTC (permalink / raw) To: Joseph Turner; +Cc: emacs-devel, schwab, adam > From: Joseph Turner <joseph@ushin.org> > Cc: emacs-devel@gnu.org, schwab@suse.de, adam@alphapapa.net > Date: Thu, 22 Aug 2024 20:56:19 -0700 > > Eli Zaretskii <eliz@gnu.org> writes: > > >> For encoding text, should we encode with buffer-file-coding-system? > > > > If you are sure it will do, yes. But what if the buffer started as > > all-ASCII and then the user or some Lisp program added some non-ASCII > > characters before saving? Then buffer-file-coding-system is no longer > > pertinent. > > I understand. Thank you! > > How do we encode if find-coding-systems-region returns '(undecided)? Use buffer-file-coding-system. If this is an interactive command, you could also use select-safe-coding-system, which calls find-coding-systems-region internally, and also has complex logic for finding suitable callbacks and asking the user to select an encoding if it fails to find something suitable. But this is not appropriate in non-interactive code. ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: How to get buffer byte length (not number of characters)?
  2024-08-23  7:02 ` Eli Zaretskii
@ 2024-08-23  7:37 ` Joseph Turner
  2024-08-23 12:34 ` Eli Zaretskii
  2024-08-23  7:43 ` Joseph Turner
  1 sibling, 1 reply; 40+ messages in thread
From: Joseph Turner @ 2024-08-23 7:37 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel, schwab, adam

Eli Zaretskii <eliz@gnu.org> writes:

>> From: Joseph Turner <joseph@ushin.org>
>> Cc: emacs-devel@gnu.org, schwab@suse.de, adam@alphapapa.net
>> Date: Thu, 22 Aug 2024 20:56:19 -0700
>>
>> Eli Zaretskii <eliz@gnu.org> writes:
>>
>> >> For encoding text, should we encode with buffer-file-coding-system?
>> >
>> > If you are sure it will do, yes.  But what if the buffer started as
>> > all-ASCII and then the user or some Lisp program added some non-ASCII
>> > characters before saving?  Then buffer-file-coding-system is no longer
>> > pertinent.
>>
>> I understand.  Thank you!
>>
>> How do we encode if find-coding-systems-region returns '(undecided)?
>
> Use buffer-file-coding-system.
>
> If this is an interactive command, you could also use
> select-safe-coding-system, which calls find-coding-systems-region
> internally, and also has complex logic for finding suitable callbacks
> and asking the user to select an encoding if it fails to find
> something suitable.  But this is not appropriate in non-interactive
> code.

Thank you!  If both find-coding-systems-region and
buffer-file-coding-system are undecided, then is it safe to fallback to
utf-8?

I feel grateful for your thorough attention to this topic.

Joseph

^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: How to get buffer byte length (not number of characters)?
  2024-08-23  7:37 ` Joseph Turner
@ 2024-08-23 12:34 ` Eli Zaretskii
  0 siblings, 0 replies; 40+ messages in thread
From: Eli Zaretskii @ 2024-08-23 12:34 UTC (permalink / raw)
  To: Joseph Turner; +Cc: emacs-devel, schwab, adam

> From: Joseph Turner <joseph@ushin.org>
> Cc: emacs-devel@gnu.org, schwab@suse.de, adam@alphapapa.net
> Date: Fri, 23 Aug 2024 00:37:33 -0700
>
> Eli Zaretskii <eliz@gnu.org> writes:
>
> >> How do we encode if find-coding-systems-region returns '(undecided)?
> >
> > Use buffer-file-coding-system.
> >
> > If this is an interactive command, you could also use
> > select-safe-coding-system, which calls find-coding-systems-region
> > internally, and also has complex logic for finding suitable callbacks
> > and asking the user to select an encoding if it fails to find
> > something suitable.  But this is not appropriate in non-interactive
> > code.
>
> Thank you!  If both find-coding-systems-region and
> buffer-file-coding-system are undecided, then is it safe to fallback to
> utf-8?

Yes.

^ permalink raw reply [flat|nested] 40+ messages in thread
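The three-step fallback agreed on in the exchange above (find-coding-systems-region, then buffer-file-coding-system, then utf-8) might look something like the sketch below for non-interactive code. The function name `my-noninteractive-coding' is made up for illustration; it is not from hyperdrive.el or plz.el.

```elisp
;; Hypothetical helper illustrating the fallback order discussed above.

(defun my-noninteractive-coding ()
  "Pick a coding system for encoding the current buffer without prompting."
  (let ((safe (car (find-coding-systems-region (point-min) (point-max))))
        (bfcs buffer-file-coding-system))
    (cond
     ;; 1. The text has non-ASCII characters: take the first coding
     ;;    system that is reported as safe for it.
     ((and safe (not (eq safe 'undecided))) safe)
     ;; 2. ASCII-only text, where any coding system is safe: use the
     ;;    buffer's own coding system if it is decided.
     ((and bfcs (not (eq (coding-system-base bfcs) 'undecided))) bfcs)
     ;; 3. Both are undecided: fall back to UTF-8.
     (t 'utf-8))))
```

Using `coding-system-base' in step 2 is meant to treat variants such as `undecided-unix' the same as `undecided'.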
* Re: How to get buffer byte length (not number of characters)? 2024-08-23 7:02 ` Eli Zaretskii 2024-08-23 7:37 ` Joseph Turner @ 2024-08-23 7:43 ` Joseph Turner 2024-08-23 12:38 ` Eli Zaretskii 1 sibling, 1 reply; 40+ messages in thread From: Joseph Turner @ 2024-08-23 7:43 UTC (permalink / raw) To: Eli Zaretskii; +Cc: emacs-devel, schwab, adam Eli Zaretskii <eliz@gnu.org> writes: >> From: Joseph Turner <joseph@ushin.org> >> Cc: emacs-devel@gnu.org, schwab@suse.de, adam@alphapapa.net >> Date: Thu, 22 Aug 2024 20:56:19 -0700 >> >> Eli Zaretskii <eliz@gnu.org> writes: >> >> >> For encoding text, should we encode with buffer-file-coding-system? >> > >> > If you are sure it will do, yes. But what if the buffer started as >> > all-ASCII and then the user or some Lisp program added some non-ASCII >> > characters before saving? Then buffer-file-coding-system is no longer >> > pertinent. >> >> I understand. Thank you! >> >> How do we encode if find-coding-systems-region returns '(undecided)? > > Use buffer-file-coding-system. > > If this is an interactive command, you could also use > select-safe-coding-system, which calls find-coding-systems-region > internally, and also has complex logic for finding suitable callbacks > and asking the user to select an encoding if it fails to find > something suitable. But this is not appropriate in non-interactive > code. I'm surprised that (with-temp-buffer (insert "你好") (set-buffer-file-coding-system 'chinese-big5) (car (find-coding-systems-region (point-min) (point-max)))) returns 'utf-8 and not 'chinese-big5. Are the codings intended to be ordered by priority? If so, should buffer-file-coding-system be at the front of the list if it's safe? ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: How to get buffer byte length (not number of characters)? 2024-08-23 7:43 ` Joseph Turner @ 2024-08-23 12:38 ` Eli Zaretskii 2024-08-23 16:59 ` Joseph Turner 0 siblings, 1 reply; 40+ messages in thread From: Eli Zaretskii @ 2024-08-23 12:38 UTC (permalink / raw) To: Joseph Turner; +Cc: emacs-devel, schwab, adam > From: Joseph Turner <joseph@ushin.org> > Cc: emacs-devel@gnu.org, schwab@suse.de, adam@alphapapa.net > Date: Fri, 23 Aug 2024 00:43:52 -0700 > > I'm surprised that > > (with-temp-buffer > (insert "你好") > (set-buffer-file-coding-system 'chinese-big5) > (car (find-coding-systems-region (point-min) (point-max)))) > > returns 'utf-8 and not 'chinese-big5. What does coding-system-priority-list returns in your case? > Are the codings intended to be > ordered by priority? Yes. > If so, should buffer-file-coding-system be at the front of the list > if it's safe? How do you know it's safe? If your application needs to prefer buffer-file-coding-system, then you should see if buffer-file-coding-system is a member of the list returned by find-coding-systems-region, and if so, use that. ^ permalink raw reply [flat|nested] 40+ messages in thread
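The membership check suggested above could be sketched as follows. `my-preferred-safe-coding' is a hypothetical name. The comparison goes through `coding-system-base' on the assumption that find-coding-systems-region reports base coding systems (e.g. `utf-8' rather than `utf-8-unix', or `big5' rather than its alias `chinese-big5').

```elisp
;; Hypothetical helper; prefers the buffer's own coding system when safe.

(defun my-preferred-safe-coding (from to)
  "Return a coding system that can safely encode the text in FROM..TO.
Prefer `buffer-file-coding-system' when it is among the safe ones."
  (let ((safe (find-coding-systems-region from to))
        (bfcs buffer-file-coding-system))
    (if (and bfcs
             (or (equal safe '(undecided)) ; ASCII only: everything is safe
                 (memq (coding-system-base bfcs) safe)))
        bfcs
      ;; Otherwise take the highest-priority safe coding system.  Note
      ;; that this can still be `undecided' for ASCII-only text when
      ;; `buffer-file-coding-system' is nil.
      (car safe))))
```

With this sketch, the `chinese-big5' example from the thread would be checked against the safe list before the priority order is consulted.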
* Re: How to get buffer byte length (not number of characters)? 2024-08-23 12:38 ` Eli Zaretskii @ 2024-08-23 16:59 ` Joseph Turner 2024-08-23 17:35 ` Eli Zaretskii 0 siblings, 1 reply; 40+ messages in thread From: Joseph Turner @ 2024-08-23 16:59 UTC (permalink / raw) To: Eli Zaretskii; +Cc: emacs-devel, schwab, adam Eli Zaretskii <eliz@gnu.org> writes: >> From: Joseph Turner <joseph@ushin.org> >> Cc: emacs-devel@gnu.org, schwab@suse.de, adam@alphapapa.net >> Date: Fri, 23 Aug 2024 00:43:52 -0700 >> >> I'm surprised that >> >> (with-temp-buffer >> (insert "你好") >> (set-buffer-file-coding-system 'chinese-big5) >> (car (find-coding-systems-region (point-min) (point-max)))) >> >> returns 'utf-8 and not 'chinese-big5. > > What does coding-system-priority-list returns in your case? 'utf-8 >> Are the codings intended to be >> ordered by priority? > > Yes. > >> If so, should buffer-file-coding-system be at the front of the list >> if it's safe? > > How do you know it's safe? > > If your application needs to prefer buffer-file-coding-system, then > you should see if buffer-file-coding-system is a member of the list > returned by find-coding-systems-region, and if so, use that. I'd have thought that most applications would want to prefer buffer-file-coding-system if it's a member of the list returned by find-coding-systems-region, but perhaps not. I now have a clear path forward for hyperdrive.el. Thank you for your time, Eli!! Joseph ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: How to get buffer byte length (not number of characters)? 2024-08-23 16:59 ` Joseph Turner @ 2024-08-23 17:35 ` Eli Zaretskii 2024-08-23 20:37 ` Joseph Turner 0 siblings, 1 reply; 40+ messages in thread From: Eli Zaretskii @ 2024-08-23 17:35 UTC (permalink / raw) To: Joseph Turner; +Cc: emacs-devel, schwab, adam > From: Joseph Turner <joseph@ushin.org> > Cc: emacs-devel@gnu.org, schwab@suse.de, adam@alphapapa.net > Date: Fri, 23 Aug 2024 09:59:22 -0700 > > Eli Zaretskii <eliz@gnu.org> writes: > > >> From: Joseph Turner <joseph@ushin.org> > >> Cc: emacs-devel@gnu.org, schwab@suse.de, adam@alphapapa.net > >> Date: Fri, 23 Aug 2024 00:43:52 -0700 > >> > >> I'm surprised that > >> > >> (with-temp-buffer > >> (insert "你好") > >> (set-buffer-file-coding-system 'chinese-big5) > >> (car (find-coding-systems-region (point-min) (point-max)))) > >> > >> returns 'utf-8 and not 'chinese-big5. > > > > What does coding-system-priority-list returns in your case? > > 'utf-8 That explains what you see, then. > >> Are the codings intended to be > >> ordered by priority? > > > > Yes. > > > >> If so, should buffer-file-coding-system be at the front of the list > >> if it's safe? > > > > How do you know it's safe? > > > > If your application needs to prefer buffer-file-coding-system, then > > you should see if buffer-file-coding-system is a member of the list > > returned by find-coding-systems-region, and if so, use that. > > I'd have thought that most applications would want to prefer > buffer-file-coding-system if it's a member of the list returned by > find-coding-systems-region, but perhaps not. Most applications use select-safe-coding-system, which AFAIR already does all that. ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: How to get buffer byte length (not number of characters)?
  2024-08-23 17:35 ` Eli Zaretskii
@ 2024-08-23 20:37 ` Joseph Turner
  0 siblings, 0 replies; 40+ messages in thread
From: Joseph Turner @ 2024-08-23 20:37 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel, schwab, adam

Eli Zaretskii <eliz@gnu.org> writes:

>> From: Joseph Turner <joseph@ushin.org>
>> Cc: emacs-devel@gnu.org, schwab@suse.de, adam@alphapapa.net
>> Date: Fri, 23 Aug 2024 09:59:22 -0700
>>
>> Eli Zaretskii <eliz@gnu.org> writes:
>>
>> >> From: Joseph Turner <joseph@ushin.org>
>> >> Cc: emacs-devel@gnu.org, schwab@suse.de, adam@alphapapa.net
>> >> Date: Fri, 23 Aug 2024 00:43:52 -0700
>> >>
>> >> I'm surprised that
>> >>
>> >> (with-temp-buffer
>> >>   (insert "你好")
>> >>   (set-buffer-file-coding-system 'chinese-big5)
>> >>   (car (find-coding-systems-region (point-min) (point-max))))
>> >>
>> >> returns 'utf-8 and not 'chinese-big5.
>> >
>> > What does coding-system-priority-list returns in your case?
>>
>> 'utf-8
>
> That explains what you see, then.
>
>> >> Are the codings intended to be
>> >> ordered by priority?
>> >
>> > Yes.
>> >
>> >> If so, should buffer-file-coding-system be at the front of the list
>> >> if it's safe?
>> >
>> > How do you know it's safe?
>> >
>> > If your application needs to prefer buffer-file-coding-system, then
>> > you should see if buffer-file-coding-system is a member of the list
>> > returned by find-coding-systems-region, and if so, use that.
>>
>> I'd have thought that most applications would want to prefer
>> buffer-file-coding-system if it's a member of the list returned by
>> find-coding-systems-region, but perhaps not.
>
> Most applications use select-safe-coding-system, which AFAIR already
> does all that.

Fantastic!  Yes, it appears that select-safe-coding-system does account
for buffer-file-coding-system.

So, hyperdrive.el can just encode with

  (select-safe-coding-system (point-min) (point-max) nil nil FILENAME)

Simple.  Thank you!!

Joseph

^ permalink raw reply [flat|nested] 40+ messages in thread
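Once a coding system has been chosen, the question that started this thread (a buffer's byte length without writing a file) can be answered by encoding to a string and counting its bytes. The following is a minimal sketch under that assumption; `my-buffer-encoded-byte-length' is a hypothetical name, and the coding system is passed in explicitly so that nothing prompts the user.

```elisp
;; Hypothetical helper: byte length of the buffer text *as encoded*,
;; which is not the same as the size of Emacs's internal representation.

(defun my-buffer-encoded-byte-length (coding)
  "Return how many bytes the current buffer occupies when encoded with CODING.
The buffer is not modified."
  ;; With DESTINATION t, `encode-coding-region' returns the encoded text
  ;; as a unibyte string, whose `string-bytes' is the byte count.
  (string-bytes
   (encode-coding-region (point-min) (point-max) coding t)))

;; Usage, mirroring the example from the first message in the thread:
(with-temp-buffer
  (insert "你好")
  (my-buffer-encoded-byte-length 'utf-8)) ; => 6, same as the file size
```

The result depends on CODING: the same two characters would take a different number of bytes under, say, a Big5 or UTF-16 encoding, which is why the coding system has to be decided first.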
* Re: How to get buffer byte length (not number of characters)?
  2024-08-23  3:56 ` Joseph Turner
  2024-08-23  7:02 ` Eli Zaretskii
@ 2024-08-24  6:14 ` Joseph Turner
  1 sibling, 0 replies; 40+ messages in thread
From: Joseph Turner @ 2024-08-24 6:14 UTC (permalink / raw)
  To: emacs-devel; +Cc: schwab, adam

Joseph Turner <joseph@ushin.org> writes:

>
> However AFAIK, there's no specified or de facto standard for storing
> coding metadata in a hyperdrive, so this approach requires deliberation
> first.  I've made an issue on the `hypercore-fetch` repository:
>
> https://github.com/RangerMauve/hypercore-fetch/issues/100

If you're interested, here's a similar issue on the holepunch hyperdrive
tracker about a standard for storing hyperdrive file encoding metadata:

https://github.com/holepunchto/hyperdrive/issues/372

Thanks!

Joseph

^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: How to get buffer byte length (not number of characters)? 2024-08-22 7:24 ` Joseph Turner 2024-08-22 11:04 ` Eli Zaretskii @ 2024-08-22 12:26 ` Adam Porter 2024-08-22 12:47 ` tomas 2024-08-22 13:50 ` Eli Zaretskii 1 sibling, 2 replies; 40+ messages in thread From: Adam Porter @ 2024-08-22 12:26 UTC (permalink / raw) To: Joseph Turner, Eli Zaretskii; +Cc: emacs-devel, schwab Hi Joseph, et al, On 8/22/24 02:24, Joseph Turner wrote: >>> plz.el does not manually encode buffer text *within Emacs* when sending >>> requests to curl, but by default, plz.el sends data to curl with --data, >>> which tells curl to strip CR and newlines. With the :body-type 'binary >>> argument, plz.el instead uses --data-binary, which does no conversion. >> >> Newlines is a relatively minor issue (although it, too, needs to be >> considered). My main concern is with the text encoding. How can it >> be TRT to use 'binary when sending buffer text to curl? that would >> mean we are more-or-less always sending the internal representation of >> characters, which is superset of UTF-8. If the data was originally >> encoded in anything but UTF-8, reading it into Emacs and then sending >> it back will change the byte sequences from that other encoding to >> UTF-8. Moreover, 'binary does not guarantee that the result is valid >> UTF-8. >> >> So maybe I misunderstand how these plz.el facilities are used, but up >> front this sounds like a mistake. > > It could be. Eli, Adam, what do you think about the default coding > systems for encoding the request body in the attached patch? From an API perspective, I'm not sure. My idea for plz.el is to provide a simple, somewhat idiomatic Elisp API for making HTTP requests (and, of course, to make "correct" requests, in compliance with specifications and expectations). Given the relatively few clients of plz thus far, some issues are yet to be fully explored and developed, and encoding/decoding may be one of those rougher edges. 
For the use cases I'm aware of, it seems to work well and correctly, but there are undoubtedly improvements to be made. Encoding/decoding is not exactly a simple matter, especially with regard to API design. Ultimately, no library can abstract it away from users' need to understand it. And I want plz's API to not have to change any more than necessary over time, so I'd want to be very deliberate with any changes to it. So it's appealing to do as little as possible in this regard, leaving as much as possible to the upstream user to handle outside of plz. One way to do that is to do what hyperdrive.el is basically doing now, to tell plz to tell curl to handle the data as binary, i.e. to pass it through unchanged. But it seems that we haven't covered all of the bases with regard to these issues; rather, we have tested a subset of them that seem to work as expected. Also, where it's possible to make plz DTRT automatically, integrating naturally with Elisp APIs and data structures, I'm certainly in favor of that. So, e.g. automatically using a buffer's expected encoding when passing its data to curl seems like the right thing to do, which plz doesn't do yet (and perhaps we could do the same thing when returning a buffer of data). Of course, AFAIK we can't do such a thing when passing a string, so I guess the most we can do there is document recommended patterns for the user; IOW I'm tempted to leave encoding of strings to the user rather than add another argument for that, but we can talk about it. Thanks, Adam ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: How to get buffer byte length (not number of characters)? 2024-08-22 12:26 ` Adam Porter @ 2024-08-22 12:47 ` tomas 2024-08-23 6:28 ` Adam Porter 2024-08-22 13:50 ` Eli Zaretskii 1 sibling, 1 reply; 40+ messages in thread From: tomas @ 2024-08-22 12:47 UTC (permalink / raw) To: Adam Porter; +Cc: Joseph Turner, Eli Zaretskii, emacs-devel, schwab [-- Attachment #1: Type: text/plain, Size: 1280 bytes --] On Thu, Aug 22, 2024 at 07:26:58AM -0500, Adam Porter wrote: [...] > From an API perspective, I'm not sure. My idea for plz.el is to provide a > simple, somewhat idiomatic Elisp API for making HTTP requests (and, of > course, to make "correct" requests, in compliance with specifications and > expectations). Given the relatively few clients of plz thus far, some > issues are yet to be fully explored and developed, and encoding/decoding may > be one of those rougher edges. For the use cases I'm aware of, it seems to > work well and correctly, but there are undoubtedly improvements to be made. Another point I haven't seen in this discussion is that HTTP also may carry metadata about what it thinks the content encoding is. This may involve the server configuration too. You may choose to ignore it, but then you need to convince all the moving parts to agree on that (i.e. an Apache on the other side might happily tell your client that it is sending "text/plain; charset=UTF-8" or something similarly funny (note: UTF-8 isn't a charset :-) depending on the web server config, the content of some mimetype database and so on. You'll have to make sure curl and Emacs do the right thing with that (which might be "nothing"). Cheers -- t [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 195 bytes --] ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: How to get buffer byte length (not number of characters)? 2024-08-22 12:47 ` tomas @ 2024-08-23 6:28 ` Adam Porter 0 siblings, 0 replies; 40+ messages in thread From: Adam Porter @ 2024-08-23 6:28 UTC (permalink / raw) To: tomas; +Cc: Joseph Turner, Eli Zaretskii, emacs-devel, schwab On 8/22/24 07:47, tomas@tuxteam.de wrote: > Another point I haven't seen in this discussion is that HTTP also may > carry metadata about what it thinks the content encoding is. This may > involve the server configuration too. > > You may choose to ignore it, but then you need to convince all the > moving parts to agree on that (i.e. an Apache on the other side might > happily tell your client that it is sending "text/plain; charset=UTF-8" > or something similarly funny (note: UTF-8 isn't a charset :-) depending > on the web server config, the content of some mimetype database and > so on. > > You'll have to make sure curl and Emacs do the right thing with that > (which might be "nothing"). plz optionally decodes response bodies according to the content-type header, depending on the :decode argument (although Joseph found a bug in the implementation of that, and I'll merge his fix soon). plz does not do anything else, e.g. setting a buffer's coding system variables. If :decode is nil, plz does no decoding and leaves it to the user. For HTML, with its potential for having a META tag that may also specify encoding, that would be up to the user; plz offers no special support for various content types. If a user wants to handle these issues manually, the ":as 'response" argument to plz should be used, which allows the user to process the response headers and body directly. ^ permalink raw reply [flat|nested] 40+ messages in thread
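For callers who pass `:decode nil' and handle decoding themselves, the behaviour described above (use the Content-Type charset, otherwise UTF-8) could be approximated as in the sketch below. This is not plz's actual implementation; `my-decode-body' is a hypothetical name, and the charset extraction is a deliberately simple regexp that ignores HTML META declarations.

```elisp
;; Hypothetical helper; a rough approximation, not plz.el code.

(defun my-decode-body (bytes content-type)
  "Decode the unibyte string BYTES according to CONTENT-TYPE.
CONTENT-TYPE is a header value such as \"text/html; charset=iso-8859-1\",
or nil.  Fall back to UTF-8 when no usable charset is given."
  (let* ((charset (and content-type
                       (string-match "charset=\"?\\([^\";[:space:]]+\\)"
                                     content-type)
                       (match-string 1 content-type)))
         ;; `coding-system-from-name' returns nil for unknown names.
         (coding (or (and charset (coding-system-from-name charset))
                     'utf-8)))
    (decode-coding-string bytes coding)))
```

A caller using the ":as 'response" form mentioned above would take the header value and the undecoded body from the response and pass them to a function like this one.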
* Re: How to get buffer byte length (not number of characters)? 2024-08-22 12:26 ` Adam Porter 2024-08-22 12:47 ` tomas @ 2024-08-22 13:50 ` Eli Zaretskii 2024-08-23 6:31 ` Adam Porter 1 sibling, 1 reply; 40+ messages in thread From: Eli Zaretskii @ 2024-08-22 13:50 UTC (permalink / raw) To: Adam Porter; +Cc: joseph, emacs-devel, schwab > Date: Thu, 22 Aug 2024 07:26:58 -0500 > Cc: emacs-devel@gnu.org, schwab@suse.de > From: Adam Porter <adam@alphapapa.net> > > > It could be. Eli, Adam, what do you think about the default coding > > systems for encoding the request body in the attached patch? > > From an API perspective, I'm not sure. My idea for plz.el is to > provide a simple, somewhat idiomatic Elisp API for making HTTP requests > (and, of course, to make "correct" requests, in compliance with > specifications and expectations). Given the relatively few clients of > plz thus far, some issues are yet to be fully explored and developed, > and encoding/decoding may be one of those rougher edges. For the use > cases I'm aware of, it seems to work well and correctly, but there are > undoubtedly improvements to be made. > > Encoding/decoding is not exactly a simple matter, especially with regard > to API design. Ultimately, no library can abstract it away from users' > need to understand it. And I want plz's API to not have to change any > more than necessary over time, so I'd want to be very deliberate with > any changes to it. So it's appealing to do as little as possible in > this regard, leaving as much as possible to the upstream user to handle > outside of plz. But AFAICT, plz.el does decode the stuff it gets from curl, which doesn't seem to be consistent with what you say above. If plz.el would accept unibyte text and return unibyte text, that would be consistent: it would mean that any callers of plz.el need to do encoding and decoding themselves. But that doesn't seem to be the case now. Am I missing something? 
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: How to get buffer byte length (not number of characters)? 2024-08-22 13:50 ` Eli Zaretskii @ 2024-08-23 6:31 ` Adam Porter 2024-08-23 6:51 ` Eli Zaretskii 2024-08-23 7:07 ` Joseph Turner 0 siblings, 2 replies; 40+ messages in thread From: Adam Porter @ 2024-08-23 6:31 UTC (permalink / raw) To: Eli Zaretskii; +Cc: joseph, emacs-devel, schwab On 8/22/24 08:50, Eli Zaretskii wrote: > But AFAICT, plz.el does decode the stuff it gets from curl, which > doesn't seem to be consistent with what you say above. If plz.el > would accept unibyte text and return unibyte text, that would be > consistent: it would mean that any callers of plz.el need to do > encoding and decoding themselves. But that doesn't seem to be the > case now. > > Am I missing something? Yes, the :decode argument to plz. If :decode is nil (or if ":as 'binary" is specified, which sets :decode to nil), plz does not decode the response body. ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: How to get buffer byte length (not number of characters)? 2024-08-23 6:31 ` Adam Porter @ 2024-08-23 6:51 ` Eli Zaretskii 2024-08-23 7:07 ` Joseph Turner 1 sibling, 0 replies; 40+ messages in thread From: Eli Zaretskii @ 2024-08-23 6:51 UTC (permalink / raw) To: Adam Porter; +Cc: joseph, emacs-devel, schwab > Date: Fri, 23 Aug 2024 01:31:16 -0500 > Cc: joseph@ushin.org, emacs-devel@gnu.org, schwab@suse.de > From: Adam Porter <adam@alphapapa.net> > > On 8/22/24 08:50, Eli Zaretskii wrote: > > > But AFAICT, plz.el does decode the stuff it gets from curl, which > > doesn't seem to be consistent with what you say above. If plz.el > > would accept unibyte text and return unibyte text, that would be > > consistent: it would mean that any callers of plz.el need to do > > encoding and decoding themselves. But that doesn't seem to be the > > case now. > > > > Am I missing something? > > Yes, the :decode argument to plz. If :decode is nil (or if ":as > 'binary" is specified, which sets :decode to nil), plz does not decode > the response body. That is obviously NOT the part I was missing... ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: How to get buffer byte length (not number of characters)?
  2024-08-23  6:31 ` Adam Porter
  2024-08-23  6:51 ` Eli Zaretskii
@ 2024-08-23  7:07 ` Joseph Turner
  2024-08-23  7:58 ` Joseph Turner
  1 sibling, 1 reply; 40+ messages in thread
From: Joseph Turner @ 2024-08-23 7:07 UTC (permalink / raw)
  To: Adam Porter; +Cc: Eli Zaretskii, emacs-devel, schwab

[-- Attachment #1: Type: text/plain, Size: 1140 bytes --]

Adam Porter <adam@alphapapa.net> writes:

> On 8/22/24 08:50, Eli Zaretskii wrote:
>
>> But AFAICT, plz.el does decode the stuff it gets from curl, which
>> doesn't seem to be consistent with what you say above.  If plz.el
>> would accept unibyte text and return unibyte text, that would be
>> consistent: it would mean that any callers of plz.el need to do
>> encoding and decoding themselves.  But that doesn't seem to be the
>> case now.
>> Am I missing something?
>
> Yes, the :decode argument to plz.  If :decode is nil (or if ":as
> 'binary" is specified, which sets :decode to nil), plz does not decode
> the response body.

Currently, GET decodes by default while PUT does no encoding by default.

IIUC, the suggestion is that GET and PUT requests either both handle
coding by default or neither does by default.

PUT requests which pass an unencoded buffer with multibyte characters
currently send the internal Emacs multibyte representation.

I'd be in favor of adding some automatic encoding handling for PUT
requests so that most users don't have to think about it.

Please see patch!
(not tested yet) Best, Joseph [-- Warning: decoded text below may be mangled, UTF-8 assumed --] [-- Attachment #2: 0001-Add-plz-ENCODE-argument-Add-default-encoding.patch --] [-- Type: text/x-diff, Size: 3932 bytes --] From 6a8f13fa799f4ba8b64effe229379dc54ef19c91 Mon Sep 17 00:00:00 2001 From: Joseph Turner <joseph@breatheoutbreathe.in> Date: Thu, 22 Aug 2024 00:02:19 -0700 Subject: [PATCH] Add: (plz) ENCODE argument; Add default encoding Previously, PUT requests which pass an unencoded string or buffer with multibyte characters sent the internal Emacs multibyte representation. Now strings and buffers are encoded by default, and the ENCODE nil argument (or :BODY-TYPE 'binary) can used when the user wants to handle encoding. WIP --- plz.el | 28 ++++++++++++++++++++++++---- 1 file changed, 24 insertions(+), 4 deletions(-) diff --git a/plz.el b/plz.el index 903d71e..ffbbe0b 100644 --- a/plz.el +++ b/plz.el @@ -325,7 +325,8 @@ (defalias 'plz--generate-new-buffer (cl-defun plz (method url &rest rest &key headers body else filter finally noquery timeout (as 'string) (then 'sync) - (body-type 'text) (decode t decode-s) + (body-type 'text) (encode t encode-s) + (decode t decode-s) (connect-timeout plz-connect-timeout)) "Request METHOD from URL with curl. Return the curl process object or, for a synchronous request, the @@ -340,6 +341,11 @@ (cl-defun plz (method url &rest rest &key headers body else filter finally noque BODY-TYPE may be `text' to send BODY as text, or `binary' to send it as binary. +If ENCODE is non-nil, BODY is encoded automatically. For binary +content, it should be nil. When BODY-TYPE is `binary', ENCODE is +automatically set to nil. ENCODE has no effect when BODY is a +list like `(file FILENAME)'. + AS selects the kind of result to pass to the callback function THEN, or the kind of result to return for synchronous requests. 
It may be: @@ -416,6 +422,8 @@ (cl-defun plz (method url &rest rest &key headers body else filter finally noque (declare (indent defun)) (setf decode (if (and decode-s (not decode)) nil decode)) + (setf encode (if (and encode-s (not encode)) + nil encode)) ;; NOTE: By default, for PUT requests and POST requests >1KB, curl sends an ;; "Expect:" header, which causes servers to send a "100 Continue" response, which ;; we don't want to have to deal with, so we disable it by setting the header to @@ -473,6 +481,9 @@ (cl-defun plz (method url &rest rest &key headers body else filter finally noque (decode (pcase as ('binary nil) (_ decode))) + (encode (pcase body-type + ('binary nil) + (_ encode))) (default-directory ;; Avoid making process in a nonexistent directory (in case the current ;; default-directory has since been removed). It's unclear what the best @@ -553,9 +564,18 @@ (cl-defun plz (method url &rest rest &key headers body else filter finally noque (process-send-string process curl-config) (when body (cl-typecase body - (string (process-send-string process body)) - (buffer (with-current-buffer body - (process-send-region process (point-min) (point-max)))))) + (string (process-send-string + process (if encode + (encode-coding-string + body (cdr default-process-coding-system)) + body))) + (buffer (if encode + (with-temp-buffer + (insert-buffer-substring-no-properties body) + (encode-coding-region (point-min) (point-max) body-coding) + (process-send-region process (point-min) (point-max))) + (with-current-buffer body + (process-send-region process (point-min) (point-max))))))) (process-send-eof process) (if sync-p (unwind-protect -- 2.41.0 ^ permalink raw reply related [flat|nested] 40+ messages in thread
* Re: How to get buffer byte length (not number of characters)? 2024-08-23 7:07 ` Joseph Turner @ 2024-08-23 7:58 ` Joseph Turner 0 siblings, 0 replies; 40+ messages in thread From: Joseph Turner @ 2024-08-23 7:58 UTC (permalink / raw) To: Adam Porter; +Cc: Eli Zaretskii, emacs-devel, schwab [-- Attachment #1: Type: text/plain, Size: 134 bytes --] Joseph Turner <joseph@ushin.org> writes: > Please see patch! (not tested yet) I made a typo in the last patch. Here's a new one. [-- Warning: decoded text below may be mangled, UTF-8 assumed --] [-- Attachment #2: 0001-Add-plz-ENCODE-argument-Add-default-encoding.patch --] [-- Type: text/x-diff, Size: 3981 bytes --] From 9ff971c6bbf00ebfe33a6e8993a006a168b4c6cb Mon Sep 17 00:00:00 2001 From: Joseph Turner <joseph@breatheoutbreathe.in> Date: Thu, 22 Aug 2024 00:02:19 -0700 Subject: [PATCH] Add: (plz) ENCODE argument; Add default encoding Previously, PUT requests which pass an unencoded string or buffer with multibyte characters sent the internal Emacs multibyte representation. Now strings and buffers are encoded by default, and the ENCODE nil argument (or :BODY-TYPE 'binary) can used when the user wants to handle encoding. WIP --- plz.el | 29 +++++++++++++++++++++++++---- 1 file changed, 25 insertions(+), 4 deletions(-) diff --git a/plz.el b/plz.el index 903d71e..2a2077d 100644 --- a/plz.el +++ b/plz.el @@ -325,7 +325,8 @@ (defalias 'plz--generate-new-buffer (cl-defun plz (method url &rest rest &key headers body else filter finally noquery timeout (as 'string) (then 'sync) - (body-type 'text) (decode t decode-s) + (body-type 'text) (encode t encode-s) + (decode t decode-s) (connect-timeout plz-connect-timeout)) "Request METHOD from URL with curl. Return the curl process object or, for a synchronous request, the @@ -340,6 +341,11 @@ (cl-defun plz (method url &rest rest &key headers body else filter finally noque BODY-TYPE may be `text' to send BODY as text, or `binary' to send it as binary. 
+If ENCODE is non-nil, BODY is encoded automatically. For binary +content, it should be nil. When BODY-TYPE is `binary', ENCODE is +automatically set to nil. ENCODE has no effect when BODY is a +list like `(file FILENAME)'. + AS selects the kind of result to pass to the callback function THEN, or the kind of result to return for synchronous requests. It may be: @@ -416,6 +422,8 @@ (cl-defun plz (method url &rest rest &key headers body else filter finally noque (declare (indent defun)) (setf decode (if (and decode-s (not decode)) nil decode)) + (setf encode (if (and encode-s (not encode)) + nil encode)) ;; NOTE: By default, for PUT requests and POST requests >1KB, curl sends an ;; "Expect:" header, which causes servers to send a "100 Continue" response, which ;; we don't want to have to deal with, so we disable it by setting the header to @@ -473,6 +481,9 @@ (cl-defun plz (method url &rest rest &key headers body else filter finally noque (decode (pcase as ('binary nil) (_ decode))) + (encode (pcase body-type + ('binary nil) + (_ encode))) (default-directory ;; Avoid making process in a nonexistent directory (in case the current ;; default-directory has since been removed). 
It's unclear what the best @@ -553,9 +564,19 @@ (cl-defun plz (method url &rest rest &key headers body else filter finally noque (process-send-string process curl-config) (when body (cl-typecase body - (string (process-send-string process body)) - (buffer (with-current-buffer body - (process-send-region process (point-min) (point-max)))))) + (string (process-send-string + process (if encode + (encode-coding-string + body (cdr default-process-coding-system)) + body))) + (buffer (if encode + (with-temp-buffer + (insert-buffer-substring-no-properties body) + (encode-coding-region + (point-min) (point-max) (cdr default-process-coding-system)) + (process-send-region process (point-min) (point-max))) + (with-current-buffer body + (process-send-region process (point-min) (point-max))))))) (process-send-eof process) (if sync-p (unwind-protect -- 2.41.0 ^ permalink raw reply related [flat|nested] 40+ messages in thread
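The string branch of the patch above comes down to calling `encode-coding-string' before the body is handed to the process. A minimal sketch of that step on its own, assuming `utf-8' as the default coding system (the patch uses (cdr default-process-coding-system) instead); `my/encode-body' is a hypothetical helper and not part of plz.el:

```elisp
;; Sketch only: `my/encode-body' is a hypothetical name, not a plz.el function.
(defun my/encode-body (body &optional coding-system)
  "Return the string BODY encoded as a unibyte string.
CODING-SYSTEM defaults to `utf-8', which is an assumption of this
sketch rather than what the patch does."
  (encode-coding-string body (or coding-system 'utf-8)))

(let ((encoded (my/encode-body "你好")))
  (list (multibyte-string-p encoded)   ; nil: the encoded result is unibyte
        (length encoded)))             ; byte count of the encoded body
;; => (nil 6)
```

Because the encoded string is unibyte, its `length' is the number of bytes that would go over the wire, which is the figure the original question was after for strings.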
* Re: How to get buffer byte length (not number of characters)? 2024-08-21 9:20 ` Joseph Turner 2024-08-21 17:47 ` Eli Zaretskii @ 2024-08-22 7:09 ` Andreas Schwab 2024-08-22 7:30 ` Joseph Turner 1 sibling, 1 reply; 40+ messages in thread From: Andreas Schwab @ 2024-08-22 7:09 UTC (permalink / raw) To: Joseph Turner; +Cc: Eli Zaretskii, emacs-devel, Adam Porter On Aug 21 2024, Joseph Turner wrote: > Let's say I have a `iso-latin-1'-encoded file "foo.txt" on my local > filesystem. I upload this encoded file to my hyperdrive by passing the > filename to curl, which uploads the bytes with no conversion. Then I > open the "foo.txt" hyperdrive file using hyperdrive.el, which receives > the contents via curl from the local HTTP server. In the hyperdrive > file buffer, buffer-file-coding-system should be `iso-latin-1' (right?). That depends on the coding system priorities. Since latin-1 cannot be identified unambiguously, only the priority can distinguish it from other 8-bit coding systems. Also, if the file contains only ASCII characters, buffer-file-coding-system will be set to undecided. -- Andreas Schwab, SUSE Labs, schwab@suse.de GPG Key fingerprint = 0196 BAD8 1CE9 1970 F4BE 1748 E4D4 88E3 0EEA B9D7 "And now for something completely different." ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: How to get buffer byte length (not number of characters)? 2024-08-22 7:09 ` Andreas Schwab @ 2024-08-22 7:30 ` Joseph Turner 2024-08-22 11:05 ` Eli Zaretskii 0 siblings, 1 reply; 40+ messages in thread From: Joseph Turner @ 2024-08-22 7:30 UTC (permalink / raw) To: Andreas Schwab; +Cc: Eli Zaretskii, emacs-devel, Adam Porter Andreas Schwab <schwab@suse.de> writes: > On Aug 21 2024, Joseph Turner wrote: > >> Let's say I have a `iso-latin-1'-encoded file "foo.txt" on my local >> filesystem. I upload this encoded file to my hyperdrive by passing the >> filename to curl, which uploads the bytes with no conversion. Then I >> open the "foo.txt" hyperdrive file using hyperdrive.el, which receives >> the contents via curl from the local HTTP server. In the hyperdrive >> file buffer, buffer-file-coding-system should be `iso-latin-1' (right?). > > That depends on the coding system priorities. Since latin-1 cannot be > identified unambiguously, only the priority can distinguish it from > other 8-bit coding systems. Also, if the file contains only ASCII > characters, buffer-file-coding-system will be set to undecided. Thank you! I think you just answered my question to Eli: > What fallback encoding should we use if there's no 'Content-Type' > charset and `set-auto-coding' returns nil? IIUC, there's no foolproof > way to guess the encoding of unknown bytes. Should we use (coding-system-priority-list t) as the final fallback? Joseph ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: How to get buffer byte length (not number of characters)? 2024-08-22 7:30 ` Joseph Turner @ 2024-08-22 11:05 ` Eli Zaretskii 0 siblings, 0 replies; 40+ messages in thread From: Eli Zaretskii @ 2024-08-22 11:05 UTC (permalink / raw) To: Joseph Turner; +Cc: schwab, emacs-devel, adam > From: Joseph Turner <joseph@ushin.org> > Cc: Eli Zaretskii <eliz@gnu.org>, emacs-devel@gnu.org, Adam Porter > <adam@alphapapa.net> > Date: Thu, 22 Aug 2024 00:30:51 -0700 > > Should we use (coding-system-priority-list t) as the final fallback? detect-coding-region already does that. ^ permalink raw reply [flat|nested] 40+ messages in thread
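A minimal sketch of asking `detect-coding-region' for its best guess on undecoded bytes, as suggested above. `my/guess-coding-system' is a hypothetical helper; the result depends on the coding system priorities of the running Emacs, so no particular return value is claimed here:

```elisp
;; Sketch only: `my/guess-coding-system' is a hypothetical helper.
(defun my/guess-coding-system (bytes)
  "Return the highest-priority coding system detected for BYTES.
BYTES is a unibyte string, e.g. an undecoded HTTP response body."
  (with-temp-buffer
    ;; Keep the buffer unibyte so the raw bytes are examined as-is.
    (set-buffer-multibyte nil)
    (insert bytes)
    ;; A non-nil third argument asks for the single best candidate
    ;; instead of the full list of possibilities.
    (detect-coding-region (point-min) (point-max) t)))

;; The bytes of "café" in ISO Latin-1.  Which 8-bit coding system is
;; reported depends on the priority list, as Andreas Schwab notes.
(my/guess-coding-system (unibyte-string #x63 #x61 #x66 #xe9))
```

For ASCII-only input the guess is expected to be some variant of `undecided', in line with the remark earlier in the thread.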
* Re: How to get buffer byte length (not number of characters)? 2024-08-20 11:15 ` Eli Zaretskii 2024-08-21 9:20 ` Joseph Turner @ 2024-08-26 6:37 ` Joseph Turner 2024-08-26 6:49 ` Joseph Turner 2024-08-26 11:20 ` Eli Zaretskii 1 sibling, 2 replies; 40+ messages in thread From: Joseph Turner @ 2024-08-26 6:37 UTC (permalink / raw) To: Eli Zaretskii; +Cc: emacs-devel Eli Zaretskii <eliz@gnu.org> writes: > you need to consider the encoding process: Emacs always encodes the > buffer text on output to the external world. If this is what you > want, then you need to use bufferpos-to-filepos, and make sure you > pass the correct coding-system argument to it. Will the following code ever signal an error? (bufferpos-to-filepos (point-max) 'exact (select-safe-coding-system (point-min) (point-max))) The `bufferpos-to-filepos' docstring says, "It is an error to request the ‘exact’ method when the buffer’s EOL format is not yet decided." IOW, does `select-safe-coding-system' always return an encoding which specifies EOL conversion? Thank you!! Joseph ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: How to get buffer byte length (not number of characters)? 2024-08-26 6:37 ` Joseph Turner @ 2024-08-26 6:49 ` Joseph Turner 2024-08-26 11:22 ` Eli Zaretskii 2024-08-26 11:20 ` Eli Zaretskii 1 sibling, 1 reply; 40+ messages in thread From: Joseph Turner @ 2024-08-26 6:49 UTC (permalink / raw) To: Eli Zaretskii; +Cc: emacs-devel Joseph Turner <joseph@ushin.org> writes: > Eli Zaretskii <eliz@gnu.org> writes: > >> you need to consider the encoding process: Emacs always encodes the >> buffer text on output to the external world. If this is what you >> want, then you need to use bufferpos-to-filepos, and make sure you >> pass the correct coding-system argument to it. > > Will the following code ever signal an error? > > (bufferpos-to-filepos > (point-max) 'exact > (select-safe-coding-system (point-min) (point-max))) > > The `bufferpos-to-filepos' docstring says, "It is an error to request > the ‘exact’ method when the buffer’s EOL format is not yet decided." > > IOW, does `select-safe-coding-system' always return an encoding which > specifies EOL conversion? Let me rephrase: I would like to get the size of a buffer's text encoded with the return value of select-safe-coding-system, which may return an encoding which does not specify EOL conversion. Is there any way to calculate the `exact' buffer text size using bufferpos-to-filepos? Or is `approximate' the only viable argument in this case? Thanks! Joseph ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: How to get buffer byte length (not number of characters)? 2024-08-26 6:49 ` Joseph Turner @ 2024-08-26 11:22 ` Eli Zaretskii 2024-08-27 4:48 ` Joseph Turner 0 siblings, 1 reply; 40+ messages in thread From: Eli Zaretskii @ 2024-08-26 11:22 UTC (permalink / raw) To: Joseph Turner; +Cc: emacs-devel > From: Joseph Turner <joseph@ushin.org> > Cc: emacs-devel@gnu.org > Date: Sun, 25 Aug 2024 23:49:52 -0700 > > Joseph Turner <joseph@ushin.org> writes: > > > Eli Zaretskii <eliz@gnu.org> writes: > > > >> you need to consider the encoding process: Emacs always encodes the > >> buffer text on output to the external world. If this is what you > >> want, then you need to use bufferpos-to-filepos, and make sure you > >> pass the correct coding-system argument to it. > > > > Will the following code ever signal an error? > > > > (bufferpos-to-filepos > > (point-max) 'exact > > (select-safe-coding-system (point-min) (point-max))) > > > > The `bufferpos-to-filepos' docstring says, "It is an error to request > > the ‘exact’ method when the buffer’s EOL format is not yet decided." > > > > IOW, does `select-safe-coding-system' always return an encoding which > > specifies EOL conversion? > > Let me rephrase: I would like to get the size of a buffer's text encoded > with the return value of select-safe-coding-system, which may return an > encoding which does not specify EOL conversion. Is there any way to > calculate the `exact' buffer text size using bufferpos-to-filepos? > > Or is `approximate' the only viable argument in this case? Unless you must deal with exotic encodings (like iso-2022 and its derivatives), I suggest to always use 'approximate'. ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: How to get buffer byte length (not number of characters)? 2024-08-26 11:22 ` Eli Zaretskii @ 2024-08-27 4:48 ` Joseph Turner 0 siblings, 0 replies; 40+ messages in thread From: Joseph Turner @ 2024-08-27 4:48 UTC (permalink / raw) To: Eli Zaretskii; +Cc: emacs-devel Eli Zaretskii <eliz@gnu.org> writes: >> From: Joseph Turner <joseph@ushin.org> >> Cc: emacs-devel@gnu.org >> Date: Sun, 25 Aug 2024 23:49:52 -0700 >> >> Joseph Turner <joseph@ushin.org> writes: >> >> > Eli Zaretskii <eliz@gnu.org> writes: >> > >> >> you need to consider the encoding process: Emacs always encodes the >> >> buffer text on output to the external world. If this is what you >> >> want, then you need to use bufferpos-to-filepos, and make sure you >> >> pass the correct coding-system argument to it. >> > >> > Will the following code ever signal an error? >> > >> > (bufferpos-to-filepos >> > (point-max) 'exact >> > (select-safe-coding-system (point-min) (point-max))) >> > >> > The `bufferpos-to-filepos' docstring says, "It is an error to request >> > the ‘exact’ method when the buffer’s EOL format is not yet decided." >> > >> > IOW, does `select-safe-coding-system' always return an encoding which >> > specifies EOL conversion? >> >> Let me rephrase: I would like to get the size of a buffer's text encoded >> with the return value of select-safe-coding-system, which may return an >> encoding which does not specify EOL conversion. Is there any way to >> calculate the `exact' buffer text size using bufferpos-to-filepos? >> >> Or is `approximate' the only viable argument in this case? > > Unless you must deal with exotic encodings (like iso-2022 and its > derivatives), I suggest to always use 'approximate'. Thank you! I will do that. ^ permalink raw reply [flat|nested] 40+ messages in thread
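A minimal sketch of the `approximate' approach recommended above, assuming the caller passes a coding system whose EOL conversion is already decided (such as `utf-8-unix'). `my/buffer-encoded-size' is a hypothetical helper name:

```elisp
;; Sketch only: `my/buffer-encoded-size' is a hypothetical helper.
(defun my/buffer-encoded-size (coding-system)
  "Estimate the byte size of the current buffer's text when encoded.
CODING-SYSTEM is assumed to specify an EOL conversion, e.g. `utf-8-unix'."
  (save-restriction
    ;; Measure the whole buffer even if it is narrowed.
    (widen)
    ;; The file position of `point-max' is a 0-based byte offset, so it
    ;; doubles as the size of the encoded text.
    (bufferpos-to-filepos (point-max) 'approximate coding-system)))

(with-temp-buffer
  (insert "你好")
  (my/buffer-encoded-size 'utf-8-unix))
```

This avoids both writing a file and switching the buffer to unibyte; for encodings such as the iso-2022 family the estimate may differ from the true encoded size, as noted above.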
* Re: How to get buffer byte length (not number of characters)? 2024-08-26 6:37 ` Joseph Turner 2024-08-26 6:49 ` Joseph Turner @ 2024-08-26 11:20 ` Eli Zaretskii 1 sibling, 0 replies; 40+ messages in thread From: Eli Zaretskii @ 2024-08-26 11:20 UTC (permalink / raw) To: Joseph Turner; +Cc: emacs-devel > From: Joseph Turner <joseph@ushin.org> > Cc: emacs-devel@gnu.org > Date: Sun, 25 Aug 2024 23:37:49 -0700 > > IOW, does `select-safe-coding-system' always return an encoding which > specifies EOL conversion? Yes, provided that either buffer-file-coding-system or its default value specify a particular EOL conversion. ^ permalink raw reply [flat|nested] 40+ messages in thread
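Whether a given coding system specifies an EOL conversion can be checked with `coding-system-eol-type' before asking `bufferpos-to-filepos' for an `exact' answer. A minimal sketch; `my/eol-decided-p' is a hypothetical helper:

```elisp
;; Sketch only: `my/eol-decided-p' is a hypothetical helper.
(defun my/eol-decided-p (coding-system)
  "Return non-nil if CODING-SYSTEM specifies a particular EOL conversion.
`coding-system-eol-type' returns an integer (0 = LF, 1 = CRLF, 2 = CR)
when the EOL format is decided, and a vector of candidate coding
systems when it is not."
  (integerp (coding-system-eol-type coding-system)))

(list (my/eol-decided-p 'utf-8-unix)   ; EOL fixed to LF
      (my/eol-decided-p 'utf-8))       ; EOL left undecided
;; => (t nil)
```

A caller could use such a check on the value returned by `select-safe-coding-system' and fall back to `approximate' when it yields nil.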
* Re: How to get buffer byte length (not number of characters)? 2024-08-20 7:10 How to get buffer byte length (not number of characters)? Joseph Turner 2024-08-20 7:51 ` Joseph Turner 2024-08-20 11:15 ` Eli Zaretskii @ 2024-08-20 11:24 ` Andreas Schwab 2 siblings, 0 replies; 40+ messages in thread From: Andreas Schwab @ 2024-08-20 11:24 UTC (permalink / raw) To: Joseph Turner; +Cc: Emacs Devel Mailing List On Aug 20 2024, Joseph Turner wrote: > How can I get a buffer's byte length without writing to a file? That looks like a XY problem. Why do you need to know that? -- Andreas Schwab, SUSE Labs, schwab@suse.de GPG Key fingerprint = 0196 BAD8 1CE9 1970 F4BE 1748 E4D4 88E3 0EEA B9D7 "And now for something completely different." ^ permalink raw reply [flat|nested] 40+ messages in thread