From: Eli Zaretskii
Newsgroups: gmane.emacs.devel
Subject: Re: How to get buffer byte length (not number of characters)?
Date: Thu, 22 Aug 2024 07:06:32 +0300
Message-ID: <86plq1td4n.fsf@gnu.org>
References: <87wmkbekjp.fsf@ushin.org> <86o75nwilg.fsf@gnu.org> <87bk1lhkvg.fsf@ushin.org> <86y14pu5rp.fsf@gnu.org> <871q2hfn7c.fsf@ushin.org>
In-Reply-To: <871q2hfn7c.fsf@ushin.org> (message from Joseph Turner on Wed, 21 Aug 2024 16:52:39 -0700)
To: Joseph Turner
Cc: emacs-devel@gnu.org, schwab@suse.de, adam@alphapapa.net

> From: Joseph Turner
> Cc: emacs-devel@gnu.org, schwab@suse.de, adam@alphapapa.net
> Date: Wed, 21 Aug 2024 16:52:39 -0700
>
> Eli Zaretskii writes:
>
> >> Currently, plz.el always creates the curl subprocess like so:
> >>
> >>   (make-process :coding 'binary ...)
> >>
> >> https://git.savannah.gnu.org/cgit/emacs/elpa.git/tree/plz.el?h=externals-release/plz#n519
> >>
> >> Does this DTRT?
> >
> > It could be TRT if plz.el encodes the buffer text "by hand" before
> > sending the results to curl and decodes it when it receives text from
> > curl.  Which I think is what happens there.
>
> plz.el does not manually encode buffer text *within Emacs* when sending
> requests to curl, but by default, plz.el sends data to curl with --data,
> which tells curl to strip CR and newlines.  With the :body-type 'binary
> argument, plz.el instead uses --data-binary, which does no conversion.

Newlines are a relatively minor issue (although they, too, need to be
considered).  My main concern is with the text encoding.  How can it
be TRT to use 'binary when sending buffer text to curl?  That would
mean we are more or less always sending the internal representation
of characters, which is a superset of UTF-8.
If the data was originally encoded in anything but UTF-8, reading it
into Emacs and then sending it back will change the byte sequences
from that other encoding to UTF-8.  Moreover, 'binary does not
guarantee that the result is valid UTF-8.

So maybe I misunderstand how these plz.el facilities are used, but up
front this sounds like a mistake.

> We don't want to strip newlines from hyperdrive files, so we always use
> :body-type 'binary when sending buffer contents.  Should hyperdrive.el
> encode data with `buffer-file-coding-system' before passing to plz.el?

I would think so, but maybe we should bring the plz.el developers on
board for this discussion.

> When receiving text from curl, plz.el optionally decodes the text
> according to the charset in the 'Content-Type' header, e.g., "text/html;
> charset=utf-8" or utf-8 if no charset is found.

By "optionally", do you mean that it happens only if the caller
requests it?  If so, the caller of plz.el should decode the text
manually before using it in user-facing features.

> Perhaps hyperdrive.el should check the 'Content-Type' header charset,
> then fallback to guessing the coding system based on filename and file
> contents with `set-auto-coding' (to avoid decoding images, etc.), and
> then finally fallback to something else?

Probably.  But then I don't know anything about hyperdrive.el, either.
If it copies text between files or URLs without showing it to the
user, then the best strategy is indeed not to decode and encode stuff,
but to handle it as a stream of raw bytes.  (In that case, my
suggestion would be to use unibyte buffers and strings for temporarily
storing and processing these raw bytes in Emacs.)  But if the text is
somehow shown to the user, it must be decoded to be displayed
correctly by Emacs.  And then it must be encoded back when writing it
back to the external storage.
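
To illustrate the encode-before-sending and decode-before-display
steps above, here is a minimal Emacs Lisp sketch.  The `example-'
function names are hypothetical and are not part of plz.el or
hyperdrive.el, and falling back to utf-8 when no coding system is
known is an assumption made only for this sketch.

```elisp
;;; -*- lexical-binding: t; -*-
;; Hypothetical helpers for illustration only; the `example-' names do
;; not exist in plz.el or hyperdrive.el.

(defun example-encode-buffer-for-upload (&optional coding-system)
  "Return the current buffer's text as a unibyte string of encoded bytes.
CODING-SYSTEM defaults to `buffer-file-coding-system'; falling back to
`utf-8' after that is an assumption of this sketch."
  (encode-coding-string (buffer-string)
                        (or coding-system buffer-file-coding-system 'utf-8)))

(defun example-decode-response (bytes coding-system)
  "Decode the unibyte string BYTES with CODING-SYSTEM for display."
  (decode-coding-string bytes coding-system))

;; The same 4 characters occupy a different number of bytes depending
;; on the encoding, so the byte length must be taken after encoding.
(with-temp-buffer
  (insert "caf\u00e9")
  (let ((latin1 (example-encode-buffer-for-upload 'iso-8859-1))
        (utf8 (example-encode-buffer-for-upload 'utf-8)))
    (message "%S"
             (list (buffer-size)              ; characters in the buffer
                   (string-bytes latin1)      ; bytes as ISO-8859-1
                   (string-bytes utf8)        ; bytes as UTF-8
                   (multibyte-string-p utf8)))))
;; prints (4 4 5 nil)
```

The encoded strings are unibyte, so they can be handed to a process
whose coding system is 'binary without any further conversion, and
`string-bytes' on them gives the number of bytes that will actually be
sent.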