From: Stefan Monnier
Newsgroups: gmane.emacs.devel
Subject: Re: creating unibyte strings
Date: Fri, 22 Mar 2019 10:23:20 -0400
To: emacs-devel@gnu.org

>> >> Which reminds me: could someone add to the module API a primitive to
>> >> build a *unibyte* string?
>> > I don't like adding such a primitive.  We don't want to proliferate
>> > unibyte strings in Emacs through that back door, because manipulating
>> > unibyte strings involves subtle issues many Lisp programmers are not
>> > aware of.
>>
>> I don't see what's subtle about "unibyte" strings, as long as you
>> understand that these are strings of *bytes* instead of strings
>> of *characters* (i.e. they're `int8[]` rather than `w_char_t[]`).
>
> That's the subtlety, right there.  Handling such "strings" in Emacs
> Lisp can produce strange and unexpected results for someone who is not
> aware of the difference and its implications.

But this has nothing to do with the modules API: it's no more tricky
than when doing it purely in Elisp.  Are you seriously suggesting we
deprecate unibyte strings altogether?

>> "Multibyte" strings are just as subtle (maybe more so even), yet we
>> rightly don't hesitate to offer a primitive way to construct them.
> Because we succeed to hide the subtleties in that case,
> so the multibyte nature is not really visible on the Lisp level,
> unless you try very hard to make it so.

Then I don't know what subtleties you're talking about.
Can you give some examples of the kinds of things you're thinking of?

>> > Instead, how about doing that via vectors of byte values?
>> What's the advantage?  That seems even more convoluted: create a Lisp
>> vector of the right size (i.e. 8x the size of your string on a 64bit
>> system), loop over your string turning each byte into a Lisp integer
>> (with the reverted API, this involves allocation of an `emacs_value`
>> box), then pass that to `concat`?
> That's one way, but I'm sure I can come up with a simpler one. ;-)

I'm all ears.

>> It's probably going to be even less efficient than going through utf-8
>> and back.
> I doubt that.  It's just an assignment.  And it's a rare situation
> anyway.

Why do you think it's rare?  It's pretty common to receive non-utf-8
byte streams from the external world.  And when you do receive them, they
can come at a very fast pace and become temporarily anything but rare.

>> Think about cases where the module receives byte strings from the disk
>> or the network and need to pass that to `decode-coding-string`.
>> And consider that we might be talking about megabytes of strings.
> They don't need to decode, they just need to arrange for it to be
> UTF-8.

Three possibilities:
1- The C side string contains utf-8 text.  The module API provides just
   the right operation, we're good to go.
2- The C side string contains text in latin-1, big5, you name it.  The
   module API provides nothing convenient.  Should we force our module
   to link to C-side coding-system libraries to convert to utf-8 before
   passing it on to Elisp, even though Emacs already has all the
   needed facilities?  Really?
3- The C side string contains binary data (say PNG images).  What does
   "arrange for it to be UTF-8" even mean?


        Stefan


PS: The PNG case is not hypothetical at all, it's what prompted my request.
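
To make the three cases above concrete, here is a minimal C sketch.  It is not part of Emacs or of the module API: `valid_utf8` and the three sample arrays are names invented for this illustration.  It only checks whether a byte sequence is well-formed UTF-8, which is the precondition a UTF-8-only string constructor relies on.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper (not an Emacs function): return true if
   buf[0..len) is well-formed UTF-8, i.e. no overlong forms, no
   surrogates, and nothing above U+10FFFF.  */
static bool
valid_utf8 (const unsigned char *buf, size_t len)
{
  size_t i = 0;
  while (i < len)
    {
      unsigned char c = buf[i];
      size_t extra;        /* continuation bytes expected after C */
      unsigned long min;   /* smallest code point legal at this length */
      unsigned long cp;    /* code point being assembled */

      if (c < 0x80)
        {
          i++;             /* plain ASCII byte */
          continue;
        }
      else if ((c & 0xE0) == 0xC0) { extra = 1; min = 0x80;    cp = c & 0x1F; }
      else if ((c & 0xF0) == 0xE0) { extra = 2; min = 0x800;   cp = c & 0x0F; }
      else if ((c & 0xF8) == 0xF0) { extra = 3; min = 0x10000; cp = c & 0x07; }
      else
        return false;      /* stray continuation byte or invalid lead byte */

      if (len - i <= extra)
        return false;      /* sequence truncated at end of buffer */

      for (size_t k = 1; k <= extra; k++)
        {
          unsigned char cc = buf[i + k];
          if ((cc & 0xC0) != 0x80)
            return false;  /* expected a 10xxxxxx continuation byte */
          cp = (cp << 6) | (cc & 0x3F);
        }

      if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return false;      /* overlong, out of range, or surrogate */

      i += extra + 1;
    }
  return true;
}

/* Case 1: "café" encoded as UTF-8.  */
static const unsigned char utf8_text[] = { 'c', 'a', 'f', 0xC3, 0xA9 };

/* Case 2: "café" encoded as latin-1.  */
static const unsigned char latin1_text[] = { 'c', 'a', 'f', 0xE9 };

/* Case 3: the 8-byte PNG file signature.  */
static const unsigned char png_signature[] =
  { 0x89, 'P', 'N', 'G', '\r', '\n', 0x1A, '\n' };
```

Only `utf8_text` passes the check.  The latin-1 bytes fail because 0xE9 looks like the start of a three-byte sequence that never arrives, and the PNG signature fails on its very first byte, 0x89, which can only be a continuation byte.  So for cases 2 and 3 the module would first have to transform the data into something else before it could hand it over as UTF-8 text.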