From mboxrd@z Thu Jan 1 00:00:00 1970
From: Eli Zaretskii
Newsgroups: gmane.emacs.devel
Subject: Re: creating unibyte strings
Date: Fri, 22 Mar 2019 17:11:52 +0200
Message-ID: <83sgvfq6yv.fsf@gnu.org>
References: <83y3b4wdw9.fsf@gnu.org> <83tvhal45r.fsf@gnu.org> <83h8bwt1on.fsf@gnu.org> <83bm24t0hv.fsf@gnu.org> <83wokrs6en.fsf@gnu.org> <837ecrrqdm.fsf@gnu.org>
In-reply-to: (message from Stefan Monnier on Fri, 22 Mar 2019 10:23:20 -0400)
To: Stefan Monnier
Cc: emacs-devel@gnu.org
List-Id: "Emacs development discussions."
Xref: news.gmane.org gmane.emacs.devel:234585

> From: Stefan Monnier
> Date: Fri, 22 Mar 2019 10:23:20 -0400
>
> >> I don't see what's subtle about "unibyte" strings, as long as you
> >> understand that these are strings of *bytes* instead of strings
> >> of *characters* (i.e. they're `int8[]` rather than `w_char_t[]`).
> >
> > That's the subtlety, right there.  Handling such "strings" in Emacs
> > Lisp can produce strange and unexpected results for someone who is not
> > aware of the difference and its implications.
>
> But this has nothing to do with the modules API: it's not more tricky
> than when doing it purely in Elisp.  Are you seriously suggesting we
> deprecate unibyte strings altogether?

We won't deprecate unibyte strings, but we decided long ago to
minimize their use.

> >> "Multibyte" strings are just as subtle (maybe more so even), yet we
> >> rightly don't hesitate to offer a primitive way to construct them.
> > Because we succeed in hiding the subtleties in that case,
> > so the multibyte nature is not really visible on the Lisp level,
> > unless you try very hard to make it so.
>
> Then I don't know what subtleties you're talking about.
> Can you give some examples of the kinds of things you're thinking of?

String concatenation, for one.  Regular expression search for another.
And those are just the ones I thought about in the first 5 seconds.

> >> > Instead, how about doing that via vectors of byte values?
> >> What's the advantage?  That seems even more convoluted: create a Lisp
> >> vector of the right size (i.e.
> >> 8x the size of your string on a 64bit
> >> system), loop over your string turning each byte into a Lisp integer
> >> (with the reverted API, this involves allocation of an `emacs_value`
> >> box), then pass that to `concat`?
> > That's one way, but I'm sure I can come up with a simpler one. ;-)
>
> I'm all ears.

Provide an Emacs primitive for that, then at least some of the
awkwardness is gone.  And/or use records.

> >> It's probably going to be even less efficient than going through utf-8
> >> and back.
> > I doubt that.  It's just an assignment.  And it's a rare situation
> > anyway.
>
> Why do you think it's rare?

Because the number of Emacs features that require you to submit a
unibyte string is very small.

> It's pretty common to receive non-utf-8 byte streams from the external world.
> And when you do receive them, it can come at a very fast pace and become
> temporarily anything but rare.

Are you talking about text encoded in some non-UTF-8 encoding?  If so,
it should be converted to UTF-8, and that will solve the problem.  If
it isn't text, then what common use cases are you talking about?

> 2- the C side string contains text in latin-1, big5, younameit.
>    The module API provides nothing convenient.  Should we force our
>    module to link to C-side coding-system libraries to convert to utf-8
>    before passing it on to the Elisp, even though Emacs already has all
>    the needed facilities?  Really?

Yes, really.  Why is that a problem?  libiconv exists on every
platform we support, and is easy to use.  Moreover, if you just want
to convert a native string into another native string, using Emacs
built-in en/decoding machinery is inconvenient, because it involves
more copying than necessary.

> 3- The C side string contains binary data (say PNG images).
>    What does "arrange for it to be UTF-8" even mean?

Nothing, since in this case there's no meaning to "decoding".
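
For illustration only, here is a rough sketch of what the libiconv
route for case 2 might look like on the module side.  It assumes the
POSIX iconv interface is available; `to_utf8` is a hypothetical helper
name, not something the module API provides, and the 4x output bound
is simply a generous guess for legacy encodings such as latin-1 or
big5.

```c
#include <iconv.h>
#include <stdlib.h>

/* Hypothetical helper (not part of emacs-module.h): convert LEN bytes
   of SRC from the encoding named FROM into UTF-8.  Returns a malloc'ed,
   NUL-terminated buffer and stores its length (without the NUL) in
   *OUT_LEN, or returns NULL on any failure.  */
char *
to_utf8 (const char *from, const char *src, size_t len, size_t *out_len)
{
  iconv_t cd = iconv_open ("UTF-8", from);
  if (cd == (iconv_t) -1)
    return NULL;

  /* Assumed bound: at most 4 output bytes per input byte.  */
  size_t cap = len * 4 + 1;
  char *out = malloc (cap);
  if (!out)
    {
      iconv_close (cd);
      return NULL;
    }

  char *in_p = (char *) src;    /* iconv's prototype is not const-correct.  */
  char *out_p = out;
  size_t in_left = len;
  size_t out_left = cap - 1;    /* Keep room for the trailing NUL.  */

  /* A single call suffices here because the whole input is in memory
     and the output buffer is sized for the assumed worst case.  */
  if (iconv (cd, &in_p, &in_left, &out_p, &out_left) == (size_t) -1)
    {
      free (out);
      iconv_close (cd);
      return NULL;
    }
  iconv_close (cd);

  *out_len = (size_t) (out_p - out);
  out[*out_len] = '\0';
  return out;
}
```

A module function would presumably call something like
`to_utf8 ("ISO-8859-1", buf, buf_len, &n)`, pass the result and `n` to
`env->make_string`, and then free the buffer.  Binary data (case 3)
is not covered by this sketch, since there is nothing to decode.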