From: Stefan Monnier
Newsgroups: gmane.emacs.devel
Subject: Re: creating unibyte strings
Date: Fri, 22 Mar 2019 11:37:59 -0400
References: <83y3b4wdw9.fsf@gnu.org> <83tvhal45r.fsf@gnu.org> <83h8bwt1on.fsf@gnu.org> <83bm24t0hv.fsf@gnu.org> <83wokrs6en.fsf@gnu.org> <837ecrrqdm.fsf@gnu.org> <83sgvfq6yv.fsf@gnu.org>
To: emacs-devel@gnu.org

[ Boy this discussion is really frustrating.  I should have just added
  the damn feature and moved on.  Now I'm stuck in this morass!  ]

>> But this has nothing to do with the modules API: it's not more tricky
>> then when doing it purely in Elisp.  Are you seriously suggesting we
>> deprecate unibyte strings altogether?
> We won't deprecate unibyte strings, but we decided long ago to
> minimize their use.

Minimizing their use doesn't mean that the places where they are used
are less important.  Sometimes what you need is a unibyte string and
nothing else will do.

It also doesn't explain why you want to make it extra cumbersome for
modules whereas Elisp can still do it conveniently.

>> Then I don't know what subtleties you're talking about.
>> Can you give some examples of the kinds of things you're thinking of?
> String concatenation, for one.  Regular expression search for another.
> And those just the ones I thought about in the first 5 seconds.

I don't see in which way these are better hidden for multibyte strings
than they are for unibyte strings.

>> >> > Instead, how about doing that via vectors of byte values?
>> >> What's the advantage?  That seems even more convoluted: create a Lisp
>> >> vector of the right size (i.e.
>> >> 8x the size of your string on a 64bit
>> >> system), loop over your string turning each byte into a Lisp integer
>> >> (with the reverted API, this involves allocation of an `emacs_value`
>> >> box), then pass that to `concat`?
>> > That's one way, but I'm sure I can come up with a simpler one. ;-)
>> I'm all ears.
> Provide an Emacs primitive for that, then at least some of the
> awkwardness is gone.

No matter the primitive you provide, it means that to build a unibyte
Elisp string out of a C char[], you're suggesting we go through an extra
copy that uses up 8x the memory.  With such inefficient interfaces, the
whole idea of writing modules becomes completely unattractive: better
write a separate application and communicate via pipes (then you can get
unibyte strings in the natural way).

> And/or use records.

I don't understand what you mean by "use records".

>> >> It's probably going to be even less efficient than going through utf-8
>> >> and back.
>> > I doubt that.  It's just an assignment.  And it's a rare situation
>> > anyway.
>> Why do you think it's rare?
> Because the number of Emacs features that require you to submit a
> unibyte string is very small.

Maybe rare in terms of the number of lines of code that will want to do
it.  But that doesn't mean rare in terms of the number of times it'll be
executed for a specific user, so performance considerations should apply.

>> 2- the C side string contains text in latin-1, big5, younameit.
>>    The module API provides nothing convenient.  Should we force our
>>    module to link to C-side coding-system libraries to convert to utf-8
>>    before passing it on to the Elisp, even though Emacs already has all
>>    the needed facilities?  Really?
>
> Yes, really.  Why is that a problem?  libiconv exists on every
> platform we support, and is easy to use.
> Moreover, if you just want
> to convert a native string into another native string, using Emacs
> built-in en/decoding machinery is inconvenient, because it involves
> more copying than necessary.

The idea is not to use Emacs as a C library for text conversion, but
that if you receive a latin-1 string and want to pass it to Emacs, it
makes a lot more sense to do:

    make_bytestring (s)

and later

    (decode-coding-string s)

than to have to link with libiconv.

>> 3- The C side string contains binary data (say PNG images).
>>    What does "arrange for it to be UTF-8" even mean?
> Nothing, since in this case there's no meaning to "decoding".

My point exactly: what should be done instead?
The solution currently used for this existing case is to call
make_string on it (even though it's not a utf-8 string) and then pass it
through

    (encode-coding-string s 'utf-8)

which is ridiculously inefficient compared to what make_bytestring
would do.


        Stefan
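
To put a rough number on the "8x the memory" figure for the
vector-of-bytes route, here is a minimal sketch in plain C.  It does not
use the module API at all: `lisp_word` is a hypothetical stand-in for one
slot of a Lisp vector, assumed to be one machine word, and the vector
header and any per-element `emacs_value` boxes are ignored, so the real
cost could only be higher.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one slot of a Lisp vector.  Assumption: a
   slot is one machine word, as on a typical 64-bit build.  */
typedef intptr_t lisp_word;

/* Bytes needed to keep N payload bytes in a plain char[].  */
static size_t
bytes_as_string (size_t n)
{
  return n * sizeof (char);
}

/* Bytes needed for the slots of a vector holding one element per
   payload byte.  The vector header is not counted.  */
static size_t
bytes_as_vector (size_t n)
{
  /* Guard against overflow in the multiplication below.  */
  assert (n <= SIZE_MAX / sizeof (lisp_word));
  return n * sizeof (lisp_word);
}

/* Ratio between the two representations; equal to the word size in
   bytes, i.e. 8 where a machine word is 64 bits wide.  */
static size_t
overhead_factor (void)
{
  return sizeof (lisp_word) / sizeof (char);
}
```

For a 1 MiB PNG, `bytes_as_vector (1024 * 1024)` is the size of the
intermediate copy that the vector route would need before `concat` even
starts building the final string.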