From: Eli Zaretskii
Newsgroups: gmane.emacs.devel
Subject: Re: [PATCH] Add a mechanism for passing unibyte strings from lisp to modules.
Date: Sat, 22 Jun 2024 09:50:07 +0300
Message-ID: <86frt5jwtc.fsf@gnu.org>
References: <86v822jeqh.fsf@gnu.org> <225D336D-933E-4CA3-B245-89992D7E6C41@umanwizard.com>
In-Reply-To: <225D336D-933E-4CA3-B245-89992D7E6C41@umanwizard.com> (message from Brennan Vincent on Fri, 21 Jun 2024 16:14:05 -0400)
To: Brennan Vincent
Cc: emacs-devel@gnu.org

> From: Brennan Vincent
> Date: Fri, 21 Jun 2024 16:14:05 -0400
> Cc: emacs-devel@gnu.org
>
> > Please describe the motivation and real-life use cases for this.
>
> As far as I know, unibyte strings are the only efficient way to represent arbitrary binary buffers in emacs. If that’s not true, I’d be happy to be corrected.
>
> I think there are many possible cases where module authors will want to communicate binary data, but I’ll just describe one (my own). I’m working on a major mode that reads ELF files (whose contents it stores in a unibyte buffer) and provides various features like disassembling code. To do this it passes chunks of code to a module which in turn passes them to the Capstone disassembly library. To do this without being able to pass unibyte strings, I have to take the string of bytes, expand it to a vector of bytes, pass that to the module, and have the module copy each byte back out in a loop. This is very inefficient.

Why can't you have the module code itself read the file, instead of getting the bytes from Emacs?
Passing large amounts of bytes from Emacs to a module is a very inefficient way of talking to modules anyway, because Emacs is not optimized for moving text to and fro in the shape of Lisp strings. To say nothing of the GC pressure you will have in your mode, due to a constant consing of strings. It is best to avoid all that to begin with.

> > In general, we want to minimize the use of unibyte strings in Emacs.
>
> Why?

Because dealing with unibyte text in Emacs is tricky and causes many subtle bugs.

> What else should be used instead to represent arbitrary bytes?

Emacs is not a program to deal with raw bytes, except in rare exceptional cases. Dealing with binary data is definitely NOT one of the exceptions I'd like to see in Emacs. Emacs is primarily a text-processing environment, so processing binary data is way off its main purpose.

> > I also don't understand the need for unibyte-string-p, since we
> > already have multibyte-string-p.
>
> That’s fair, I only added it so I could use it as an argument to CHECK_TYPE.

You can easily use CHECK_STRING, followed by checking that the string is unibyte.

And here you already hit the first subtlety of using unibyte text in Emacs:

  (multibyte-string-p (decode-coding-string "abcdefg" 'utf-8))
    => t

IOW, a plain-ASCII string can sometimes be a multibyte string, which would fail your naïve test for no good reason.
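
For illustration only, here is a minimal Emacs Lisp sketch of the "string, and not multibyte" check discussed above. The helper name `my-unibyte-string-p` is hypothetical; it is not part of Emacs or of the proposed patch. It relies only on the standard functions `stringp`, `multibyte-string-p`, and `string-to-multibyte`, and the second call shows the same kind of pitfall as the example above: ASCII-only text can still be held in a multibyte string.

```elisp
;; Hypothetical helper for illustration; not part of Emacs or of the patch.
(defun my-unibyte-string-p (object)
  "Return non-nil if OBJECT is a string whose representation is unibyte."
  (and (stringp object)
       (not (multibyte-string-p object))))

;; An ASCII-only string literal is read as a unibyte string,
;; so it passes the check.
(my-unibyte-string-p "abcdefg")

;; `string-to-multibyte' keeps the same characters but returns a
;; multibyte string, so the same ASCII text no longer passes.
(my-unibyte-string-p (string-to-multibyte "abcdefg"))
```

A check of this shape tests the string's internal representation, not whether its contents happen to be representable as bytes, which is why it can reject text that looks perfectly byte-like.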