[PATCH] Add a mechanism for passing unibyte strings from lisp to modules.

all messages for Emacs-related lists mirrored at yhetil.org
 help / color / mirror / code / Atom feed

* [PATCH] Add a mechanism for passing unibyte strings from lisp to modules.
@ 2024-06-21 18:13 Brennan Vincent
  2024-06-21 18:13 ` Brennan Vincent
  2024-06-21 19:08 ` Eli Zaretskii
  0 siblings, 2 replies; 18+ messages in thread
From: Brennan Vincent @ 2024-06-21 18:13 UTC (permalink / raw)
  To: emacs-devel

Since the introduction of make_unibyte_string, it has been possible to pass
raw binary data from modules to lisp, but not the other way around
(except by using vectors of bytes, which is inefficient). This
patch implements that feature so that raw binary data can be sent both ways.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH] Add a mechanism for passing unibyte strings from lisp to modules.
  2024-06-21 18:13 [PATCH] Add a mechanism for passing unibyte strings from lisp to modules Brennan Vincent
@ 2024-06-21 18:13 ` Brennan Vincent
  2024-06-21 19:08 ` Eli Zaretskii
  1 sibling, 0 replies; 18+ messages in thread
From: Brennan Vincent @ 2024-06-21 18:13 UTC (permalink / raw)
  To: emacs-devel; +Cc: Brennan Vincent

From: Brennan Vincent <brennan@umanwizard.com>

* src/data.c
(unibyte-string-p): New function.

* src/emacs-module.c
(module_copy_unibyte_string_contents): New function.

* src/module-env-30.h: Expose the above to modules.

* doc/lispref/internals.texi: Document the above.
---
 doc/lispref/internals.texi | 28 ++++++++++++++++++++++++++++
 src/data.c                 | 13 +++++++++++++
 src/emacs-module.c         | 37 +++++++++++++++++++++++++++++++++++++
 src/module-env-30.h        | 18 ++++++++++++++++++
 4 files changed, 96 insertions(+)

diff --git a/doc/lispref/internals.texi b/doc/lispref/internals.texi
index a5480a9bf8a..282ee5e1746 100644
--- a/doc/lispref/internals.texi
+++ b/doc/lispref/internals.texi
@@ -1725,6 +1725,34 @@ Module Values
 the text copying.
 @end deftypefn
 
+@deftypefn Function bool copy_unibyte_string_contents (emacs_env *@var{env}, emacs_value @var{arg}, char *@var{buf}, ptrdiff_t *@var{len})
+This function stores the raw bytes of a unibyte Lisp string specified
+by @var{arg} in the array of @code{char} pointed by @var{buf}, which
+should have enough space to hold at least @code{*@var{len}} bytes,
+including the terminating null byte.  The argument @var{len} must not
+be a @code{NULL} pointer, and, when the function is called, it should
+point to a value that specifies the size of @var{buf} in bytes.
+
+If the buffer size specified by @code{*@var{len}} is large enough to
+hold the string's bytes, the function stores in @code{*@var{len}} the
+actual number of bytes copied to @var{buf}, including the terminating
+null byte, and returns @code{true}.  If the buffer is too small, the
+function raises the @code{args-out-of-range} error condition, stores
+the required number of bytes in @code{*@var{len}}, and returns
+@code{false}.  @xref{Module Nonlocal}, for how to handle pending error
+conditions.
+
+The argument @var{buf} can be a @code{NULL} pointer, in which case the
+function stores in @code{*@var{len}} the number of bytes required for
+storing the contents of @var{arg}, and returns @code{true}.  This is
+how you can determine the size of @var{buf} needed to store a
+particular string: first call @code{copy_unibyte_string_contents} with
+@code{NULL} as @var{buf}, then allocate enough memory to hold the
+number of bytes stored by the function in @code{*@var{len}}, and call
+the function again with non-@code{NULL} @var{buf} to actually perform
+the text copying.
+@end deftypefn
+
 @deftypefn Function emacs_value vec_get (emacs_env *@var{env}, emacs_value @var{vector}, ptrdiff_t @var{index})
 This function returns the element of @var{vector} at @var{index}.  The
 @var{index} of the first vector element is zero.  The function raises
diff --git a/src/data.c b/src/data.c
index 3490d4985c9..7d8d5a46779 100644
--- a/src/data.c
+++ b/src/data.c
@@ -429,6 +429,17 @@ DEFUN ("multibyte-string-p", Fmultibyte_string_p, Smultibyte_string_p,
   return Qnil;
 }
 
+DEFUN("unibyte-string-p", Funibyte_string_p, Sunibyte_string_p,
+      1, 1, 0,
+      doc: /* Return t if OBJECT is a unibyte string.
+Return nil if OBJECT is either a multibyte string, or not a string.  */)
+  (Lisp_Object object)
+{
+  if (STRINGP (object) && !STRING_MULTIBYTE (object))
+    return Qt;
+  return Qnil;
+}
+
 DEFUN ("char-table-p", Fchar_table_p, Schar_table_p, 1, 1, 0,
        doc: /* Return t if OBJECT is a char-table.  */)
   (Lisp_Object object)
@@ -4023,6 +4034,7 @@ syms_of_data (void)
   DEFSYM (Qnatnump, "natnump");
   DEFSYM (Qwholenump, "wholenump");
   DEFSYM (Qstringp, "stringp");
+  DEFSYM (Qunibyte_string_p, "unibyte-string-p");
   DEFSYM (Qarrayp, "arrayp");
   DEFSYM (Qsequencep, "sequencep");
   DEFSYM (Qbufferp, "bufferp");
@@ -4219,6 +4231,7 @@ #define PUT_ERROR(sym, tail, msg)			\
   defsubr (&Skeywordp);
   defsubr (&Sstringp);
   defsubr (&Smultibyte_string_p);
+  defsubr (&Sunibyte_string_p);
   defsubr (&Svectorp);
   defsubr (&Srecordp);
   defsubr (&Schar_table_p);
diff --git a/src/emacs-module.c b/src/emacs-module.c
index 08db39b0b0d..69192cd7fd2 100644
--- a/src/emacs-module.c
+++ b/src/emacs-module.c
@@ -769,6 +769,42 @@ module_make_float (emacs_env *env, double d)
   return value;
 }
 
+static bool
+module_copy_unibyte_string_contents (emacs_env *env, emacs_value value, char *buf,
+				     ptrdiff_t *len)
+{
+  MODULE_FUNCTION_BEGIN (false);
+  Lisp_Object lisp_str = value_to_lisp (value);
+  CHECK_TYPE (STRINGP (lisp_str) && !STRING_MULTIBYTE (lisp_str),
+	      Qunibyte_string_p, lisp_str);
+
+  ptrdiff_t raw_size = SBYTES (lisp_str);
+  ptrdiff_t required_buf_size = raw_size + 1;
+
+  if (buf == NULL)
+    {
+      *len = required_buf_size;
+      MODULE_INTERNAL_CLEANUP();
+      return true;
+    }
+
+  if (*len < required_buf_size)
+    {
+      ptrdiff_t actual = *len;
+      *len = required_buf_size;
+      args_out_of_range_3 (INT_TO_INTEGER (actual),
+                           INT_TO_INTEGER (required_buf_size),
+                           INT_TO_INTEGER (PTRDIFF_MAX));
+    }
+
+  *len = required_buf_size;
+  memcpy(buf, SDATA (lisp_str), required_buf_size);
+
+  MODULE_INTERNAL_CLEANUP();
+  return true;
+}
+
+
 static bool
 module_copy_string_contents (emacs_env *env, emacs_value value, char *buf,
 			     ptrdiff_t *len)
@@ -1568,6 +1604,7 @@ initialize_environment (emacs_env *env, struct emacs_env_private *priv)
   env->extract_float = module_extract_float;
   env->make_float = module_make_float;
   env->copy_string_contents = module_copy_string_contents;
+  env->copy_unibyte_string_contents = module_copy_unibyte_string_contents;
   env->make_string = module_make_string;
   env->make_unibyte_string = module_make_unibyte_string;
   env->make_user_ptr = module_make_user_ptr;
diff --git a/src/module-env-30.h b/src/module-env-30.h
index e75210c7f8e..5837bfbf195 100644
--- a/src/module-env-30.h
+++ b/src/module-env-30.h
@@ -1,3 +1,21 @@
   /* Add module environment functions newly added in Emacs 30 here.
      Before Emacs 30 is released, remove this comment and start
      module-env-31.h on the master branch.  */
+
+  /* Copy the content of the Lisp string VALUE to BUFFER as an utf8
+     null-terminated string.
+
+     SIZE must point to the total size of the buffer.  If BUFFER is
+     NULL or if SIZE is not big enough, write the required buffer size
+     to SIZE and return true.
+
+     Note that SIZE must include the last null byte (e.g. "abc" needs
+     a buffer of size 4).
+
+     Return true if the string was successfully copied.  */
+
+  bool (*copy_unibyte_string_contents) (emacs_env *env,
+					emacs_value value,
+					char *buf,
+					ptrdiff_t *len)
+    EMACS_ATTRIBUTE_NONNULL(1, 4);
-- 
2.41.0





^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH] Add a mechanism for passing unibyte strings from lisp to modules.
  2024-06-21 18:13 [PATCH] Add a mechanism for passing unibyte strings from lisp to modules Brennan Vincent
  2024-06-21 18:13 ` Brennan Vincent
@ 2024-06-21 19:08 ` Eli Zaretskii
  2024-06-21 20:14   ` Brennan Vincent
  1 sibling, 1 reply; 18+ messages in thread
From: Eli Zaretskii @ 2024-06-21 19:08 UTC (permalink / raw)
  To: Brennan Vincent; +Cc: emacs-devel

> From: Brennan Vincent <brennan@umanwizard.com> 
> Date: Fri, 21 Jun 2024 14:13:14 -0400
> 
> Since the introduction of make_unibyte_string, it has been possible to pass
> raw binary data from modules to lisp, but not the other way around
> (except by using vectors of bytes, which is inefficient). This
> patch implements that feature so that raw binary data can be sent both ways.

Please describe the motivation and real-life use cases for this.

In general, we want to minimize the use of unibyte strings in Emacs.

I also don't understand the need for unibyte-string-p, since we
already have multibyte-string-p.



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] Add a mechanism for passing unibyte strings from lisp to modules.
  2024-06-21 19:08 ` Eli Zaretskii
@ 2024-06-21 20:14   ` Brennan Vincent
  2024-06-22  6:50     ` Eli Zaretskii
  0 siblings, 1 reply; 18+ messages in thread
From: Brennan Vincent @ 2024-06-21 20:14 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

> On Jun 21, 2024, at 15:08, Eli Zaretskii <eliz@gnu.org> wrote:
> 
> 
>> 
>> From: Brennan Vincent <brennan@umanwizard.com>
>> Date: Fri, 21 Jun 2024 14:13:14 -0400
>> 
>> Since the introduction of make_unibyte_string, it has been possible to pass
>> raw binary data from modules to lisp, but not the other way around
>> (except by using vectors of bytes, which is inefficient). This
>> patch implements that feature so that raw binary data can be sent both ways.
> 
> Please describe the motivation and real-life use cases for this.

As far as I know, unibyte strings are the only efficient way to represent arbitrary binary buffers in emacs. If that’s not true, I’d be happy to be corrected.

I think there are many possible cases where module authors will want to communicate binary data, but I’ll just describe one (my own). I’m working on a major mode that reads ELF files (whose contents it stores in a unibyte buffer) and provides various features like disassembling code. To do this it passes chunks of code to a module which in turn passes them to the Capstone disassembly library. To do this without being able to pass unibyte strings, I have to take the string of bytes, expand it to a vector of bytes, pass that to the module, and have the module copy each byte back out in a loop. This is very inefficient.

> In general, we want to minimize the use of unibyte strings in Emacs.

Why? What else should be used instead to represent arbitrary bytes?

> 
> I also don't understand the need for unibyte-string-p, since we
> already have multibyte-string-p.

That’s fair, I only added it so I could use it as an argument to CHECK_TYPE.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] Add a mechanism for passing unibyte strings from lisp to modules.
  2024-06-21 20:14   ` Brennan Vincent
@ 2024-06-22  6:50     ` Eli Zaretskii
       [not found]       ` <87o77t6lyn.fsf@taipei.mail-host-address-is-not-set>
  0 siblings, 1 reply; 18+ messages in thread
From: Eli Zaretskii @ 2024-06-22  6:50 UTC (permalink / raw)
  To: Brennan Vincent; +Cc: emacs-devel

> From: Brennan Vincent <brennan@umanwizard.com>
> Date: Fri, 21 Jun 2024 16:14:05 -0400
> Cc: emacs-devel@gnu.org
> 
> > Please describe the motivation and real-life use cases for this.
> 
> As far as I know, unibyte strings are the only efficient way to represent arbitrary binary buffers in emacs. If that’s not true, I’d be happy to be corrected.
> 
> I think there are many possible cases where module authors will want to communicate binary data, but I’ll just describe one (my own). I’m working on a major mode that reads ELF files (whose contents it stores in a unibyte buffer) and provides various features like disassembling code. To do this it passes chunks of code to a module which in turn passes them to the Capstone disassembly library. To do this without being able to pass unibyte strings, I have to take the string of bytes, expand it to a vector of bytes, pass that to the module, and have the module copy each byte back out in a loop. This is very inefficient.

Why can't you have the module code itself read the file, instead of
getting the bytes from Emacs?  Passing large amounts of bytes from
Emacs to a module is a very inefficient way of talking to modules
anyway, because Emacs is not optimized for moving text to and fro in
the shape of Lisp strings.  To say nothing of the GC pressure you will
have in your mode, due to a constant consing of strings.  It is best
to avoid all that to begin with.

> > In general, we want to minimize the use of unibyte strings in Emacs.
> 
> Why?

Because dealing with unibyte text in Emacs is tricky and causes many
subtle bugs.

> What else should be used instead to represent arbitrary bytes?

Emacs is not a program to deal with raw bytes, except in rare
exceptional cases.  Dealing with binary data is definitely NOT one of
the exceptions I'd like to see in Emacs.  Emacs is primarily a
text-processing environment, so processing binary data is way off its
main purpose.

> > I also don't understand the need for unibyte-string-p, since we
> > already have multibyte-string-p.
> 
> That’s fair, I only added it so I could use it as an argument to CHECK_TYPE.

You can easily use CHECK_STRING, followed by checking that the string
is unibyte.

And here you already hit the first subtlety of using unibyte text in
Emacs:

  (multibyte-string-p (decode-coding-string "abcdefg" 'utf-8))
   => t

IOW, a plain-ASCII string can sometimes be a multibyte string, which
would fail your naïve test for no good reason.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] Add a mechanism for passing unibyte strings from lisp to modules.
       [not found]       ` <87o77t6lyn.fsf@taipei.mail-host-address-is-not-set>
@ 2024-06-22 16:12         ` Eli Zaretskii
  2024-06-23 21:15           ` Andrea Corallo
  0 siblings, 1 reply; 18+ messages in thread
From: Eli Zaretskii @ 2024-06-22 16:12 UTC (permalink / raw)
  To: Brennan Vincent, Stefan Kangas, Andrea Corallo; +Cc: emacs-devel

> From: "Brennan Vincent" <brennan@umanwizard.com>
> Date: Sat, 22 Jun 2024 11:22:56 -0400
> 
> Eli Zaretskii <eliz@gnu.org> writes:
> 
> > Why can't you have the module code itself read the file, instead of
> > getting the bytes from Emacs?  Passing large amounts of bytes from
> > Emacs to a module is a very inefficient way of talking to modules
> > anyway, because Emacs is not optimized for moving text to and fro in
> > the shape of Lisp strings.  To say nothing of the GC pressure you will
> > have in your mode, due to a constant consing of strings.  It is best
> > to avoid all that to begin with.
> 
> Of course it's possible to do that, but I wanted to write my mode in
> elisp as much as possible and keep the C side minimal, simply because I
> find elisp a much more enjoyable language to use. But if
> you are opposed to adding this code I can go with that approach.
> 
> Another possibility which would avoid adding specifically
> unibyte-related surface area to the modules API would be to create an
> extended version of copy_string_contents which can take any coding
> system, rather than forcing UTF-8.
> 
> Would you be open to such an approach? If so, I will send an updated patch.

I very much dislike the idea of letting modules deal with unibyte
strings, for the reasons I explained.  Basically, it will open a large
Pandora box by allowing people who don't know enough about the
subtleties of unibyte strings in Emacs to write buggy modules which
will crash Emacs.

But let's hear the other co-maintainers.  Stefan and Andrea, what is
your POV on these issues?



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] Add a mechanism for passing unibyte strings from lisp to modules.
  2024-06-22 16:12         ` Eli Zaretskii
@ 2024-06-23 21:15           ` Andrea Corallo
  2024-06-24 11:45             ` Eli Zaretskii
  0 siblings, 1 reply; 18+ messages in thread
From: Andrea Corallo @ 2024-06-23 21:15 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Brennan Vincent, Stefan Kangas, emacs-devel

Eli Zaretskii <eliz@gnu.org> writes:

>> From: "Brennan Vincent" <brennan@umanwizard.com>
>> Date: Sat, 22 Jun 2024 11:22:56 -0400
>> 
>> Eli Zaretskii <eliz@gnu.org> writes:
>> 
>> > Why can't you have the module code itself read the file, instead of
>> > getting the bytes from Emacs?  Passing large amounts of bytes from
>> > Emacs to a module is a very inefficient way of talking to modules
>> > anyway, because Emacs is not optimized for moving text to and fro in
>> > the shape of Lisp strings.  To say nothing of the GC pressure you will
>> > have in your mode, due to a constant consing of strings.  It is best
>> > to avoid all that to begin with.
>> 
>> Of course it's possible to do that, but I wanted to write my mode in
>> elisp as much as possible and keep the C side minimal, simply because I
>> find elisp a much more enjoyable language to use. But if
>> you are opposed to adding this code I can go with that approach.
>> 
>> Another possibility which would avoid adding specifically
>> unibyte-related surface area to the modules API would be to create an
>> extended version of copy_string_contents which can take any coding
>> system, rather than forcing UTF-8.
>> 
>> Would you be open to such an approach? If so, I will send an updated patch.
>
> I very much dislike the idea of letting modules deal with unibyte
> strings, for the reasons I explained.  Basically, it will open a large
> Pandora box by allowing people who don't know enough about the
> subtleties of unibyte strings in Emacs to write buggy modules which
> will crash Emacs.
>
> But let's hear the other co-maintainers.  Stefan and Andrea, what is
> your POV on these issues?

I, for one, would be not too much worried.  People writing modules
should be already very responsible for what they write as they have
already plenty of ways to shoot in their feet 🤷.

Perhaps we could mitigate the risk with some doc/comment explaining the
specific usecase this interface is meant to serve so it's not miss-used?

  Andrea



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] Add a mechanism for passing unibyte strings from lisp to modules.
  2024-06-23 21:15           ` Andrea Corallo
@ 2024-06-24 11:45             ` Eli Zaretskii
  2024-06-25 17:36               ` Brennan Vincent
  0 siblings, 1 reply; 18+ messages in thread
From: Eli Zaretskii @ 2024-06-24 11:45 UTC (permalink / raw)
  To: Andrea Corallo; +Cc: brennan, stefankangas, emacs-devel

> From: Andrea Corallo <acorallo@gnu.org>
> Cc: "Brennan Vincent" <brennan@umanwizard.com>,  Stefan Kangas
>  <stefankangas@gmail.com>,  emacs-devel@gnu.org
> Date: Sun, 23 Jun 2024 17:15:39 -0400
> 
> Eli Zaretskii <eliz@gnu.org> writes:
> 
> >> From: "Brennan Vincent" <brennan@umanwizard.com>
> >> Date: Sat, 22 Jun 2024 11:22:56 -0400
> >> 
> >> Eli Zaretskii <eliz@gnu.org> writes:
> >> 
> >> > Why can't you have the module code itself read the file, instead of
> >> > getting the bytes from Emacs?  Passing large amounts of bytes from
> >> > Emacs to a module is a very inefficient way of talking to modules
> >> > anyway, because Emacs is not optimized for moving text to and fro in
> >> > the shape of Lisp strings.  To say nothing of the GC pressure you will
> >> > have in your mode, due to a constant consing of strings.  It is best
> >> > to avoid all that to begin with.
> >> 
> >> Of course it's possible to do that, but I wanted to write my mode in
> >> elisp as much as possible and keep the C side minimal, simply because I
> >> find elisp a much more enjoyable language to use. But if
> >> you are opposed to adding this code I can go with that approach.
> >> 
> >> Another possibility which would avoid adding specifically
> >> unibyte-related surface area to the modules API would be to create an
> >> extended version of copy_string_contents which can take any coding
> >> system, rather than forcing UTF-8.
> >> 
> >> Would you be open to such an approach? If so, I will send an updated patch.
> >
> > I very much dislike the idea of letting modules deal with unibyte
> > strings, for the reasons I explained.  Basically, it will open a large
> > Pandora box by allowing people who don't know enough about the
> > subtleties of unibyte strings in Emacs to write buggy modules which
> > will crash Emacs.
> >
> > But let's hear the other co-maintainers.  Stefan and Andrea, what is
> > your POV on these issues?
> 
> I, for one, would be not too much worried.  People writing modules
> should be already very responsible for what they write as they have
> already plenty of ways to shoot in their feet 🤷.

The problem is that we get to clean up their mess in too many cases.
Especially when the package is on ELPA.

> Perhaps we could mitigate the risk with some doc/comment explaining the
> specific usecase this interface is meant to serve so it's not miss-used?

If we want to allow Emacs to send binary data, I'd rather come up with
a specialized interface to do just that.  Explaining the subtleties of
using unibyte text in Emacs is a tough job, since it involves a lot of
low-level technical details.  When unibyte text comes from encoding
human-readable text that is at least justified, since that's what
Emacs was designed to d, among other things.  But using Emacs as a
handy method of reading binary data, to avoid doing that in the module
itself, and asking us to add an interface for that use case is too
much for my palate.



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] Add a mechanism for passing unibyte strings from lisp to modules.
  2024-06-24 11:45             ` Eli Zaretskii
@ 2024-06-25 17:36               ` Brennan Vincent
  2024-06-26 12:26                 ` Eli Zaretskii
  0 siblings, 1 reply; 18+ messages in thread
From: Brennan Vincent @ 2024-06-25 17:36 UTC (permalink / raw)
  To: Eli Zaretskii, Andrea Corallo; +Cc: stefankangas, emacs-devel

Eli Zaretskii <eliz@gnu.org> writes:

>> From: Andrea Corallo <acorallo@gnu.org>
>> Cc: "Brennan Vincent" <brennan@umanwizard.com>,  Stefan Kangas
>>  <stefankangas@gmail.com>,  emacs-devel@gnu.org
>> Date: Sun, 23 Jun 2024 17:15:39 -0400
>> 
>> Eli Zaretskii <eliz@gnu.org> writes:
>> 
>> >> From: "Brennan Vincent" <brennan@umanwizard.com>
>> >> Date: Sat, 22 Jun 2024 11:22:56 -0400
>> >> 
>> >> Eli Zaretskii <eliz@gnu.org> writes:
>> >> 
>> >> > Why can't you have the module code itself read the file, instead of
>> >> > getting the bytes from Emacs?  Passing large amounts of bytes from
>> >> > Emacs to a module is a very inefficient way of talking to modules
>> >> > anyway, because Emacs is not optimized for moving text to and fro in
>> >> > the shape of Lisp strings.  To say nothing of the GC pressure you will
>> >> > have in your mode, due to a constant consing of strings.  It is best
>> >> > to avoid all that to begin with.
>> >> 
>> >> Of course it's possible to do that, but I wanted to write my mode in
>> >> elisp as much as possible and keep the C side minimal, simply because I
>> >> find elisp a much more enjoyable language to use. But if
>> >> you are opposed to adding this code I can go with that approach.
>> >> 
>> >> Another possibility which would avoid adding specifically
>> >> unibyte-related surface area to the modules API would be to create an
>> >> extended version of copy_string_contents which can take any coding
>> >> system, rather than forcing UTF-8.
>> >> 
>> >> Would you be open to such an approach? If so, I will send an updated patch.
>> >
>> > I very much dislike the idea of letting modules deal with unibyte
>> > strings, for the reasons I explained.  Basically, it will open a large
>> > Pandora box by allowing people who don't know enough about the
>> > subtleties of unibyte strings in Emacs to write buggy modules which
>> > will crash Emacs.
>> >
>> > But let's hear the other co-maintainers.  Stefan and Andrea, what is
>> > your POV on these issues?
>> 
>> I, for one, would be not too much worried.  People writing modules
>> should be already very responsible for what they write as they have
>> already plenty of ways to shoot in their feet 🤷.
>
> The problem is that we get to clean up their mess in too many cases.
> Especially when the package is on ELPA.
>
>> Perhaps we could mitigate the risk with some doc/comment explaining the
>> specific usecase this interface is meant to serve so it's not miss-used?
>
> If we want to allow Emacs to send binary data, I'd rather come up with
> a specialized interface to do just that.  Explaining the subtleties of
> using unibyte text in Emacs is a tough job, since it involves a lot of
> low-level technical details.  When unibyte text comes from encoding
> human-readable text that is at least justified, since that's what
> Emacs was designed to d, among other things.  But using Emacs as a
> handy method of reading binary data, to avoid doing that in the module
> itself, and asking us to add an interface for that use case is too
> much for my palate.

I think it would be great if emacs grew a specialized vector-of-bytes type.

BTW, I have already rewritten my mode to not attempt to pass data with
unibyte strings, and to read/write the file in C. So this is no longer
relevant to me personally. But I think other module writers will hit a
similar issue, and it will be good to have something in place for this
use case.




^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] Add a mechanism for passing unibyte strings from lisp to modules.
  2024-06-25 17:36               ` Brennan Vincent
@ 2024-06-26 12:26                 ` Eli Zaretskii
  2024-06-26 12:39                   ` tomas
  0 siblings, 1 reply; 18+ messages in thread
From: Eli Zaretskii @ 2024-06-26 12:26 UTC (permalink / raw)
  To: Brennan Vincent; +Cc: acorallo, stefankangas, emacs-devel

> From: "Brennan Vincent" <brennan@umanwizard.com>
> Cc: stefankangas@gmail.com, emacs-devel@gnu.org
> Date: Tue, 25 Jun 2024 13:36:31 -0400
> 
> Eli Zaretskii <eliz@gnu.org> writes:
> 
> > If we want to allow Emacs to send binary data, I'd rather come up with
> > a specialized interface to do just that.  Explaining the subtleties of
> > using unibyte text in Emacs is a tough job, since it involves a lot of
> > low-level technical details.  When unibyte text comes from encoding
> > human-readable text that is at least justified, since that's what
> > Emacs was designed to d, among other things.  But using Emacs as a
> > handy method of reading binary data, to avoid doing that in the module
> > itself, and asking us to add an interface for that use case is too
> > much for my palate.
> 
> I think it would be great if emacs grew a specialized vector-of-bytes type.

How will it be different from the Lisp vectors we already have?



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] Add a mechanism for passing unibyte strings from lisp to modules.
  2024-06-26 12:26                 ` Eli Zaretskii
@ 2024-06-26 12:39                   ` tomas
  2024-06-26 13:23                     ` Eli Zaretskii
  0 siblings, 1 reply; 18+ messages in thread
From: tomas @ 2024-06-26 12:39 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Brennan Vincent, acorallo, stefankangas, emacs-devel

[-- Attachment #1: Type: text/plain, Size: 275 bytes --]

On Wed, Jun 26, 2024 at 03:26:52PM +0300, Eli Zaretskii wrote:

[...]

> > I think it would be great if emacs grew a specialized vector-of-bytes type.
> 
> How will it be different from the Lisp vectors we already have?

The box around every byte.

Cheers
-- 
t

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] Add a mechanism for passing unibyte strings from lisp to modules.
  2024-06-26 12:39                   ` tomas
@ 2024-06-26 13:23                     ` Eli Zaretskii
  2024-06-26 13:33                       ` tomas
  0 siblings, 1 reply; 18+ messages in thread
From: Eli Zaretskii @ 2024-06-26 13:23 UTC (permalink / raw)
  To: tomas; +Cc: brennan, acorallo, stefankangas, emacs-devel

> Date: Wed, 26 Jun 2024 14:39:30 +0200
> Cc: Brennan Vincent <brennan@umanwizard.com>, acorallo@gnu.org,
> 	stefankangas@gmail.com, emacs-devel@gnu.org
> From:  <tomas@tuxteam.de>
> 
> 
> On Wed, Jun 26, 2024 at 03:26:52PM +0300, Eli Zaretskii wrote:
> 
> [...]
> 
> > > I think it would be great if emacs grew a specialized vector-of-bytes type.
> > 
> > How will it be different from the Lisp vectors we already have?
> 
> The box around every byte.

What box?  Please tell more, as I don't think I follow.



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] Add a mechanism for passing unibyte strings from lisp to modules.
  2024-06-26 13:23                     ` Eli Zaretskii
@ 2024-06-26 13:33                       ` tomas
  2024-06-26 14:32                         ` Brennan Vincent
  2024-06-26 15:34                         ` Eli Zaretskii
  0 siblings, 2 replies; 18+ messages in thread
From: tomas @ 2024-06-26 13:33 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: brennan, acorallo, stefankangas, emacs-devel

[-- Attachment #1: Type: text/plain, Size: 1182 bytes --]

On Wed, Jun 26, 2024 at 04:23:46PM +0300, Eli Zaretskii wrote:
> > Date: Wed, 26 Jun 2024 14:39:30 +0200
> > Cc: Brennan Vincent <brennan@umanwizard.com>, acorallo@gnu.org,
> > 	stefankangas@gmail.com, emacs-devel@gnu.org
> > From:  <tomas@tuxteam.de>
> > 
> > 
> > On Wed, Jun 26, 2024 at 03:26:52PM +0300, Eli Zaretskii wrote:
> > 
> > [...]
> > 
> > > > I think it would be great if emacs grew a specialized vector-of-bytes type.
> > > 
> > > How will it be different from the Lisp vectors we already have?
> > 
> > The box around every byte.
> 
> What box?  Please tell more, as I don't think I follow.

Maybe I'm all wrong, but AFAIU, a vector can contain arbitrary Lisp
values. That makes 64bits/8bits plus boxing/unboxing (which is, I
assume, quick, but nonzero).

Having a specialized "array of bytes" (as there is one for bools)
might be beneficial for big arrays, and perhaps avoid big data moving
operations over the C/LISP fence.

I do understand your reservations, but I do understand the OP's
wish as well :-)

If at all, a "byte array" would be, of course, cleaner than a
unibyte string, with all its implicit magic.

Cheers
-- 
t

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] Add a mechanism for passing unibyte strings from lisp to modules.
  2024-06-26 13:33                       ` tomas
@ 2024-06-26 14:32                         ` Brennan Vincent
  2024-06-26 15:53                           ` Eli Zaretskii
  2024-06-26 15:34                         ` Eli Zaretskii
  1 sibling, 1 reply; 18+ messages in thread
From: Brennan Vincent @ 2024-06-26 14:32 UTC (permalink / raw)
  To: tomas, Eli Zaretskii; +Cc: acorallo, stefankangas, emacs-devel

tomas@tuxteam.de writes:

> On Wed, Jun 26, 2024 at 04:23:46PM +0300, Eli Zaretskii wrote:
>> > Date: Wed, 26 Jun 2024 14:39:30 +0200
>> > Cc: Brennan Vincent <brennan@umanwizard.com>, acorallo@gnu.org,
>> > 	stefankangas@gmail.com, emacs-devel@gnu.org
>> > From:  <tomas@tuxteam.de>
>> > 
>> > 
>> > On Wed, Jun 26, 2024 at 03:26:52PM +0300, Eli Zaretskii wrote:
>> > 
>> > [...]
>> > 
>> > > > I think it would be great if emacs grew a specialized vector-of-bytes type.
>> > > 
>> > > How will it be different from the Lisp vectors we already have?
>> > 
>> > The box around every byte.
>> 
>> What box?  Please tell more, as I don't think I follow.
>
> Maybe I'm all wrong, but AFAIU, a vector can contain arbitrary Lisp
> values. That makes 64bits/8bits plus boxing/unboxing (which is, I
> assume, quick, but nonzero).
>

Yes, this was my reasoning as well.

(setq foo (make-vector 1000000000 #x00))

causes emacs to consume (at least) 8G of RAM, whereas the similar C
code:

#define SZ 1000000000
char *foo = malloc(SZ);
memset(foo, 0, SZ);

only consumes 1G.

> Having a specialized "array of bytes" (as there is one for bools)
> might be beneficial for big arrays, and perhaps avoid big data moving
> operations over the C/LISP fence.
>
> I do understand your reservations, but I do understand the OP's
> wish as well :-)
>
> If at all, a "byte array" would be, of course, cleaner than a
> unibyte string, with all its implicit magic.
>
> Cheers
> -- 
> t




^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] Add a mechanism for passing unibyte strings from lisp to modules.
  2024-06-26 13:33                       ` tomas
  2024-06-26 14:32                         ` Brennan Vincent
@ 2024-06-26 15:34                         ` Eli Zaretskii
  2024-06-27  3:36                           ` Brennan Vincent
  1 sibling, 1 reply; 18+ messages in thread
From: Eli Zaretskii @ 2024-06-26 15:34 UTC (permalink / raw)
  To: tomas; +Cc: brennan, acorallo, stefankangas, emacs-devel

> Date: Wed, 26 Jun 2024 15:33:09 +0200
> From: tomas@tuxteam.de
> Cc: brennan@umanwizard.com, acorallo@gnu.org, stefankangas@gmail.com,
> 	emacs-devel@gnu.org
> 
> > > > How will it be different from the Lisp vectors we already have?
> > > 
> > > The box around every byte.
> > 
> > What box?  Please tell more, as I don't think I follow.
> 
> Maybe I'm all wrong, but AFAIU, a vector can contain arbitrary Lisp
> values. That makes 64bits/8bits plus boxing/unboxing (which is, I
> assume, quick, but nonzero).
> 
> Having a specialized "array of bytes" (as there is one for bools)
> might be beneficial for big arrays, and perhaps avoid big data moving
> operations over the C/LISP fence.

If you are saying that using 64-bit values there incurs a run-time
performance penalty, then accessing bytes does that as well.  Someone
should profile this and present evidence wrt the relative performance
of these, then we can discuss whether the penalty is real and whether
it is worth adding yet another data type to Emacs.



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] Add a mechanism for passing unibyte strings from lisp to modules.
  2024-06-26 14:32                         ` Brennan Vincent
@ 2024-06-26 15:53                           ` Eli Zaretskii
  0 siblings, 0 replies; 18+ messages in thread
From: Eli Zaretskii @ 2024-06-26 15:53 UTC (permalink / raw)
  To: Brennan Vincent; +Cc: tomas, acorallo, stefankangas, emacs-devel

> From: "Brennan Vincent" <brennan@umanwizard.com>
> Cc: acorallo@gnu.org, stefankangas@gmail.com, emacs-devel@gnu.org
> Date: Wed, 26 Jun 2024 10:32:48 -0400
> 
> > Maybe I'm all wrong, but AFAIU, a vector can contain arbitrary Lisp
> > values. That makes 64bits/8bits plus boxing/unboxing (which is, I
> > assume, quick, but nonzero).
> >
> 
> Yes, this was my reasoning as well.
> 
> (setq foo (make-vector 1000000000 #x00))
> 
> causes emacs to consume (at least) 8G of RAM, whereas the similar C
> code:
> 
> #define SZ 1000000000
> char *foo = malloc(SZ);
> memset(foo, 0, SZ);
> 
> only consumes 1G.

Why is that a problem?  (It's a good-faith question; I know that 8GB
is 8 times 1GB, but that cannot be the answer, when long-running Emacs
sessions get to several GB memory footprint already, and no one is
complaining.  And that's even before we ask why would you need 1
billion bytes in an application that reads executable files.)



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] Add a mechanism for passing unibyte strings from lisp to modules.
  2024-06-26 15:34                         ` Eli Zaretskii
@ 2024-06-27  3:36                           ` Brennan Vincent
  2024-06-27  6:05                             ` Eli Zaretskii
  0 siblings, 1 reply; 18+ messages in thread
From: Brennan Vincent @ 2024-06-27  3:36 UTC (permalink / raw)
  To: Eli Zaretskii, tomas; +Cc: acorallo, stefankangas, emacs-devel

Eli Zaretskii <eliz@gnu.org> writes:

>> Date: Wed, 26 Jun 2024 15:33:09 +0200
>> From: tomas@tuxteam.de
>> Cc: brennan@umanwizard.com, acorallo@gnu.org, stefankangas@gmail.com,
>> 	emacs-devel@gnu.org
>> 
>> > > > How will it be different from the Lisp vectors we already have?
>> > > 
>> > > The box around every byte.
>> > 
>> > What box?  Please tell more, as I don't think I follow.
>> 
>> Maybe I'm all wrong, but AFAIU, a vector can contain arbitrary Lisp
>> values. That makes 64bits/8bits plus boxing/unboxing (which is, I
>> assume, quick, but nonzero).
>> 
>> Having a specialized "array of bytes" (as there is one for bools)
>> might be beneficial for big arrays, and perhaps avoid big data moving
>> operations over the C/LISP fence.
>
> If you are saying that using 64-bit values there incurs a run-time
> performance penalty, then accessing bytes does that as well.  Someone
> should profile this and present evidence wrt the relative performance
> of these, then we can discuss whether the penalty is real and whether
> it is worth adding yet another data type to Emacs.

Sure, I wrote a quick benchmark that passes a 10MB buffer to a module
which just sums the bytes and returns and integer. It is about 200x
faster using a unibyte string (with my original patch) than a vector.

C code:

// Compile with gcc -O3 -fPIC -shared -o test-module.so test.c

#include <emacs-module.h>
#include <stdlib.h>

int plugin_is_GPL_compatible;

static emacs_value
Fcall_test(emacs_env *env, ptrdiff_t nargs, emacs_value args[], void *) EMACS_NOEXCEPT
{
    unsigned char sum = 0;
    emacs_value vec = args[0];
    size_t sz = env->vec_size(env, vec);
    for (int i = 0; i < sz; ++i)
         sum += env->extract_integer(env, env->vec_get(env, vec, i));
    return env->make_integer(env, sum);
}

static emacs_value
Fcall_test2(emacs_env *env, ptrdiff_t nargs, emacs_value args[], void *) EMACS_NOEXCEPT
{
    unsigned char sum = 0;
    emacs_value arr = args[0];
    char *buf;
    ptrdiff_t sz = 0;
    env->copy_unibyte_string_contents(env, arr, NULL, &sz);
    buf = malloc(sz);
    env->copy_unibyte_string_contents(env, arr, buf, &sz);
    for (int i = 0; i < sz - 1; ++i)
         sum += buf[i];
    return env->make_integer(env, sum);
}

/* bind c_func (native) to e_func (elisp) */
static void
bind(emacs_env *env, emacs_value (*c_func) (emacs_env *env,
                                            ptrdiff_t nargs,
                                            emacs_value args[],
                                            void *) EMACS_NOEXCEPT,
     const char *e_func,
     ptrdiff_t min_arity,
     ptrdiff_t max_arity,
     const char *doc,
     void *data)
{
    emacs_value fset_args[2];
    
    fset_args[0] = env->intern(env, e_func);
    fset_args[1] = env->make_function(env, min_arity, max_arity, c_func, doc, data);
    env->funcall(env, env->intern(env, "fset"), 2, fset_args);
}

int
emacs_module_init(struct emacs_runtime *ert)
{
    emacs_env *env = ert->get_environment(ert); 
    
    bind(env,
         Fcall_test, "btv--test", 1, 1,
         "test using vector",
         NULL);

    bind(env,
         Fcall_test2, "btv--test2", 1, 1,
         "test using byte array",
         NULL);

    emacs_value provide_arg = env->intern(env, "test-module");
    env->funcall(env, env->intern(env, "provide"), 1, &provide_arg);
    return 0;
}


Elisp code:

(require 'test-module)
(require 'benchmark)

(setq v (make-vector 10000001 37))
(setq v2 (make-string 10000001 37))

`(,(benchmark-elapse (btv--test v))
  ,(benchmark-elapse (btv--test2 v2)))


Result of evaluating elisp code:

(0.17861138 0.000805208)




^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] Add a mechanism for passing unibyte strings from lisp to modules.
  2024-06-27  3:36                           ` Brennan Vincent
@ 2024-06-27  6:05                             ` Eli Zaretskii
  0 siblings, 0 replies; 18+ messages in thread
From: Eli Zaretskii @ 2024-06-27  6:05 UTC (permalink / raw)
  To: Brennan Vincent; +Cc: tomas, acorallo, stefankangas, emacs-devel

> From: "Brennan Vincent" <brennan@umanwizard.com>
> Cc: acorallo@gnu.org, stefankangas@gmail.com, emacs-devel@gnu.org
> Date: Wed, 26 Jun 2024 23:36:16 -0400
> 
> Eli Zaretskii <eliz@gnu.org> writes:
> 
> >> Having a specialized "array of bytes" (as there is one for bools)
> >> might be beneficial for big arrays, and perhaps avoid big data moving
> >> operations over the C/LISP fence.
> >
> > If you are saying that using 64-bit values there incurs a run-time
> > performance penalty, then accessing bytes does that as well.  Someone
> > should profile this and present evidence wrt the relative performance
> > of these, then we can discuss whether the penalty is real and whether
> > it is worth adding yet another data type to Emacs.
> 
> Sure, I wrote a quick benchmark that passes a 10MB buffer to a module
> which just sums the bytes and returns and integer. It is about 200x
> faster using a unibyte string (with my original patch) than a vector.

That's not an interesting benchmark: you are doing everything in C.
Emacs modules are intended to allow doing stuff in Lisp, not in C.
Writing a program that compares memcpy against a loop that extracts
values from an array and then calls a function to assign those values,
is a strawman, IMO.

In addition, the measurements I was talking about were not to compare
strings vs vectors, they were to compare the vectors we already have
in Emacs Lisp with the proposed "vectors of bytes".  That was Tomas's
proposal.  Your benchmark doesn't make that comparison.



^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2024-06-27  6:05 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-06-21 18:13 [PATCH] Add a mechanism for passing unibyte strings from lisp to modules Brennan Vincent
2024-06-21 18:13 ` Brennan Vincent
2024-06-21 19:08 ` Eli Zaretskii
2024-06-21 20:14   ` Brennan Vincent
2024-06-22  6:50     ` Eli Zaretskii
     [not found]       ` <87o77t6lyn.fsf@taipei.mail-host-address-is-not-set>
2024-06-22 16:12         ` Eli Zaretskii
2024-06-23 21:15           ` Andrea Corallo
2024-06-24 11:45             ` Eli Zaretskii
2024-06-25 17:36               ` Brennan Vincent
2024-06-26 12:26                 ` Eli Zaretskii
2024-06-26 12:39                   ` tomas
2024-06-26 13:23                     ` Eli Zaretskii
2024-06-26 13:33                       ` tomas
2024-06-26 14:32                         ` Brennan Vincent
2024-06-26 15:53                           ` Eli Zaretskii
2024-06-26 15:34                         ` Eli Zaretskii
2024-06-27  3:36                           ` Brennan Vincent
2024-06-27  6:05                             ` Eli Zaretskii

Code repositories for project(s) associated with this external index

	https://git.savannah.gnu.org/cgit/emacs.git
	https://git.savannah.gnu.org/cgit/emacs/org-mode.git

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.