Hi Eli,

Eli Zaretskii <eliz@gnu.org> writes:

>> From: Robert Pluim <rpluim@gmail.com>
>> Cc: Po Lu <luangruo@yahoo.com>
>> Date: Thu, 30 Mar 2023 11:34:42 +0200
>> 
>> Fstring_lessp has:
>> 
>> /* Check whether the platform allows access to unaligned addresses for
>>    size_t integers without trapping or undue penalty (a few cycles is OK).
>> 
>>    This whitelist is incomplete but since it is only used to improve
>>    performance, omitting cases is safe.  */
>> #if defined __x86_64__|| defined __amd64__	\
>>     || defined __i386__ || defined __i386	\
>>     || defined __arm64__ || defined __aarch64__	\
>>     || defined __powerpc__ || defined __powerpc	\
>>     || defined __ppc__ || defined __ppc		\
>>     || defined __s390__ || defined __s390x__
>> #define HAVE_FAST_UNALIGNED_ACCESS 1
>> #else
>> #define HAVE_FAST_UNALIGNED_ACCESS 0
>> #endif
>> 
>> but even if unaligned access is normally permitted by a machine, it is
>> still undefined behavior to dereference an unaligned pointer.
>
> This is incorrect.  There's nothing undefined about x86 unaligned
> accesses.  C standards can regard this as UB, but we are using
> machine-specific knowledge here

You're making a faulty assumption here: there's no guarantee that such
an access happens at all.

You're, of course, right in that an x86 CPU will have no (visible)
qualms about making such a mov, but you're also assuming that the
compiler emits a mov.  This is not guaranteed anywhere, and guaranteeing
so would be terrible for optimization in general.

For instance, the compiler is free to vectorize a loop, emitting
instructions that do check alignment even on x86, such as movdqa or
movaps (the loop in question is very much vectorizable, as it reads
like a textbook example of such operations).

> (and Emacs cannot be built with a strict adherence to C standards
> anyway).

That is indeed correct; the question, however, is whether deviating from
the standard is necessary here (I argue it is not, for the reasons
presented below).

>> Instead, HAVE_FAST_UNALIGNED_ACCESS and UNALIGNED_LOAD_SIZE should be
>> removed and memcpy used instead:
>> 
>>   word_t a, c;
>> 
>>   memcpy (&a, w1 + b / ws, sizeof a);
>>   memcpy (&c, w2 + b / ws, sizeof c);
>> 
>> doing so will make the compiler itself generate the right sequence of
>> instructions for performing unaligned accesses, normally with only a few
>> cycles penalty.
>
> We don't want that penalty here, that's all.

At any optimization level above -O0, you don't get one (on x86_64).  I
haven't checked -O0, as it's not worth using (one should rather use
-O2/-O3/-Og/-Oz).

>> I would like to install such a change on emacs-29.
>
> No, please don't.
>
>> Emacs currently crashes when built with various compilers performing
>> pointer alignment checks.
>
> Details, please.  Which compilers, on what platforms, for what target
> architectures, etc.

Sam presented a decent example (though, sanitizers seem to have been
taken into account in this particular example).

> Unconditionally removing the fast copy there is a non-starter.

You're assuming that alternatives to these "fast" accesses are slow -
they are not.  The following code...

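  #include <string.h>

  /* Direct dereference: undefined if X is not suitably aligned.  */
  int
  f_broken (void* x)
  {
      return *((int*)x);
  }

  /* memcpy load: well-defined for any alignment of X.  */
  int
  f (void* x)
  {
      int v;
      memcpy (&v, x, sizeof (v));
      return v;
  }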

... generates the following code on gcc 12.2.0 with -O1...

  f_broken:
          movl    (%rdi), %eax
          ret
  f:
          movl    (%rdi), %eax
          ret

As a matter of fact, implementing a "skip common prefix" loop with just
chars results in /shorter/ code on the same compiler (and does not
violate aliasing rules, since the data FAM is a char one).  Some other
portable methods could include Duff's device (using memcpy loads), or
word-size memcmp calls in a loop.
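
To illustrate the chars-only variant, here is a minimal sketch.  The
function name and signature are hypothetical (they are not taken from
the Emacs sources), and I am not claiming this is the exact loop I
compiled above; it only shows the shape of such a loop.

```c
#include <stddef.h>

/* Hypothetical helper, not Emacs code: return the length of the
   common prefix of the N-byte buffers S1 and S2, comparing one byte
   at a time.  Only char accesses are made, so there are no alignment
   or aliasing concerns.  */
size_t
common_prefix_bytes (const char *s1, const char *s2, size_t n)
{
  size_t i = 0;
  while (i < n && s1[i] == s2[i])
    i++;
  return i;
}
```

Since the loop carries no alignment assumptions, the compiler remains
free to widen or vectorize it as it sees fit.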

IMO, it is quite a fault in the compiler if Emacs needs to resort to
such hacks (and even if we accept that as something that is our problem,
we should have an abstraction boundary on it).

Note that I did not try hacking Emacs code to benchmark the actual thing
being discussed (as I am not in a position to do so conveniently at the
moment), but I invite you to try that and reconsider removing such code.
Even if there is a penalty to this change, I'd argue it is far better
for us to fix that in GCC or implement a "skip common prefix" function
in Gnulib (so that it's behind a layer of abstraction) rather than
placing this assumption implicitly in this function.

I suspect the least intrusive change possible, namely loading the words
with memcpy rather than through direct dereferences, would emit the same
code as the current implementation, except in the cases where the
current code is entirely broken and the correct code isn't.
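
As a rough sketch of what loading the words with memcpy could look like
(the names common_prefix_words and word_t, and the use of size_t as the
word type, are my assumptions for illustration and not the actual
Fstring_lessp code):

```c
#include <stddef.h>
#include <string.h>

/* Assumed word type for this sketch.  */
typedef size_t word_t;

/* Hypothetical helper, not Emacs code: return the length of the
   common prefix of the N-byte buffers S1 and S2.  Whole words are
   loaded with memcpy, which is well-defined for any alignment.  */
size_t
common_prefix_words (const char *s1, const char *s2, size_t n)
{
  const size_t ws = sizeof (word_t);
  size_t b = 0;

  /* Skip whole words for as long as they compare equal.  */
  while (n - b >= ws)
    {
      word_t a, c;
      memcpy (&a, s1 + b, sizeof a);
      memcpy (&c, s2 + b, sizeof c);
      if (a != c)
        break;
      b += ws;
    }

  /* Finish byte by byte: this finds the first differing byte inside
     a mismatching word and also handles the tail shorter than a
     word.  */
  while (b < n && s1[b] == s2[b])
    b++;
  return b;
}
```

On targets where unaligned loads are cheap, I would expect the memcpy
calls to compile down to plain loads, as in the f example above, but
that should be verified on the compilers we care about.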

Thanks in advance, have a lovely day.
-- 
Arsen Arsenović