From: Philipp Stephani <p.stephani2@gmail.com>
To: Eli Zaretskii <eliz@gnu.org>
Cc: emacs-devel@gnu.org
Subject: Re: String encoding in json.c
Date: Sat, 23 Dec 2017 15:31:06 +0000
Message-ID: <CAArVCkQCbuE4o_oYyXRc-vjS0ppHLW5cL_wRLMzhT+iFqYUZRA@mail.gmail.com>
In-Reply-To: <83tvwhjyi5.fsf@gnu.org>

Eli Zaretskii <eliz@gnu.org> wrote on Sat, 23 Dec 2017 at 15:43:

> > From: Philipp Stephani <p.stephani2@gmail.com>
> > Date: Sat, 23 Dec 2017 14:26:09 +0000
> >
> > I've benchmarked serialization and parsing of JSON with and without
> > explicit encoding. I've found that leaving out the coding makes both
> > operations significantly faster – from a speedup of a factor of
> > 1.11 ± 0.06 for parsing canada.json to 1.57 ± 0.08 for serializing
> > twitter.json. Other speedups are in between, but the speedup is
> > always significant (to at least one standard deviation). All unit
> > tests pass when leaving out the coding steps – which isn't surprising
> > given that currently the coding operations are expensive no-ops.
>
> The coding operations are "expensive no-ops" except when they aren't,
> and that is exactly when we need their "expensive" parts.
>

In which case are they not no-ops? I've spot-checked some of the
implementation details of coding.c, and I haven't found obvious cases where
they are not no-ops. Emacs appears to use the obvious extension of UTF-8
for integers that are not Unicode scalar values, and that's even documented
in character.h and the Elisp reference manual. Using utf-8-unix as encoding
seems to keep the encoding intact.
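
To make "the obvious extension of UTF-8" concrete, here is a minimal sketch of such an encoder. It is an illustration only, not code from character.h: the name `ext_utf8_encode` is hypothetical, and the five-byte upper bound is an assumption about what "extension" means here. The point it shows is that for Unicode scalar values the output is byte-for-byte standard UTF-8, while other integers (for example surrogates) still get a well-defined byte sequence.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not Emacs code: encode integer C using the
   UTF-8 bit layout, extended to five bytes and without rejecting
   surrogates.  OUT must have room for 5 bytes.  Returns the number
   of bytes written, or 0 if C does not fit in five bytes.  */
size_t
ext_utf8_encode (uint32_t c, unsigned char *out)
{
  if (c < 0x80)
    {
      out[0] = (unsigned char) c;
      return 1;
    }
  if (c < 0x800)
    {
      out[0] = (unsigned char) (0xC0 | (c >> 6));
      out[1] = (unsigned char) (0x80 | (c & 0x3F));
      return 2;
    }
  if (c < 0x10000)
    {
      /* Includes U+D800..U+DFFF, which strict UTF-8 forbids.  */
      out[0] = (unsigned char) (0xE0 | (c >> 12));
      out[1] = (unsigned char) (0x80 | ((c >> 6) & 0x3F));
      out[2] = (unsigned char) (0x80 | (c & 0x3F));
      return 3;
    }
  if (c < 0x200000)
    {
      /* Includes values above U+10FFFF, also outside strict UTF-8.  */
      out[0] = (unsigned char) (0xF0 | (c >> 18));
      out[1] = (unsigned char) (0x80 | ((c >> 12) & 0x3F));
      out[2] = (unsigned char) (0x80 | ((c >> 6) & 0x3F));
      out[3] = (unsigned char) (0x80 | (c & 0x3F));
      return 4;
    }
  if (c < 0x4000000)
    {
      /* Five-byte form: assumed extension, never valid UTF-8.  */
      out[0] = (unsigned char) (0xF8 | (c >> 24));
      out[1] = (unsigned char) (0x80 | ((c >> 18) & 0x3F));
      out[2] = (unsigned char) (0x80 | ((c >> 12) & 0x3F));
      out[3] = (unsigned char) (0x80 | ((c >> 6) & 0x3F));
      out[4] = (unsigned char) (0x80 | (c & 0x3F));
      return 5;
    }
  return 0;
}
```

With this sketch, U+20AC encodes to the usual three bytes E2 82 AC, and the surrogate U+D800 encodes to ED A0 80, which a strict UTF-8 decoder would reject. The exact ranges Emacs accepts, and how it represents raw bytes, are defined in character.h and may differ from this sketch.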


>
> > Therefore I'd suggest to document the internal string encoding in
> > lisp.h or character.h and remove the explicit coding in json.c and
> > emacs-module.c. It's very unlikely that the internal string encoding
> > will change frequently, and if so, the unit tests should catch
> > potential issues caused by that.
>
> As I've already said, I don't think this particular case should be an
> exception wrt how Emacs behaves with external strings everywhere
> else.  We suffer similar slow-downs in those other places as well, and
> IMO this is a small penalty to pay for making sure our objects are
> valid and won't crash Emacs.
>

I've spot-checked some other code where we interface with external
libraries, namely dbusbind.c and gnutls.c. In neither have I found explicit
coding operations (except for filenames, where the situation is different);
these files always use SDATA directly. dbusbind.c even has the comment

  /* We need to send a valid UTF-8 string.  We could encode `object'
     but by not encoding it, we guarantee it's valid utf-8, even if
     it contains eight-bit-bytes.  Of course, you can still send
     manually-crafted junk by passing a unibyte string.  */

So not only do we not encode strings explicitly, we even *prefer* not
encoding them, and we do rely on the internal string encoding being an
extension of UTF-8. It's the *current* json.c (and emacs-module.c) that's
inconsistent with the rest of the codebase.
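
Relying on the internal representation being an extension of UTF-8 means the bytes behind SDATA are strict UTF-8 exactly when the string holds only Unicode scalar values. A minimal sketch of a checker that a caller could use to spot-check this for a given buffer follows. The name `strict_utf8_p` is hypothetical and the function is not part of Emacs; it simply applies the standard UTF-8 well-formedness rules (no overlong forms, no surrogates, nothing above U+10FFFF).

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, not Emacs code: return true if the N bytes
   at S are well-formed strict UTF-8.  */
bool
strict_utf8_p (const unsigned char *s, size_t n)
{
  size_t i = 0;
  while (i < n)
    {
      unsigned char b = s[i];
      size_t len;
      unsigned long c, min;

      if (b < 0x80)
        {
          i++;
          continue;
        }
      else if ((b & 0xE0) == 0xC0)
        len = 2, c = b & 0x1F, min = 0x80;
      else if ((b & 0xF0) == 0xE0)
        len = 3, c = b & 0x0F, min = 0x800;
      else if ((b & 0xF8) == 0xF0)
        len = 4, c = b & 0x07, min = 0x10000;
      else
        return false;           /* Stray continuation or 5+ byte lead.  */

      if (n - i < len)
        return false;           /* Truncated sequence.  */

      for (size_t k = 1; k < len; k++)
        {
          if ((s[i + k] & 0xC0) != 0x80)
            return false;       /* Not a continuation byte.  */
          c = (c << 6) | (s[i + k] & 0x3F);
        }

      /* Reject overlong forms, surrogates, and out-of-range values.  */
      if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        return false;

      i += len;
    }
  return true;
}
```

A buffer for which this returns true is one where passing the bytes through unchanged and encoding them as UTF-8 should give the same result; a buffer for which it returns false (for example one containing an encoded surrogate or an overlong two-byte form) is where the two approaches can differ from the consumer's point of view.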
