unofficial mirror of emacs-devel@gnu.org 
 help / color / mirror / code / Atom feed
From: Eli Zaretskii <eliz@gnu.org>
To: Camm Maguire <camm@maguirefamily.org>
Cc: gcl-devel@gnu.org, emacs-devel@gnu.org
Subject: Re: utf8 and emacs text/string multibyte representation
Date: Wed, 29 Oct 2014 18:19:07 +0200	[thread overview]
Message-ID: <83bnou26is.fsf@gnu.org> (raw)
In-Reply-To: <87oasu3m72.fsf@maguirefamily.org>

> From: Camm Maguire <camm@maguirefamily.org>
> Cc: emacs-devel@gnu.org,  gcl-devel@gnu.org
> Date: Wed, 29 Oct 2014 11:55:13 -0400
> 
> I thought there would be a little more on the upside, say some benefit
> from having the internal representation be the same as that used in many
> external representations, at least on linux

Yes, that too.  Emacs originally used a very different internal
encoding (ISO-2022 based), and the switch to UTF-8 based was due to
the above.  In general, having a Unicode basis works better when you
want to support various Unicode defined features, like the UCA etc.

> and perhaps some algorithm coalescing with straightforward byte-wise
> operations.

Not sure what you mean here, please elaborate.  In general, many
operations with UTF-8 strings can use the usual string library
functions, as you probably know very well.

> Does every string access in emacs proceed through the utf8 decoder?

If you need to look at the character, yes.  E.g., if you need some
property of the character, you need to index the appropriate table by
that character's codepoint.  But in most operations that is not
needed.  You just need to recognize several specific characters, like
the null character, the slash, etc., most of which are ASCII.

> >> A cached internal pointer storing the last referenced codepoint
> >> offset makes access essentially O(1).
> >
> > We indeed maintain a cache for byte-to-character and character-to-byte
> > conversions.
> 
> How big is this cache?

Its size is dynamic, and depends on how frequently the conversion is
needed in places that are far away.  The cache stores byte-to-char
correspondence in places that are far away, and Emacs uses binary
search in between them.

> >> Yet setting string elements can trigger reallocations/memmove
> >> operations.
> >
> > Emacs, as every editor, needs to handle this efficiently anyway,
> > because editing operations rarely leave the buffer size unchanged.  So
> > Emacs uses a gap to minimize reallocations.
> >
> 
> But no gap in strings, right (i.e. just buffers)?

Right.



  reply	other threads:[~2014-10-29 16:19 UTC|newest]

Thread overview: 136+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-10-28 22:33 Referring to revisions in the git future Alan Mackenzie
2014-10-28 22:54 ` Óscar Fuentes
2014-10-28 23:05   ` Alan Mackenzie
2014-10-28 23:24     ` Óscar Fuentes
2014-10-31 22:47     ` Paul Eggert
2014-10-29  0:49 ` Eric S. Raymond
2014-10-29  3:38   ` Stephen J. Turnbull
2014-10-29 12:26     ` Stefan Monnier
2014-10-29 12:41       ` Alexander Baier
2014-10-29 14:52   ` Barry Warsaw
2014-10-29 15:01     ` David Kastrup
2014-10-29 15:06       ` Eric S. Raymond
2014-10-29 18:12         ` Barry Warsaw
2014-10-29 22:09           ` Lars Magne Ingebrigtsen
2014-10-29 22:29             ` Eric S. Raymond
2014-10-29 23:31               ` Paul Eggert
2014-10-30  0:01                 ` Nic Ferrier
2014-10-30  1:53                 ` Stefan Monnier
2014-10-30  2:10                   ` Eric S. Raymond
2014-10-30  2:13                     ` Paul Eggert
2014-10-30  2:48                       ` Eric S. Raymond
2014-10-30  2:25                     ` Glenn Morris
2014-10-30 10:10                       ` David Kastrup
2014-10-30 13:03                         ` Stefan Monnier
2014-10-30 13:40                           ` David Kastrup
2014-10-30 14:00                             ` Stefan Monnier
2014-10-30 13:02                     ` Stefan Monnier
2014-10-30 15:12                       ` Eric S. Raymond
2014-10-30 16:49                         ` Stefan Monnier
2014-10-30  6:46                 ` Jan Djärv
2014-10-30  7:36                   ` Ivan Shmakov
2014-10-30  8:09                     ` Jan Djärv
2014-10-30  8:31                     ` Eric S. Raymond
2014-10-30  9:53                       ` Andreas Schwab
2014-10-30 10:13                         ` Eric S. Raymond
2014-10-30 10:32                           ` Andreas Schwab
2014-10-30 11:13                             ` Nicolas Richard
2014-10-30 10:12                       ` David Kastrup
2014-10-30 13:29                       ` Stefan Monnier
2014-10-30 15:33                         ` DVCS design philosophy Eric S. Raymond
2014-10-30 16:59                           ` Stefan Monnier
2014-10-30 17:41                             ` Eric S. Raymond
2014-10-31 20:18                         ` Referring to revisions in the git future Nicolas Richard
2014-10-31 21:11                           ` Stefan Monnier
2014-11-01  1:44                             ` Stephen J. Turnbull
2014-11-01  7:58                             ` David Kastrup
2014-10-30 14:20                       ` Barry Warsaw
2014-11-01  1:23                         ` Stephen J. Turnbull
2014-10-30 15:52                     ` Eli Zaretskii
2014-10-30  3:32           ` Stephen J. Turnbull
2014-10-30  4:35             ` Barry Warsaw
2014-10-30  5:24               ` Stephen J. Turnbull
2014-10-30 10:17               ` David Kastrup
2014-10-30 13:42               ` Alex Bennée
2014-10-30 13:19             ` Stefan Monnier
2014-10-31  6:36               ` Stephen J. Turnbull
2014-10-31 19:42               ` David Kastrup
2014-11-01  1:34                 ` Stephen J. Turnbull
2014-11-01  7:05                   ` Tassilo Horn
2014-11-01  7:09                     ` Dima Kogan
2014-11-01  7:28                     ` Paul Eggert
2014-11-01  7:49                     ` David Kastrup
2014-11-01  9:46                       ` Alan Mackenzie
2014-11-01 10:13                         ` Eli Zaretskii
2014-11-01 11:33                           ` Alan Mackenzie
2014-11-01 13:06                             ` Eli Zaretskii
2014-11-01 13:21                               ` Alan Mackenzie
2014-11-01 10:29                         ` David Kastrup
2014-11-01 11:29                           ` Alan Mackenzie
2014-11-01 11:57                             ` David Kastrup
2014-11-01 12:29                               ` Alan Mackenzie
2014-11-01 12:47                                 ` Ivan Shmakov
2014-11-01 13:46                                   ` Alan Mackenzie
2014-11-01 18:58                                     ` Stephen J. Turnbull
2014-11-01 12:49                                 ` David Kastrup
2014-10-29  1:11 ` Stefan Monnier
2014-10-29  6:06   ` Werner LEMBERG
2014-10-29  9:01     ` David Kastrup
2014-10-29  8:50 ` David Kastrup
2014-10-29  9:52   ` Eric S. Raymond
2014-10-29 11:00     ` David Kastrup
2014-10-29 14:32       ` Eli Zaretskii
2014-10-29 14:35         ` David Kastrup
2014-10-29 14:55           ` Eli Zaretskii
2014-10-30  4:44             ` Richard Stallman
2014-10-30  8:32               ` Eric S. Raymond
2014-10-30 10:25                 ` David Kastrup
2014-10-30 11:51                   ` Eric S. Raymond
2014-10-30 12:14                     ` David Kastrup
2014-10-30 15:01                       ` Eric S. Raymond
2014-10-30 15:53                 ` Eli Zaretskii
2014-10-30 15:56                   ` Eric S. Raymond
2014-10-30 16:44                     ` Eli Zaretskii
2014-10-31  7:47                 ` Richard Stallman
2014-10-31  8:17                   ` Eli Zaretskii
2014-10-31 10:21                   ` Eric S. Raymond
2014-10-29 12:35     ` Stefan Monnier
2014-10-29 13:00       ` Jose E. Marchesi
2014-10-29 13:59         ` Stefan Monnier
2014-10-29 14:39           ` Eric S. Raymond
2014-10-29 14:46             ` Rasmus
2014-10-29 14:52               ` Eric S. Raymond
2014-10-30  0:58               ` Rob Browning
2014-10-29 15:27             ` Stefan Monnier
2014-10-29 14:04         ` utf8 and emacs text/string multibyte representation Camm Maguire
2014-10-29 14:51           ` Eli Zaretskii
2014-10-29 15:55             ` Camm Maguire
2014-10-29 16:19               ` Eli Zaretskii [this message]
2014-10-30 14:13                 ` Camm Maguire
2014-10-30 16:06                   ` Eli Zaretskii
2014-10-30 16:27                     ` Camm Maguire
2014-10-30 16:35                       ` Eli Zaretskii
2014-10-31 18:05                         ` Camm Maguire
2014-11-01  9:01                           ` Eli Zaretskii
2014-11-01 18:32                             ` Stephen J. Turnbull
2014-11-01 18:41                               ` David Kastrup
2014-11-01 19:09                                 ` Stephen J. Turnbull
2014-11-02  0:56                                 ` Stefan Monnier
2014-11-01  1:16                         ` Stephen J. Turnbull
2014-10-29 16:45             ` Stefan Monnier
2014-10-29 15:56           ` Raymond Toy
2014-10-30 14:16             ` Camm Maguire
2014-10-31 18:47               ` Sam Steingold
2014-10-31 21:00                 ` Andreas Schwab
2014-10-31 19:52               ` [Gcl-devel] " Stefan Monnier
2014-10-30  3:08           ` Stephen J. Turnbull
2014-10-29 13:26       ` Referring to revisions in the git future Eric S. Raymond
2014-10-29 14:04         ` Stefan Monnier
2014-10-29 14:49           ` Eric S. Raymond
2014-10-30  2:43           ` Stephen J. Turnbull
2014-10-29 13:08     ` Jan Djärv
2014-10-29 13:27       ` Eric S. Raymond
2014-10-29 13:49         ` Eric S. Raymond
2014-10-29 18:03           ` Jan Djärv
2014-10-29 11:18   ` Alan Mackenzie
2014-10-29 11:37     ` David Kastrup

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: https://www.gnu.org/software/emacs/

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=83bnou26is.fsf@gnu.org \
    --to=eliz@gnu.org \
    --cc=camm@maguirefamily.org \
    --cc=emacs-devel@gnu.org \
    --cc=gcl-devel@gnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/emacs.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).