From: "Stephen J. Turnbull" <stephen@xemacs.org>
To: Thomas Lord <lord@emf.net>
Cc: rms@gnu.org, Kenichi Handa <handa@m17n.org>,
emacs-devel@gnu.org, dann@ics.uci.edu, evilborisnet@netscape.net,
Jason Rumney <jasonr@gnu.org>
Subject: Re: size of emacs executable after unicode merge
Date: Sat, 17 May 2008 07:07:58 +0900
Message-ID: <873aohucld.fsf@uwakimon.sk.tsukuba.ac.jp>
In-Reply-To: <482DAF4B.60900@emf.net>
Thomas Lord writes:
> Jason Rumney wrote:
> > How big are the data structures holding all the unicode character info
> > and translation tables for encodings?
Is it possible that the whole Unicode range (17*2^16 code points) is
being dumped? That would lead to about the size change observed,
extrapolating from my "naive estimate" for the XEmacs implementation of
the BMP given below. But surely no characters outside of the BMP are
needed to dump Emacs.
> If that turns out to be the problem, will someone please contact me
> directly?
> (I ask that because I mostly just skim this list and so miss things.)
>
> Several years back I devoted a pretty decent number of hours to working
> out good ways to compress the run-time representation of such tables
> without sacrificing much performance on accesses.
Loading on demand is generally a better solution, as most non-Asians
use fewer than 500 characters, highly localized to about 3 ranges that
can be loaded individually.
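A minimal sketch of the on-demand idea, assuming the table is split into fixed blocks of 256 code points that are fetched on first access. The block size, the `LazyCharTable` class, and the `demo_loader` function are hypothetical illustrations, not anything from Emacs or XEmacs.

```python
from typing import Callable, Dict, List

BLOCK_BITS = 8                  # assumed granularity: 256 code points per block
BLOCK_SIZE = 1 << BLOCK_BITS


class LazyCharTable:
    """Per-character table whose blocks are loaded on first access."""

    def __init__(self, load_block: Callable[[int], List[int]]):
        self._load_block = load_block
        self._blocks: Dict[int, List[int]] = {}

    def get(self, code_point: int) -> int:
        block_index = code_point >> BLOCK_BITS
        block = self._blocks.get(block_index)
        if block is None:
            # First touch of this block: load it and cache it.
            block = self._load_block(block_index)
            self._blocks[block_index] = block
        return block[code_point & (BLOCK_SIZE - 1)]

    def loaded_blocks(self) -> int:
        return len(self._blocks)


def demo_loader(block_index: int) -> List[int]:
    # Hypothetical loader: a real one would read the block from a data file.
    base = block_index << BLOCK_BITS
    return [base + offset for offset in range(BLOCK_SIZE)]
```

With this layout, a user who only ever touches a few ranges only pays for the blocks covering those ranges; everything else stays unloaded.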
Nor do you really need "good solutions", as half of the BMP is hanzi
and Hangul which are basically constant ranges for the character info
tables, and another 10% is private space and surrogates, leading to
approximately 60% savings by using ranges and appropriate defaults for
these four classes. The non-BMP planes surely can be loaded on-demand.
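The range-plus-default approach can be sketched as follows. The ranges are the standard Unicode block boundaries for the four classes named above; the property labels, the `lookup` function, and the `explicit` dictionary standing in for a full per-character table are assumptions made for illustration.

```python
from bisect import bisect_right
from typing import Dict

# (first, last, default property), sorted by first code point.
RANGES = [
    (0x3400, 0x4DBF, "hanzi"),        # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FFF, "hanzi"),        # CJK Unified Ideographs
    (0xAC00, 0xD7AF, "hangul"),       # Hangul Syllables
    (0xD800, 0xDFFF, "surrogate"),    # surrogates
    (0xE000, 0xF8FF, "private-use"),  # Private Use Area
]
STARTS = [first for first, _, _ in RANGES]


def lookup(code_point: int, explicit: Dict[int, str]) -> str:
    """Return the range default if one applies, else the explicit entry."""
    i = bisect_right(STARTS, code_point) - 1
    if i >= 0 and code_point <= RANGES[i][1]:
        return RANGES[i][2]
    # Only characters outside the constant ranges need a stored entry.
    return explicit.get(code_point, "unassigned")
```

Five tuples replace tens of thousands of identical per-character entries, and a lookup costs one binary search before falling back to the explicit table.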
> If it would be helpful,
Did you do much better than 60% savings? If not, it's probably not
really worth much effort, given that an efficient range table
representation is already available. In any case, something else is
going on here besides naive representation (assuming we're restricted
to the BMP).
In XEmacs, where all coding tables for the BMP are loaded by default,
much more naive strategies than those outlined above give 891800 bytes
total for the to-unicode and from-unicode tables. I think we're
missing a couple of charsets that Emacs Mule provides, but they're
minor. We don't currently implement the Unidata base, but most (all?)
of the character properties can be compactly represented as a small
number of Booleans each, so a table of bitvectors for the BMP "should"
only be about 256KB or maybe 512KB. IIRC XEmacs/UTF-2000 implemented
the BMP Unidata as a Lisp array of Lisp bitvectors in about 1MB (most
of which is Lisp object overhead).
In other words, even with a naive strategy, the Unicode BMP database
should only add about 1.1MB to 1.4MB, i.e., about 10% of the size
increase seen here, if coded compactly but straightforwardly in C.
A few straightforward optimizations can probably get that down to
500KB to 700KB, and for an on-demand setup, most Western users should
only see a footprint of about 10-15KB for Unicode data, if that.
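The arithmetic behind the 1.1MB to 1.4MB estimate can be reproduced with a short sketch. The 891800-byte figure is the one quoted above; reading "256KB or maybe 512KB" as 32 or 64 property bits per BMP character is an inference from those sizes, and the function names are illustrative.

```python
BMP_CODE_POINTS = 1 << 16                 # 65536 code points in the BMP
ALL_CODE_POINTS = 17 * BMP_CODE_POINTS    # all 17 planes
CODING_TABLE_BYTES = 891_800              # to-unicode + from-unicode tables


def bitvector_table_bytes(bits_per_char: int,
                          code_points: int = BMP_CODE_POINTS) -> int:
    """Size of a flat table holding one fixed-width bitvector per character."""
    return code_points * bits_per_char // 8


def total_bytes(bits_per_char: int) -> int:
    """Coding tables plus the property bitvector table for the BMP."""
    return CODING_TABLE_BYTES + bitvector_table_bytes(bits_per_char)


for bits in (32, 64):
    print(bits, bitvector_table_bytes(bits), total_bytes(bits))
```

At 32 bits per character the property table is 262144 bytes and the total is 1153944 bytes; at 64 bits it is 524288 bytes and the total is 1416088 bytes, which brackets the range given above.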