From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.org!not-for-mail
From: "Stephen J. Turnbull" <stephen@xemacs.org>
Newsgroups: gmane.emacs.devel
Subject: Re: size of emacs executable after unicode merge
Date: Sat, 17 May 2008 07:07:58 +0900
Message-ID: <873aohucld.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <umymtvaq8.fsf@boris.laptop>
	<200805140351.m4E3pQuE004549@sallyv1.ics.uci.edu>
	<E1JwK16-0003kN-Iu@fencepost.gnu.org>
	<200805141652.m4EGqikr018644@sallyv1.ics.uci.edu>
	<E1JweIE-0005uV-PE@fencepost.gnu.org>
	<200805151529.m4FFTlF1004684@sallyv1.ics.uci.edu>
	<E1Jwy9X-00016Z-4m@fencepost.gnu.org>
	<E1Jwz6m-0006Ec-9i@etlken.m17n.org> <482D8435.6060407@gnu.org>
	<482DAF4B.60900@emf.net>
NNTP-Posting-Host: lo.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Trace: ger.gmane.org 1210975047 28555 80.91.229.12 (16 May 2008 21:57:27 GMT)
X-Complaints-To: usenet@ger.gmane.org
NNTP-Posting-Date: Fri, 16 May 2008 21:57:27 +0000 (UTC)
Cc: rms@gnu.org, Kenichi Handa <handa@m17n.org>, emacs-devel@gnu.org,
	dann@ics.uci.edu, evilborisnet@netscape.net, Jason Rumney <jasonr@gnu.org>
To: Thomas Lord <lord@emf.net>
Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Fri May 16 23:58:02 2008
Return-path: <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>
Envelope-to: ged-emacs-devel@m.gmane.org
Original-Received: from lists.gnu.org ([199.232.76.165])
	by lo.gmane.org with esmtp (Exim 4.50)
	id 1Jx7vm-00027B-PT
	for ged-emacs-devel@m.gmane.org; Fri, 16 May 2008 23:57:39 +0200
Original-Received: from localhost ([127.0.0.1]:38866 helo=lists.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43)
	id 1Jx7v3-0000Bq-EC
	for ged-emacs-devel@m.gmane.org; Fri, 16 May 2008 17:56:53 -0400
Original-Received: from mailman by lists.gnu.org with tmda-scanned (Exim 4.43)
	id 1Jx7uz-0000Bb-1F
	for emacs-devel@gnu.org; Fri, 16 May 2008 17:56:49 -0400
Original-Received: from exim by lists.gnu.org with spam-scanned (Exim 4.43)
	id 1Jx7uw-0000BP-I1
	for emacs-devel@gnu.org; Fri, 16 May 2008 17:56:47 -0400
Original-Received: from [199.232.76.173] (port=59223 helo=monty-python.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43) id 1Jx7uw-0000BM-Ci
	for emacs-devel@gnu.org; Fri, 16 May 2008 17:56:46 -0400
Original-Received: from mtps01.sk.tsukuba.ac.jp ([130.158.97.223]:51999)
	by monty-python.gnu.org with esmtp (Exim 4.60)
	(envelope-from <stephen@xemacs.org>)
	id 1Jx7ug-00053a-M1; Fri, 16 May 2008 17:56:31 -0400
Original-Received: from uwakimon.sk.tsukuba.ac.jp (uwakimon.sk.tsukuba.ac.jp
	[130.158.99.156])
	by mtps01.sk.tsukuba.ac.jp (Postfix) with ESMTP id E84271535AC;
	Sat, 17 May 2008 06:56:28 +0900 (JST)
Original-Received: by uwakimon.sk.tsukuba.ac.jp (Postfix, from userid 1000)
	id B49331A25C3; Sat, 17 May 2008 07:07:58 +0900 (JST)
In-Reply-To: <482DAF4B.60900@emf.net>
X-Mailer: VM ?bug? under XEmacs 21.5.21 (x86_64-unknown-linux)
X-detected-kernel: by monty-python.gnu.org: Linux 2.6, seldom 2.4 (older, 4)
X-BeenThere: emacs-devel@gnu.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: "Emacs development discussions." <emacs-devel.gnu.org>
List-Unsubscribe: <http://lists.gnu.org/mailman/listinfo/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=unsubscribe>
List-Archive: <http://lists.gnu.org/pipermail/emacs-devel>
List-Post: <mailto:emacs-devel@gnu.org>
List-Help: <mailto:emacs-devel-request@gnu.org?subject=help>
List-Subscribe: <http://lists.gnu.org/mailman/listinfo/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=subscribe>
Original-Sender: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Xref: news.gmane.org gmane.emacs.devel:97294
Archived-At: <http://permalink.gmane.org/gmane.emacs.devel/97294>

Thomas Lord writes:

 > Jason Rumney wrote:
 > > How big are the data structures holding all the unicode character info 
 > > and translation tables for encodings? 

Is it possible that the whole Unicode range (17*2^16 code points) is
being dumped?  That would lead to about the size change observed,
extrapolating from my "naive estimate" for the XEmacs implementation of
the BMP given below.  But surely no characters outside of the BMP are
needed to dump Emacs.

 > If that turns out to be the problem, will someone please contact me 
 > directly?
 > (I ask that because I mostly just skim this list and so miss things.)
 > 
 > Several years back I devoted a pretty decent number of hours to working
 > out good ways to compress the run-time representation of such tables
 > without sacrificing much performance on accesses.

Loading on demand is generally a better solution, as most non-Asians
use fewer than 500 characters, concentrated in about 3 ranges that
can be loaded individually.
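
To make "loading on demand" concrete, here is a minimal sketch in
Python (chosen only for brevity; it is not how Emacs or XEmacs does
it).  The 256-entry block size, the LazyCharTable class, and
fake_loader are all hypothetical names and choices for illustration;
a real loader would read one block of a table from disk.

```python
from typing import Callable, Dict, List

BLOCK_SIZE = 256  # code points per block; an assumption for this sketch


class LazyCharTable:
    """Per-character table whose blocks are loaded on first access."""

    def __init__(self, load_block: Callable[[int], List[int]]) -> None:
        self._load_block = load_block
        self._blocks: Dict[int, List[int]] = {}

    def lookup(self, code_point: int) -> int:
        index, offset = divmod(code_point, BLOCK_SIZE)
        block = self._blocks.get(index)
        if block is None:
            # First touch of this block: load it and keep it around.
            block = self._load_block(index)
            self._blocks[index] = block
        return block[offset]

    def blocks_loaded(self) -> int:
        return len(self._blocks)


def fake_loader(index: int) -> List[int]:
    # Hypothetical stand-in for reading one block from disk;
    # it simply returns the identity mapping for that block.
    base = index * BLOCK_SIZE
    return list(range(base, base + BLOCK_SIZE))


table = LazyCharTable(fake_loader)
# Latin text with accents, a curly apostrophe, and a few Greek letters.
for ch in "na\u00efve caf\u00e9 \u2019 \u03b1\u03b2\u03b3":
    table.lookup(ord(ch))
print(table.blocks_loaded())
```

Only the blocks actually touched by the text are ever resident, which
is the whole point: typical Western text stays within a handful of
blocks.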

Nor do you really need "good solutions", as half of the BMP is hanzi
and Hangul, which are basically constant ranges for the character info
tables, and another 10% is private space and surrogates, leading to
approximately 60% savings by using ranges and appropriate defaults for
these four classes.  The non-BMP planes surely can be loaded on-demand.
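
As a rough check of the 60% figure, and a sketch of what "ranges and
appropriate defaults" could look like, consider the following Python.
The range boundaries are the Unicode block boundaries (so they include
a few unassigned code points at the end of the Hangul block); the
labels, the lookup function, and the "unassigned" default are
assumptions made up for this example.

```python
from bisect import bisect_right

# (first, last, label) for the four classes, by Unicode block boundary.
UNIFORM_RANGES = [
    (0x4E00, 0x9FFF, "hanzi"),        # CJK Unified Ideographs
    (0xAC00, 0xD7AF, "hangul"),       # Hangul Syllables
    (0xD800, 0xDFFF, "surrogate"),    # Surrogates
    (0xE000, 0xF8FF, "private-use"),  # Private Use Area
]
STARTS = [first for first, _, _ in UNIFORM_RANGES]


def lookup(code_point, explicit):
    """Return the range default if one applies, else the explicit entry."""
    i = bisect_right(STARTS, code_point) - 1
    if i >= 0:
        first, last, label = UNIFORM_RANGES[i]
        if code_point <= last:
            return label
    # Everything else needs a per-character entry (hypothetical dict here).
    return explicit.get(code_point, "unassigned")


covered = sum(last - first + 1 for first, last, _ in UNIFORM_RANGES)
share = covered / 0x10000
print(covered, round(share, 2))
```

The four ranges cover 40624 of the 65536 BMP code points, so four
table entries stand in for roughly 62% of a flat per-character table.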

 > If it would be helpful,

Did you do much better than 60% savings?  If not, it's probably not
really worth much effort, given that an efficient range table
representation is already available.  In any case, something else is
going on here besides naive representation (assuming we're restricted
to the BMP).

In XEmacs, where all coding tables for the BMP are loaded by default,
much more naive strategies than those outlined above give 891800 bytes
total for the to-unicode and from-unicode tables.  I think we're
missing a couple of charsets that Emacs Mule provides, but they're
minor.  We don't currently implement the Unidata database, but most (all?)
of the character properties can be compactly represented as a small
number of Booleans each, so a table of bitvectors for the BMP "should"
only be about 256KB or maybe 512KB.  IIRC XEmacs/UTF-2000 implemented
the BMP Unidata as a Lisp array of Lisp bitvectors in about 1MB (most
of which is Lisp object overhead).

In other words, even with a naive strategy, the Unicode BMP database
should only add about 1.1MB to 1.4MB, i.e., about 10% of the size
increase seen here, if coded compactly but straightforwardly in C.
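
The arithmetic behind those figures can be spelled out as below.  The
891800 bytes is the XEmacs number quoted above; reading "256KB or
maybe 512KB" as 32 or 64 Boolean properties per BMP character is my
interpretation of the estimate, not a measured value, and the helper
name is made up for this sketch.

```python
BMP_CODE_POINTS = 2 ** 16        # one plane
ALL_CODE_POINTS = 17 * 2 ** 16   # the whole Unicode range
CODING_TABLE_BYTES = 891_800     # to-unicode plus from-unicode tables


def bitvector_table_bytes(bits_per_char):
    """Size of a flat BMP table holding bits_per_char Booleans per entry."""
    return BMP_CODE_POINTS * bits_per_char // 8


# Assumption: 32 or 64 Boolean properties per character.
low = CODING_TABLE_BYTES + bitvector_table_bytes(32)
high = CODING_TABLE_BYTES + bitvector_table_bytes(64)

MB = 2 ** 20
print(ALL_CODE_POINTS)
print(bitvector_table_bytes(32) // 1024, bitvector_table_bytes(64) // 1024)
print(round(low / MB, 1), round(high / MB, 1))
```

With 4 bytes of properties per character the bitvector table is 256KB,
with 8 bytes it is 512KB, and adding the coding tables gives totals of
1153944 and 1416088 bytes respectively.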

A few straightforward optimizations can probably get that down to
500KB to 700KB, and for an on-demand setup, most Western users should
only see a footprint of about 10-15KB for Unicode data, if that.