From: "Stephen J. Turnbull"
Newsgroups: gmane.emacs.devel
Subject: Re: size of emacs executable after unicode merge
Date: Sat, 17 May 2008 07:07:58 +0900
Message-ID: <873aohucld.fsf@uwakimon.sk.tsukuba.ac.jp>
In-Reply-To: <482DAF4B.60900@emf.net>
To: Thomas Lord
Cc: rms@gnu.org, Kenichi Handa, emacs-devel@gnu.org, dann@ics.uci.edu, evilborisnet@netscape.net, Jason Rumney

Thomas Lord writes:

 > Jason Rumney wrote:

 > > How big are the data structures holding all the unicode character info
 > > and translation tables for encodings?

Is it possible that the whole Unicode range (17*2^16 code points) is
being dumped?  That would lead to about the size change observed,
extrapolating from my "naive estimate" for the XEmacs implementation
of the BMP given below.  But surely no characters outside of the BMP
are needed to dump Emacs.

 > If that turns out to be the problem, will someone please contact me
 > directly?
 > (I ask that because I mostly just skim this list and so miss things.)
 >
 > Several years back I devoted a pretty decent number of hours to working
 > out good ways to compress the run-time representation of such tables
 > without sacrificing much performance on accesses.

Loading on demand is generally a better solution, as most non-Asians
use fewer than 500 characters, highly localized to about 3 ranges that
can be loaded individually.  Nor do you really need "good solutions",
as half of the BMP is hanzi and Hangul, which are basically constant
ranges for the character info tables, and another 10% is private space
and surrogates, leading to approximately 60% savings by using ranges
and appropriate defaults for these four classes.  The non-BMP planes
surely can be loaded on demand.

 > If it would be helpful,

Did you do much better than 60% savings?  If not, it's probably not
really worth much effort, given that an efficient range table
representation is already available.

In any case, something else is going on here besides naive
representation (assuming we're restricted to the BMP).  In XEmacs,
where all coding tables for the BMP are loaded by default, much more
naive strategies than those outlined above give 891800 bytes total for
the to-unicode and from-unicode tables.  I think we're missing a
couple of charsets that Emacs Mule provides, but they're minor.

We don't currently implement the Unidata base, but most (all?) of the
character properties can be compactly represented as a small number of
Booleans each, so a table of bitvectors for the BMP "should" only be
about 256KB or maybe 512KB.  IIRC XEmacs/UTF-2000 implemented the BMP
Unidata as a Lisp array of Lisp bitvectors in about 1MB (most of which
is Lisp object overhead).

In other words, even with a naive strategy, the Unicode BMP database
should only add about 1.1MB to 1.4MB, i.e., about 10% of the size
increase seen here, if coded compactly but straightforwardly in C.  A
few straightforward optimizations can probably get that down to 500KB
to 700KB, and for an on-demand setup, most Western users should only
see a footprint of about 10-15KB for Unicode data, if that.
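
To make the ranges-plus-defaults idea and the bitvector arithmetic
above concrete, here is a minimal Python sketch.  It is not the XEmacs
or Emacs code.  The block boundaries are the standard Unicode block
ranges for CJK Unified Ideographs, Hangul Syllables, surrogates and
the BMP private use area; the names (RANGES, lookup, covered,
flat_table_bytes), the string labels, and the choice of 32 or 64
property bits per character are assumptions made for illustration.

```python
from bisect import bisect_right

BMP_SIZE = 1 << 16  # 65536 code points in the Basic Multilingual Plane

# (first, last, default) triples, sorted by first code point.
# The labels are hypothetical stand-ins for a shared property record.
RANGES = [
    (0x4E00, 0x9FFF, "hanzi"),      # CJK Unified Ideographs block
    (0xAC00, 0xD7AF, "hangul"),     # Hangul Syllables block
    (0xD800, 0xDFFF, "surrogate"),  # high and low surrogates
    (0xE000, 0xF8FF, "private"),    # BMP private use area
]
STARTS = [first for first, _, _ in RANGES]


def lookup(code_point, explicit):
    """Return the range default if code_point falls in a constant
    range, otherwise whatever the explicit per-character table holds
    (None if the character has no entry)."""
    i = bisect_right(STARTS, code_point) - 1
    if i >= 0 and code_point <= RANGES[i][1]:
        return RANGES[i][2]
    return explicit.get(code_point)


def covered():
    """Number of BMP code points that need no per-character entry."""
    return sum(last - first + 1 for first, last, _ in RANGES)


def flat_table_bytes(bits_per_char, code_points=BMP_SIZE):
    """Size in bytes of a flat table with a fixed-width bitvector
    per code point."""
    return code_points * bits_per_char // 8


if __name__ == "__main__":
    print(covered(), round(covered() / BMP_SIZE, 2))
    print(flat_table_bytes(32), flat_table_bytes(64))
```

With these four blocks, 40624 of the 65536 BMP code points are handled
by a default, roughly 62%, which is in line with the "approximately
60%" estimate.  A flat table at 32 bits per character comes to 262144
bytes (256KB) and at 64 bits to 524288 bytes (512KB); the same
function with code_points=17 * BMP_SIZE shows how much larger a flat
table over all 17 planes would be.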