From: vedm
Newsgroups: gmane.emacs.help
Subject: Re: Emacs in xterm and Cyrillic?
Date: 13 Apr 2005 21:47:17 -0400
Message-ID: <861x9eruxm.fsf@localhost.localdomain>
References: <86zmwfv3hj.fsf@localhost.localdomain> <86oecs7kvx.fsf@localhost.localdomain> <86is2v78hs.fsf@localhost.localdomain>
User-Agent: Gnus/5.09 (Gnus v5.9.0) Emacs/21.4

Kevin Rodgers writes:

> vedm wrote:
> > My cyrillic files are encoded in iso8859-5, just because that encoding
> > is within the ASCII set and is enough for the cyrillic script. Yes, I
> > agree that UTF is better for handling all sorts of languages, but I
> > still haven't tried to use it in emacs and Xterm. (One disadvantage of
> > UTF is that the UTF files (at least cyrillic files) are almost two times
> > bigger compared to ASCII encoded files).
>
> That can't be the case, because Cyrillic characters can't even be
> represented in ASCII.

The first half of this statement is true, the second one is not.

When you say "That can't be the case" I assume you mean that iso-8859-5
is not ASCII, and that is true. Until now I thought of the ISO-8859
family of encodings as "8-bit ASCII". But I have now done some research
and found this good review of character codes:
http://www.cs.tut.fi/~jkorpela/chars.html. As it says:

    The misnomer "8-bit ASCII"

    ...ASCII is strictly and unambiguously a 7-bit code in the sense
    that all code positions are in the range 0 - 127. It (the term
    "8-bit ASCII") is a misnomer used to refer to various character
    codes which are extensions of ASCII in the following sense: the
    character repertoire contains ASCII as a subset, the code numbers
    are in the range 0 - 255, and the code numbers of ASCII characters
    equal their ASCII codes.
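
The "extension of ASCII" property described in that quote can be checked directly. What follows is only a minimal sketch in Python (the language is my choice, nothing in the thread uses it), relying on the standard `iso8859_5` and `ascii` codecs. The sample strings and the helper `can_encode` are made up for illustration.

```python
# Illustrative sample strings (assumptions, not taken from the thread).
ascii_text = "Emacs in xterm"
cyrillic_text = "Привет"


def can_encode(text, codec):
    """Return True if every character of text has a code in codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# ASCII characters keep their ASCII code numbers in ISO-8859-5.
same_bytes = ascii_text.encode("iso8859_5") == ascii_text.encode("ascii")

# Cyrillic letters sit in the upper half (128-255) of ISO-8859-5,
# i.e. outside the 0-127 range that ASCII defines.
cyrillic_bytes = cyrillic_text.encode("iso8859_5")
all_high = all(b >= 128 for b in cyrillic_bytes)

print("ASCII text identical in both encodings:", same_bytes)
print("Cyrillic bytes all >= 128:", all_high)
print("Cyrillic encodable as strict ASCII:", can_encode(cyrillic_text, "ascii"))
```

So ISO-8859-5 contains ASCII as a subset, but the Cyrillic letters themselves use code numbers that strict 7-bit ASCII does not have.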

Now, the second part of your statement is that "Cyrillic characters
can't even be represented in ASCII". But the Cyrillic alphabet consists
of about 30 letters (Bulgarian - 30, Russian - 33), and the 7-bit ASCII
code has 128 positions, which is clearly more than enough to encode 30
letters (or 60, for upper and lower case). In fact, the first Cyrillic
encodings used a 7-bit character set. A good discussion of the Cyrillic
character sets can be found here:
http://czyborra.com/charsets/cyrillic.html

> It is true that your Cyrillic files will be
> encoded in ISO-8859-5 with just 1 byte per character, whereas the
> Cyrillic characters require 2 bytes in UTF-8 (I don't know about
> UTF-16). But the actual size of the UTF-8 files will depend on how many
> Cyrillic vs. ASCII characters are present, since the ASCII characters
> are still represented as a single byte.

My Cyrillic files consist entirely of Cyrillic characters (excluding
special characters like spaces, newlines, dots etc.), so they are
invariably almost two times bigger when encoded in UTF-8. And these
files are meant to be on my web server: so if they are in UTF-8, my
server would have to send roughly double the data for each page...
unless there is some compression trick.

-- 
vedm
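
The size argument above can be tried out with a few lines of Python. This is a rough sketch under stated assumptions: the sample sentence is invented, the `iso8859_5` codec and the `gzip` module come from the Python standard library, and gzip is only one possible "compression trick". The sample is a single sentence repeated many times, so it compresses far better than real pages would; treat the compressed sizes as an indication only.

```python
import gzip

# Invented sample: mostly Cyrillic letters plus a few ASCII spaces and dots.
# Repeating one sentence makes it unrealistically compressible.
sample = "Это пример страницы, написанной кириллицей. " * 200

iso_bytes = sample.encode("iso8859_5")   # 1 byte per character
utf8_bytes = sample.encode("utf-8")      # 2 bytes per Cyrillic letter, 1 per ASCII

# Ratio of raw sizes: close to 2 when the text is almost all Cyrillic.
ratio = len(utf8_bytes) / len(iso_bytes)

# Sizes after gzip compression, as a web server might send them.
iso_gz = gzip.compress(iso_bytes)
utf8_gz = gzip.compress(utf8_bytes)

print("ISO-8859-5 bytes:", len(iso_bytes))
print("UTF-8 bytes:     ", len(utf8_bytes))
print("UTF-8 / ISO ratio: %.2f" % ratio)
print("gzip ISO-8859-5: ", len(iso_gz))
print("gzip UTF-8:      ", len(utf8_gz))
```

The raw ratio stays below 2 because spaces and punctuation are still one byte each in UTF-8. How much of the gap survives compression depends on the actual text, so it is worth measuring on real pages.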