From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.org!not-for-mail
From: vedm <ns@nospam.com>
Newsgroups: gmane.emacs.help
Subject: Re: Emacs in xterm and Cyrillic?
Date: 13 Apr 2005 21:47:17 -0400
Message-ID: <861x9eruxm.fsf@localhost.localdomain>
References: <86zmwfv3hj.fsf@localhost.localdomain>
	<mailman.78.1112634873.2895.help-gnu-emacs@gnu.org>
	<86oecs7kvx.fsf@localhost.localdomain>
	<mailman.354.1112774023.2895.help-gnu-emacs@gnu.org>
	<86is2v78hs.fsf@localhost.localdomain>
	<mailman.1082.1113240243.2895.help-gnu-emacs@gnu.org>
NNTP-Posting-Host: main.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Trace: sea.gmane.org 1113443263 22997 80.91.229.2 (14 Apr 2005 01:47:43 GMT)
X-Complaints-To: usenet@sea.gmane.org
NNTP-Posting-Date: Thu, 14 Apr 2005 01:47:43 +0000 (UTC)
Original-X-From: help-gnu-emacs-bounces+geh-help-gnu-emacs=m.gmane.org@gnu.org Thu Apr 14 03:47:37 2005
Return-path: <help-gnu-emacs-bounces+geh-help-gnu-emacs=m.gmane.org@gnu.org>
Original-Received: from lists.gnu.org ([199.232.76.165])
	by ciao.gmane.org with esmtp (Exim 4.43)
	id 1DLtS5-0003sU-EX
	for geh-help-gnu-emacs@m.gmane.org; Thu, 14 Apr 2005 03:47:30 +0200
Original-Received: from localhost ([127.0.0.1] helo=lists.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43)
	id 1DLtVF-0008Iw-Md
	for geh-help-gnu-emacs@m.gmane.org; Wed, 13 Apr 2005 21:50:45 -0400
Original-Path: shelby.stanford.edu!newsfeed.stanford.edu!postnews.google.com!news4.google.com!news.glorb.com!border1.nntp.dca.giganews.com!nntp.giganews.com!local01.nntp.dca.giganews.com!nntp.rogers.com!news.rogers.com.POSTED!not-for-mail
Original-NNTP-Posting-Date: Wed, 13 Apr 2005 20:47:12 -0500
Original-Newsgroups: gnu.emacs.help
User-Agent: Gnus/5.09 (Gnus v5.9.0) Emacs/21.4
Original-Lines: 63
Original-NNTP-Posting-Host: 70.24.147.173
Original-X-Trace: sv3-Jzho7cm4B7RpqAj++f3s7da8LjKP+OCd6iZLu1x109gOifdsDkObvL+/BH00OweYDjIBzcYokvjJDe6!BRUBcj2HfaNfDU5K9Cb5ic4MBk7vp57YU6KDgGAxnEoDpJ5+JR4lJsvhUrYTiM6t9pcF214t7Xt2
Original-X-Complaints-To: abuse@rogers.com
X-DMCA-Complaints-To: abuse@rogers.com
X-Abuse-and-DMCA-Info: Please be sure to forward a copy of ALL headers
X-Abuse-and-DMCA-Info: Otherwise we will be unable to process your complaint
	properly
X-Postfilter: 1.3.32
Original-Xref: shelby.stanford.edu gnu.emacs.help:130107
Original-To: help-gnu-emacs@gnu.org
X-BeenThere: help-gnu-emacs@gnu.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Users list for the GNU Emacs text editor <help-gnu-emacs.gnu.org>
List-Unsubscribe: <http://lists.gnu.org/mailman/listinfo/help-gnu-emacs>,
	<mailto:help-gnu-emacs-request@gnu.org?subject=unsubscribe>
List-Archive: <http://lists.gnu.org/pipermail/help-gnu-emacs>
List-Post: <mailto:help-gnu-emacs@gnu.org>
List-Help: <mailto:help-gnu-emacs-request@gnu.org?subject=help>
List-Subscribe: <http://lists.gnu.org/mailman/listinfo/help-gnu-emacs>,
	<mailto:help-gnu-emacs-request@gnu.org?subject=subscribe>
Original-Sender: help-gnu-emacs-bounces+geh-help-gnu-emacs=m.gmane.org@gnu.org
Errors-To: help-gnu-emacs-bounces+geh-help-gnu-emacs=m.gmane.org@gnu.org
Xref: news.gmane.org gmane.emacs.help:25674
X-Report-Spam: http://spam.gmane.org/gmane.emacs.help:25674

Kevin Rodgers <ihs_4664@yahoo.com> writes:

> vedm wrote:
>  > My cyrillic files are encoded in iso8859-5, just because that encoding
>  > is within the ASCII set and is enough for the cyrilic script. Yes, I
>  > agree that UTF is better for handling all sorts of languages, but I
>  > still haven't tried to use it in emacs and Xterm. (One disadvantage of
>  > UTF is that the UTF files (at least cyrillic files) are almost two times
>  > bigger compared to ASCII encoded files).
> 
> That can't be the case, because Cyrillic characters can't even be
> represented in ASCII.  

The first half of this statement is true, the second one is not. When
you say "That can't be the case" I assume you mean that iso-8859-5 is
not ASCII, and that is true. Until now I thought of the ISO-8859 family
of encodings as "8-bit ASCII". But I have now done some research and
found this good review of character codes:
http://www.cs.tut.fi/~jkorpela/chars.html
As it says:

<quote>

The misnomer "8-bit ASCII"

...ASCII is strictly and unambiguously a 7-bit code in the sense that
all code positions are in the range 0 - 127.

It (the term "8-bit ASCII") is a misnomer used to refer to various
character codes which are extensions of ASCII in the following sense:
the character repertoire contains ASCII as a subset, the code numbers
are in the range 0 - 255, and the code numbers of ASCII characters equal
their ASCII codes.

</quote>
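
The "extension of ASCII" property described in the quote is easy to
check for iso-8859-5. Here is a minimal sketch in Python; the sample
strings are my own, and the Cyrillic word is written with \u escapes
only to keep this message in plain ASCII:

```python
# Sketch: check that ISO-8859-5 extends ASCII in the sense quoted above.
ascii_text = "Emacs in xterm"
# The Russian word "privet" (hello), spelled with Unicode escapes.
cyrillic_text = "\u041f\u0440\u0438\u0432\u0435\u0442"

# ASCII characters keep their ASCII code numbers in ISO-8859-5.
same_bytes = ascii_text.encode("iso-8859-5") == ascii_text.encode("ascii")

# Cyrillic letters land in the upper half (128 - 255) of the 8-bit range.
cyrillic_bytes = cyrillic_text.encode("iso-8859-5")
all_high = all(b >= 128 for b in cyrillic_bytes)

# ASCII proper (the 7-bit code) has no positions for these letters.
try:
    cyrillic_text.encode("ascii")
    ascii_can_encode = True
except UnicodeEncodeError:
    ascii_can_encode = False

print(same_bytes, all_high, ascii_can_encode)
```

So the 8-bit encoding agrees with ASCII on positions 0 - 127 and puts
the Cyrillic letters above that, which is exactly why calling it
"ASCII" is a misnomer.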

Now, the second part of your statement is that "Cyrillic characters
can't even be represented in ASCII". But the Cyrillic alphabet consists
of about 30 letters (Bulgarian - 30, Russian - 33), and a 7-bit code
like ASCII has 128 positions, which is clearly more than enough to
encode 30 letters (or 60, for upper and lower case).

In fact, the first Cyrillic encodings used a 7-bit char-set. A good
discussion of the Cyrillic character sets can be found here:
http://czyborra.com/charsets/cyrillic.html

> It is true that your Cyrillic files will be
> encoded in ISO-8859-5 with just 1 byte per character, whereas the
> Cyrillic characters require 2 bytes in UTF-8 (I don't know about
> UTF-16).  But the actual size of the UTF-8 files will depend on how many
> Cyrillic vs. ASCII characters are present, since the ASCII characters
> are still represented as a single byte.

My Cyrillic files consist entirely of Cyrillic characters (excluding
special characters like spaces, newlines, dots, etc.), so they are
invariably almost two times bigger when encoded in UTF-8. And these
files are meant to be on my web server: so if they are UTF-8, my server
would have to send almost double the data for each page... unless there
is some compression trick.


-- 
vedm