unofficial mirror of help-gnu-emacs@gnu.org
From: vedm <ns@nospam.com>
Subject: Re: Emacs in xterm and Cyrillic?
Date: 13 Apr 2005 21:47:17 -0400
Message-ID: <861x9eruxm.fsf@localhost.localdomain>
In-Reply-To: mailman.1082.1113240243.2895.help-gnu-emacs@gnu.org

Kevin Rodgers <ihs_4664@yahoo.com> writes:

> vedm wrote:
>  > My cyrillic files are encoded in iso8859-5, just because that encoding
>  > is within the ASCII set and is enough for the cyrilic script. Yes, I
>  > agree that UTF is better for handling all sorts of languages, but I
>  > still haven't tried to use it in emacs and Xterm. (One disadvantage of
>  > UTF is that the UTF files (at least cyrillic files) are almost two times
>  > bigger compared to ASCII encoded files).
> 
> That can't be the case, because Cyrillic characters can't even be
> represented in ASCII.  

The first half of this statement is true; the second is not. By
"That can't be the case" I assume you mean that iso-8859-5 is not
ASCII, and that is true. Until now I thought of the ISO-8859 family
of encodings as "8-bit ASCII". But I have now done some research and
found this good review of character codes:
http://www.cs.tut.fi/~jkorpela/chars.html
As it says:

<quote>

The misnomer "8-bit ASCII"

...ASCII is strictly and unambiguously a 7-bit code in the sense that
all code positions are in the range 0 - 127.

It (the term "8-bit ASCII") is a misnomer used to refer to various
character codes which are extensions of ASCII in the following sense:
the character repertoire contains ASCII as a subset, the code numbers
are in the range 0 - 255, and the code numbers of ASCII characters equal
their ASCII codes.

</quote>
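
The "extension of ASCII" property described in the quote can be checked
directly. The following is only a minimal sketch using Python's built-in
codecs; the helper names is_ascii_extension and can_encode are made up
for this illustration and do not come from the article.

```python
def is_ascii_extension(codec: str) -> bool:
    """Return True if every 7-bit code (0-127) decodes, under `codec`,
    to the character with the same code number as in ASCII."""
    try:
        return all(bytes([code]).decode(codec) == chr(code)
                   for code in range(128))
    except UnicodeDecodeError:
        # A single byte is not a valid sequence in this codec at all.
        return False


def can_encode(text: str, codec: str) -> bool:
    """Return True if `text` can be represented in `codec`."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    letter = "\u0410"  # CYRILLIC CAPITAL LETTER A
    print(is_ascii_extension("iso8859_5"))   # ASCII subset is preserved
    print(can_encode(letter, "ascii"))       # not representable in ASCII
    print(letter.encode("iso8859_5"))        # one byte, above 127
```

In ISO-8859-5 the Cyrillic letters all sit in the upper half of the
table (code numbers above 127), which is exactly why the encoding
contains ASCII as a subset but is not itself ASCII.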

Now, the second part of your statement is that "Cyrillic characters
can't even be represented in ASCII". But the Cyrillic alphabet consists
of about 30 letters (Bulgarian: 30, Russian: 33), and the 7-bit ASCII
code has 128 positions, which is clearly more than enough to encode 30
letters (or 60, counting upper and lower case).

In fact, the first Cyrillic encodings used a 7-bit char-set. A good
discussion of the Cyrillic character sets can be found here:
http://czyborra.com/charsets/cyrillic.html

> It is true that your Cyrillic files will be
> encoded in ISO-8859-5 with just 1 byte per character, whereas the
> Cyrillic characters require 2 bytes in UTF-8 (I don't know about
> UTF-16).  But the actual size of the UTF-8 files will depend on how many
> Cyrillic vs. ASCII characters are present, since the ASCII characters
> are still represented as a single byte.

My Cyrillic files consist almost entirely of Cyrillic characters (apart
from spaces, newlines, periods, etc.), so they are invariably almost
twice as big when encoded in UTF-8. These files are meant to be served
from my web server, so if they are in UTF-8 the server would have to
send nearly double the data for each page... unless there is some
compression trick.
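
The size difference, and the effect of compression on it, can be
measured rather than guessed. This is a rough sketch, not a benchmark:
size_report is a hypothetical helper, the sample phrase is arbitrary,
and zlib (deflate) merely stands in for whatever compression a web
server might apply, assuming it applies any.

```python
import zlib


def size_report(text: str) -> dict:
    """Return byte counts for `text` in ISO-8859-5 and UTF-8,
    both raw and after deflate compression at the highest level."""
    iso = text.encode("iso8859_5")
    utf8 = text.encode("utf-8")
    return {
        "iso8859_5": len(iso),
        "utf8": len(utf8),
        "iso8859_5_deflate": len(zlib.compress(iso, 9)),
        "utf8_deflate": len(zlib.compress(utf8, 9)),
    }


if __name__ == "__main__":
    # 9 Cyrillic letters plus a comma and a space.
    sample = "\u041f\u0440\u0438\u0432\u0435\u0442, \u043c\u0438\u0440"
    for name, size in size_report(sample).items():
        print(f"{name}: {size} bytes")
```

For the short sample the raw sizes are 11 bytes in ISO-8859-5 and 20
bytes in UTF-8, since each Cyrillic letter takes two bytes and the
comma and space take one each. To see what compression does for real
pages, pass the contents of an actual file to size_report; very short
or artificially repetitive input gives misleading compressed sizes.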


-- 
vedm
