From: "B. T. Raven" <nihil@nihilo.net>
To: help-gnu-emacs@gnu.org
Subject: Re: utf8 char display in buffer
Date: Fri, 12 Jun 2009 15:56:51 -0500
Message-ID: <Bf-dnQHS37iJXK_XnZ2dnUVZ_rqdnZ2d@sysmatrix.net>
In-Reply-To: <f80a38de-2aa0-4d2d-955b-9f9cb35809bb@s1g2000prd.googlegroups.com>

Xah Lee wrote:
> On Jun 12, 7:54 am, ken <geb...@mousecar.com> wrote:
>> B) It would be helpful if the code which does the decoding of a file and
>> renders it into the buffer display, if that part of it would throw an
>> error message when it encounters a character it doesn't know how to
>> display, i.e., when a little box character is displayed. After all,
>> isn't it an error when a little box is displayed in lieu of the correct
>> character? Possible error messages would be something like: "decoding
>> process can't find /path/to/charset.file" or "decoding process doesn't
>> have requisite permission to read /path/to/charset.file" or "invalid
>> character: [hex/decimal value]" or other.
> 
> Some of the reasoning above is not correct.
> 
> In general, a program just reads a text file as a byte stream and
> uses an encoding scheme to interpret it; the program has little way
> to determine if the encoding is correct. Theoretically, it could check
> against common phrases, but that is generally not done by the software
> we use daily. (Some programs do scan the text to guess an encoding,
> but the guess is not always correct.)
> 
> Here are some general technical issues and experiences about using
> foreign chars:
> 
> • The software needs to know what encoding & char set is used in order
> to interpret the binary stream. If you don't specifically set it,
> typically it assumes ASCII or some ISO Latin char set (for software in
> the USA, anyway).
> 
> • Today's software generally doesn't contain any extra heuristics to
> check if the encoding used is actually correct. There is no technical
> way to check that in general. It can only be heuristics, i.e. guesses.
> E.g. browsers will often guess when reading a page that doesn't have
> encoding info.
> 
> • Even when the encoding is correct, the software needs all the proper
> fonts to display it, or must rely on some font-replacement technology:
> when it finds a char which the current font doesn't have, it uses
> another font for that char. (In the case of Chinese, this often
> results in ugly text of mixed char styles: some appear thin, some
> thick, some squarish (like sans-serif), some calligraphic, some
> bitmapped.) Windows and OS X both have font-replacement technology,
> as do all the major browsers for both OS X and Windows. This font
> replacement technology, however, is not perfect. So sometimes you'll
> see squares or question marks here or there, especially for chars
> that are not widely used (e.g. math symbols in Unicode, double right
> arrow, tech symbols such as Apple's command key and option key, triple
> asterisk, etc.).
> 
> • When writing a file, the software needs to use an encoding to write
> it. Just like reading, if you haven't explicitly set it, typically it
> uses ASCII or some ISO Latin char set, in most Western-language
> countries.
> 
> • When you use software to open a text file with the wrong encoding
> info, the result is gibberish.
> 
> The above applies not just to Emacs but to all apps. Some of this
> commentary is based on my experiences with browsers, web pages, word
> processors, online forums, mailing lists, email apps, instant messaging
> chat apps, etc., on both Mac and Windows.
> 
> Technically, the issues involved are char set, encoding, and font. (The
> concepts of char set and encoding are independent but are often mixed
> together in a spec, esp. earlier ones.)
> 
> I often use mixed Chinese & English in a single file, on both Mac OS X
> and Windows. They work well. On the Mac, my Emacs is version 22.x.
> On Windows, it is Emacs 23. My encoding in Emacs is set to utf-8.
> 
> I've written a lot about these issues; the following docs might be
> helpful.
> 
> • Emacs and Unicode Tips
>   http://xahlee.org/emacs/emacs_n_unicode.html
> 
> • Unicode Characters Example
>   http://xahlee.org/Periodic_dosage_dir/t1/20040505_unicode.html
> 
> • the Journey of a Foreign Character thru Internet
>   http://xahlee.org/Periodic_dosage_dir/t2/non-ascii_journey.html
> 
> • Converting a File's Encoding with Python
>   http://xahlee.org/perl-python/charset_encoding.html
> 
> • Character Sets and Encoding in HTML
>   http://xahlee.org/js/html_chars.html
> 
> • The Complexity And Tedium of Software Engineering (parts about
> unicode problem with unison and emacs)
>   http://xahlee.org/UnixResource_dir/writ/programer_frustration.html
> 
> • Mac and Windows File Conversion (parts about unicode filename
> issues)
>   http://xahlee.org/mswin/mac_windows_file_conv.html
> 
> • Windows Font and Unicode
>   http://xahlee.org/mswin/windows_font_unicode.html
> 
> The above articles contain tens of links to Wikipedia in appropriate
> places. Wikipedia has massive info in digestible form about these
> issues; one can spend a month on the above foreign char issues ...
> 
> For some examples of mixed Chinese & English text I work with, see:
> 
> • Chinese Core Simplified Chars
>   http://xahlee.org/lojban/simplified_chars.html
> 
> • Ethology, Ethnology, and Lyrics
>   http://xahlee.org/Periodic_dosage_dir/sanga_pemci/sanga_pemci.html
> 
>   Xah
> ∑ http://xahlee.org/
> 
> ☄
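
The quoted points about byte streams and encodings can be illustrated
with a minimal Python sketch. Python, the sample string, and the helper
name try_decode are assumptions made here for illustration; none of them
come from the thread. The sketch shows that decoding with the wrong
encoding can silently produce gibberish, and that validation catches
only some mismatches, which is why encoding detection stays a heuristic.

```python
# A minimal sketch, not anyone's actual implementation.
# The sample text and the helper name try_decode are hypothetical.

text = "naïve 中文"
raw = text.encode("utf-8")  # the byte stream a program would read from disk

# With the right encoding, the bytes round-trip to the original text.
assert raw.decode("utf-8") == text

# Latin-1 assigns a character to every possible byte value, so decoding
# never fails; a wrong guess just yields gibberish with no error raised.
wrong = raw.decode("latin-1")


def try_decode(data: bytes, encoding: str):
    """Return the decoded string, or None if the bytes are invalid
    for the given encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# Stricter encodings reject some mismatches, so they can act as a
# partial check: ASCII refuses any byte >= 0x80, and UTF-8 refuses
# byte sequences that break its multi-byte rules.
ascii_result = try_decode(raw, "ascii")
latin1_bytes_as_utf8 = try_decode("é".encode("latin-1"), "utf-8")
```

Here ascii_result and latin1_bytes_as_utf8 are both None (the mismatch
is detected), while wrong is a string that decoded without complaint but
does not equal the original text.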

Totally OT but prima facie the most interesting title is the last. 
Unfortunately I couldn't grok what ethology (the "anthropology" of 
animals) had to do with it unless the critters that emit "The Masochistic 
Cries of Lovelorn Females" are to be considered as less than human. I 
notice that Salt-n-Pepa's sweet little ditty (Don't want no S.D.M.) is 
missing from the list, but maybe that's more sadistic than masochistic; 
maybe it belongs in the Quagmire. ;-) Sexology is a bona fide area of 
inquiry pioneered by Kinsey et al. but sexualogy is not an English word 
nor (I keep my fingers crossed) will it ever become one.


