From: "B. T. Raven"
Newsgroups: gmane.emacs.help
Subject: Re: utf8 char display in buffer
Date: Fri, 12 Jun 2009 15:56:51 -0500
References: <7I2dndeTy7sqkLLXnZ2dnUVZ_gmdnZ2d@sysmatrix.net>
To: help-gnu-emacs@gnu.org
List-Id: Users list for the GNU Emacs text editor

Xah Lee wrote:
> On Jun 12, 7:54 am, ken wrote:
>> B) It would be helpful if the code which does the decoding of a file and
>> renders it into the buffer display, if that part of it would throw an
>> error message when it encounters a character it doesn't know how to
>> display, i.e., when a little box character is displayed. After all,
>> isn't it an error when a little box is displayed in lieu of the correct
>> character? Possible error messages would be something like: "decoding
>> process can't find /path/to/charset.file" or "decoding process doesn't
>> have requisite permission to read /path/to/charset.file" or "invalid
>> character: [hex/decimal value]" or other.
>
> some thought process in the above is not correct.
>
> In general, a program just reads a text file as a byte stream and
> uses an encoding scheme to interpret it; the program has little way
> to determine if the encoding is correct. Theoretically, it could check
> with command phrases but that is generally not done by the software we
> use daily. (some programs do scan the text to guess an encoding, but not
> always correctly)
>
> here's some general technical issues and experiences about using
> foreign chars:
>
> • the software needs to know what encoding & char set is used in order
> to interpret the binary stream.
> If you don't specifically set it,
> typically it assumes ascii or some iso latin char set. (of software in
> USA anyway)
>
> • today's software generally doesn't contain any extra heuristics to
> check if the encoding used is actually correct. There is no technical
> way to check that in general. It can be only heuristics, i.e. guesses.
> e.g. browsers will often guess when reading a page that doesn't have
> encoding info.
>
> • even when the encoding is correct, the software needs all the proper
> fonts to display it. Or, rely on some font-replacement technology,
> e.g. when it finds a char which the current font doesn't have, it uses
> another font for that char. (in the case of Chinese, this often
> results in ugly text of mixed char style, some appear thin, some
> thick, some squarish (like sans-serif), some calligraphic, some
> bitmapped) Windows OS and OS X both have font-replacement technology,
> as well as all the major browsers for both os x and windows. This font
> replacement technology, however, is not perfect. So, sometimes you'll
> see squares or question marks here or there, especially on some chars
> that are not widely used (e.g. math symbols in unicode, double right
> arrow, tech symbols such as Apple's command key and option key, triple
> asterisk, etc.).
>
> • when writing a file, the software needs to use an encoding to write
> it. Just like reading, if you haven't explicitly set it, typically it
> uses ascii or some iso latin char set, in most western lang countries.
>
> • when you use a software to open a text but with wrong encoding info,
> the result is gibberish.
>
> the above applies not just to emacs, but applies to all apps. Some
> commentary is based on my experiences with browsers, web pages, word
> processors, online forums, mailing lists, email apps, instant messaging
> chat apps, etc, on both mac and windows.
>
> technically, the issues involved are char set, encoding, font.
> (the concepts of char set and encoding are independent but are often
> mixed together in a spec, esp earlier ones).
>
> i use mixed chinese & english in a single file often, on both mac os
> x and windows. They work well. On the mac, my emacs is version 22.x.
> On win, it is emacs23. My encoding in emacs is set to utf-8.
>
> I've written a lot about these issues; the following docs might be
> helpful.
>
> • Emacs and Unicode Tips
> http://xahlee.org/emacs/emacs_n_unicode.html
>
> • Unicode Characters Example
> http://xahlee.org/Periodic_dosage_dir/t1/20040505_unicode.html
>
> • the Journey of a Foreign Character thru Internet
> http://xahlee.org/Periodic_dosage_dir/t2/non-ascii_journey.html
>
> • Converting a File's Encoding with Python
> http://xahlee.org/perl-python/charset_encoding.html
>
> • Character Sets and Encoding in HTML
> http://xahlee.org/js/html_chars.html
>
> • The Complexity And Tedium of Software Engineering (parts about
> unicode problem with unison and emacs)
> http://xahlee.org/UnixResource_dir/writ/programer_frustration.html
>
> • Mac and Windows File Conversion (parts about unicode filename
> issues)
> http://xahlee.org/mswin/mac_windows_file_conv.html
>
> • Windows Font and Unicode
> http://xahlee.org/mswin/windows_font_unicode.html
>
> the above articles contain tens of links to Wikipedia in appropriate
> places. Wikipedia has massive info in digestible form about these
> issues, one can spend a month on the above foreign char issues ...
>
> for some examples of mixed chinese & english text i work with, see:
>
> • Chinese Core Simplified Chars
> http://xahlee.org/lojban/simplified_chars.html
>
> • Ethology, Ethnology, and Lyrics
> http://xahlee.org/Periodic_dosage_dir/sanga_pemci/sanga_pemci.html
>
> Xah
> ∑ http://xahlee.org/
>
> ☄

Totally OT but prima facie the most interesting title is the last.
Unfortunately I couldn't grok what ethology (the "anthropology" of
animals) had to do with it unless the critters that emit "The Masochistic
Cries of Lovelorn Females" are to be considered as less than human. I
notice that Salt-n-Pepa's sweet little ditty (Don't want no S.D.M.) is
missing from the list, but maybe that's more sadistic than masochistic;
maybe it belongs in the Quagmire. ;-) Sexology is a bona fide area of
inquiry pioneered by Kinsey et al. but sexualogy is not an English word
nor (I keep my fingers crossed) will it ever become one.