From: Xah Lee
Newsgroups: gmane.emacs.help
Subject: Re: utf8 char display in buffer
Date: Fri, 12 Jun 2009 10:27:02 -0700 (PDT)
Organization: http://groups.google.com
References: <7I2dndeTy7sqkLLXnZ2dnUVZ_gmdnZ2d@sysmatrix.net>
To: help-gnu-emacs@gnu.org
Original-Newsgroups: gnu.emacs.help,comp.emacs

On Jun 12, 7:54 am, ken wrote:
> B) It would be helpful if the code which does the decoding of a file and
> renders it into the buffer display, if that part of it would throw an
> error message when it encounters a character it doesn't know how to
> display, i.e., when a little box character is displayed. After all,
> isn't it an error when a little box is displayed in lieu of the correct
> character? Possible error messages would be something like: "decoding
> process can't find /path/to/charset.file" or "decoding process doesn't
> have requisite permission to read /path/to/charset.file" or "invalid
> character: [hex/decimal value]" or other.

Some of the reasoning in the above is not correct. In general, a
program just reads a text file as a byte stream and uses an encoding
scheme to interpret it; the program has little way to determine
whether the encoding is correct. Theoretically, it could check the
decoded text against common phrases, but that is generally not done by
the software we use daily. (Some programs do scan the text to guess an
encoding, but the guess is not always correct.)

Here are some general technical issues and experiences with using
foreign chars:

• The software needs to know what encoding & char set is used in order
  to interpret the binary stream. If you don't specifically set it, it
  typically assumes ascii or some iso latin char set. (For software in
  the USA, anyway.)

• Today's software generally doesn't contain any extra heuristics to
  check if the encoding used is actually correct.
  There is no technical way to check that in general. It can only be
  heuristics, i.e. guesses. E.g. browsers will often guess when
  reading a page that doesn't have encoding info.

• Even when the encoding is correct, the software needs all the proper
  fonts to display it. Or it relies on some font-replacement
  technology: when it finds a char which the current font doesn't
  have, it uses another font for that char. (In the case of Chinese,
  this often results in ugly text of mixed char styles: some appear
  thin, some thick, some squarish (like sans-serif), some
  calligraphic, some bitmapped.) Windows and OS X both have
  font-replacement technology, as do all the major browsers for both
  OS X and Windows. This font-replacement technology, however, is not
  perfect. So sometimes you'll see squares or question marks here and
  there, especially for chars that are not widely used (e.g. math
  symbols in unicode, double right arrow, tech symbols such as Apple's
  command key and option key, triple asterisk, etc.).

• When writing a file, the software needs to use an encoding to write
  it. Just like reading, if you haven't explicitly set it, it
  typically uses ascii or some iso latin char set, in most
  western-language countries.

• When you use software to open a text with the wrong encoding info,
  the result is gibberish.

The above applies not just to emacs but to all apps. Some of the
commentary is based on my experiences with browsers, web pages, word
processors, online forums, mailing lists, email apps, instant
messaging chat apps, etc., on both mac and windows.

Technically, the issues involved are char set, encoding, and font.
(The concepts of char set and encoding are independent but are often
mixed together in a spec, esp. earlier ones.)

I often use mixed chinese & english in a single file, on both mac os x
and windows. They work well. On the mac, my emacs is version 22.x. On
win, it is emacs23. My encoding in emacs is set to utf-8.

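To make the reading and writing points above concrete, here is a
minimal Python sketch. It is only an illustration, not how emacs does
it: Python is picked for convenience, and the helper name try_decode
and the sample strings are made up for this example. It shows the same
bytes decoded with the right encoding, with a wrong encoding that
"succeeds" anyway, and with encodings strict enough to reject them.

```python
from typing import Optional


def try_decode(raw: bytes, encoding: str) -> Optional[str]:
    """Decode raw bytes strictly; return None if they are invalid in `encoding`.

    Hypothetical helper for this illustration only.
    """
    try:
        return raw.decode(encoding)  # errors="strict" is the default
    except UnicodeDecodeError:
        return None


# Writing: the program must pick an encoding to turn chars into bytes.
text = "• 中文 mixed with English"
data = text.encode("utf-8")

# Reading with the right encoding recovers the text.
right = try_decode(data, "utf-8")

# Reading with latin-1 never fails, because every byte value maps to
# some char. The program gets gibberish and has no error to report.
wrong = try_decode(data, "latin-1")

# ascii rejects any byte >= 0x80, so this decode fails.
ascii_attempt = try_decode(data, "ascii")

# utf-8 has structural rules, so latin-1 bytes are often (not always)
# rejected. This is the kind of heuristic an encoding guesser can use.
latin1_as_utf8 = try_decode("café".encode("latin-1"), "utf-8")

print(right == text)
print(wrong == text)
print(ascii_attempt)
print(latin1_as_utf8)
```

The latin-1 case is the one behind the "little box" complaint: the
decode step finishes without any error, so there is nothing for the
program to report. A failed strict decode (the ascii and utf-8 cases)
is a useful hint that the encoding is wrong, but a successful decode
does not prove it is right.
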
I've written a lot about these issues; the following docs might be
helpful.

• Emacs and Unicode Tips
  http://xahlee.org/emacs/emacs_n_unicode.html

• Unicode Characters Example
  http://xahlee.org/Periodic_dosage_dir/t1/20040505_unicode.html

• the Journey of a Foreign Character thru Internet
  http://xahlee.org/Periodic_dosage_dir/t2/non-ascii_journey.html

• Converting a File's Encoding with Python
  http://xahlee.org/perl-python/charset_encoding.html

• Character Sets and Encoding in HTML
  http://xahlee.org/js/html_chars.html

• The Complexity And Tedium of Software Engineering
  (parts about unicode problem with unison and emacs)
  http://xahlee.org/UnixResource_dir/writ/programer_frustration.html

• Mac and Windows File Conversion
  (parts about unicode filename issues)
  http://xahlee.org/mswin/mac_windows_file_conv.html

• Windows Font and Unicode
  http://xahlee.org/mswin/windows_font_unicode.html

The above articles contain tens of links to Wikipedia in appropriate
places. Wikipedia has massive info in digestible form about these
issues; one can spend a month on the above foreign char issues ...

For some examples of mixed chinese & english text i work with, see:

• Chinese Core Simplified Chars
  http://xahlee.org/lojban/simplified_chars.html

• Ethology, Ethnology, and Lyrics
  http://xahlee.org/Periodic_dosage_dir/sanga_pemci/sanga_pemci.html

  Xah
∑ http://xahlee.org/

☄