From mboxrd@z Thu Jan 1 00:00:00 1970
From: Xah Lee
Newsgroups: gmane.emacs.help
Subject: Re: utf8 char display in buffer
Date: Fri, 12 Jun 2009 10:53:39 -0700 (PDT)
References: <7I2dndeTy7sqkLLXnZ2dnUVZ_gmdnZ2d@sysmatrix.net>
To: help-gnu-emacs@gnu.org
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8

On Jun 12, 7:54 am, ken wrote:
> B) It would be helpful if the code which does the decoding of a file and
> renders it into the buffer display, if that part of it would throw an
> error message when it encounters a character it doesn't know how to
> display, i.e., when a little box character is displayed. After all,
> isn't it an error when a little box is displayed in lieu of the correct
> character? Possible error messages would be something like: "decoding
> process can't find /path/to/charset.file" or "decoding process doesn't
> have requisite permission to read /path/to/charset.file" or "invalid
> character: [hex/decimal value]" or other.

Some of the reasoning above is not correct. In general, a program just
reads a text file as a byte stream and uses an encoding scheme to
interpret it; the program has little way to determine whether the
encoding is correct. Theoretically, it could check for common phrases,
but that is generally not done by the software we use daily. (Some
programs do scan the text to guess an encoding, but the guess is not
always correct.)

Here are some general technical issues and experiences about using
foreign chars:

• The software needs to know what encoding & char set are used in order
to interpret the binary stream. If you don't specifically set it,
typically it assumes ASCII or some ISO Latin char set (for software in
the USA, anyway).

• Today's software generally doesn't contain any extra heuristics to
check whether the encoding used is actually correct.
There is no technical way to check that in general; it can only be
heuristics, i.e. guesses. E.g. browsers will often guess when reading a
page that doesn't have encoding info.

• Even when the encoding is correct, the software needs all the proper
fonts to display it. Or it relies on some font-replacement technology:
when it finds a char which the current font doesn't have, it uses
another font for that char. (In the case of Chinese, this often results
in ugly text of mixed char styles: some appear thin, some thick, some
squarish (like sans-serif), some calligraphic, some bitmapped.) Windows
and OS X both have font-replacement technology, as do all the major
browsers on both. This font-replacement technology, however, is not
perfect. So sometimes you'll see squares or question marks here or
there, especially for chars that aren't widely used (e.g. math symbols
in Unicode, double right arrow, tech symbols such as Apple's command key
and option key, triple asterisk, etc.).

• When writing a file, the software needs to use an encoding to write
it. Just like reading, if you haven't explicitly set it, typically it
uses ASCII or some ISO Latin char set, in most Western-language
countries.

• When you use software to open a text file with the wrong encoding
info, the result is gibberish.

The above applies not just to Emacs but to all apps. Some of the
commentary is based on my experiences with browsers, web pages, word
processors, online forums, mailing lists, email apps, instant messaging
chat apps, etc., on both Mac and Windows.

Technically, the issues involved are char set, encoding, and font. (The
concepts of char set and encoding are independent but are often mixed
together in a spec, esp. earlier ones.)

I often use mixed Chinese & English in a single file, on both Mac OS X
and Windows. They work well. On the Mac, my Emacs is version 22.x. On
Windows, it is Emacs 23. My encoding in Emacs is set to utf-8.
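
The points above about reading bytes under an assumed encoding can be
sketched in Python. This is only an illustration, not how Emacs or any
particular app does it: the sample text, the helper name
`decode_attempts`, and the choice of three encodings are all made up for
the example. It shows that a wrong encoding can decode "successfully"
into gibberish, and that an error only appears when the bytes happen to
be invalid for the chosen encoding.

```python
# Illustrative sketch: sample text, helper name, and encoding list are assumptions.
def decode_attempts(data, encodings):
    """Decode the same bytes under each encoding; None marks a decode error."""
    results = {}
    for name in encodings:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            # The bytes are not valid in this encoding, so the decoder can tell.
            results[name] = None
    return results

sample_text = "中文 and English"
raw_bytes = sample_text.encode("utf-8")  # what actually sits in the file

attempts = decode_attempts(raw_bytes, ["utf-8", "latin-1", "ascii"])
for name, decoded in attempts.items():
    print(name, "->", repr(decoded))
```

Here latin-1 accepts every possible byte, so it never reports an error
and simply yields one wrong char per byte; only ascii fails outright.
That is why a program cannot, in general, tell a wrong encoding from a
right one without guessing.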

I've written a lot about these issues; the following docs might be
helpful.

• Emacs and Unicode Tips
http://xahlee.org/emacs/emacs_n_unicode.html

• Unicode Characters Example
http://xahlee.org/Periodic_dosage_dir/t1/20040505_unicode.html

• the Journey of a Foreign Character thru Internet
http://xahlee.org/Periodic_dosage_dir/t2/non-ascii_journey.html

• Converting a File's Encoding with Python
http://xahlee.org/perl-python/charset_encoding.html

• Character Sets and Encoding in HTML
http://xahlee.org/js/html_chars.html

• The Complexity And Tedium of Software Engineering
(parts about unicode problem with unison and emacs)
http://xahlee.org/UnixResource_dir/writ/programer_frustration.html

• Mac and Windows File Conversion
(parts about unicode filename issues)
http://xahlee.org/mswin/mac_windows_file_conv.html

• Windows Font and Unicode
http://xahlee.org/mswin/windows_font_unicode.html

The above articles contain tens of links to Wikipedia in appropriate
places. Wikipedia has massive info in digestible form about these
issues; one can spend a month on the above foreign char issues ...

For some examples of the mixed Chinese & English text I work with, see:

• Chinese Core Simplified Chars
http://xahlee.org/lojban/simplified_chars.html

• Ethology, Ethnology, and Lyrics
http://xahlee.org/Periodic_dosage_dir/sanga_pemci/sanga_pemci.html

  Xah
∑ http://xahlee.org/

☄

On Jun 12, 9:48 am, "B. T. Raven" wrote:
> I wouldn't be surprised if the gaps and overlaps in the CJK ranges of
> glyphs weren't so complicated that many characters from the following
> encodings may not be included in utf-8, especially if they are not
> precomposed.
> Try some of these encodings to see if some of the empty
> boxes are resolved into characters:
>
> chinese-big5
> chinese-hz
> chinese-iso-7bit
> chinese-iso-8bit
> chinese-iso-8bit-with-esc
> cn-big5
> cn-gb
> cn-gb-2312
> iso-2022-cjk
> iso-2022-cn
> iso-2022-cn-ext

The char sets of most Chinese encodings are subsets of, or identical
to, Unicode's char set. In particular, the current, most widely used
Chinese char set, GB 18030, is actually just Unicode. See
http://en.wikipedia.org/wiki/GB_18030

Note also, that means China's GB 18030 contains the entirety of the
traditional chars in Unicode too. (Though I don't know how Big5 relates
to Unicode.)

Is the list you gave above from Emacs? Emacs's list always seems
strange to me... I haven't really looked into it. Maybe Emacs's list
really does encompass every encoding that has ever existed, but it
could also be just screwed up like many open source things. For
example, it invents its own names by mixing up char set encoding with
the concept of EOL convention.

Btw, who actually coded the low-level char encoding support in Emacs?
Especially Unicode, since it came after the time when Richard Stallman
was still doing the bulk of Emacs. That person deserves admiration. lol.

  Xah
∑ http://xahlee.org/

☄
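
On the GB 18030 point above, here is a minimal sketch, assuming
Python 3 and its built-in `gb18030` and `gb2312` codecs. The sample
chars and the helper name `can_encode` are made up for the example. It
checks that a handful of chars, including a traditional Chinese one,
round-trip through GB 18030, and contrasts that with the older and much
smaller GB 2312 set. It only tests these samples; it does not prove
anything about the whole of Unicode.

```python
# Illustrative sketch: the sample chars and helper name are assumptions.
def can_encode(char, encoding):
    """Return True if the codec has a byte sequence for this char."""
    try:
        char.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

samples = ["中", "國", "∑", "☄", "é", "😀"]

# Every sample should survive encoding to GB 18030 and decoding back.
round_trips = {ch: ch.encode("gb18030").decode("gb18030") == ch for ch in samples}

for ch in samples:
    print(repr(ch),
          "gb18030:", can_encode(ch, "gb18030"),
          "gb2312:", can_encode(ch, "gb2312"))
```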