From: Xah Lee
Newsgroups: gmane.emacs.help
Subject: Re: utf8 char display in buffer
Date: Fri, 12 Jun 2009 17:35:11 -0700 (PDT)
Message-ID: <7781c409-7a47-4c61-b968-4c5589c217e1@s1g2000prd.googlegroups.com>
References: <7I2dndeTy7sqkLLXnZ2dnUVZ_gmdnZ2d@sysmatrix.net>
To: help-gnu-emacs@gnu.org

On Jun 12, 3:23 pm, ken wrote:
> On 06/12/2009 01:53 PM Xah Lee wrote:
>
> > On Jun 12, 7:54 am, ken wrote:
> >> B) It would be helpful if the code which does the decoding of a file and
> >> renders it into the buffer display, if that part of it would throw an
> >> error message when it encounters a character it doesn't know how to
> >> display, i.e., when a little box character is displayed. After all,
> >> isn't it an error when a little box is displayed in lieu of the correct
> >> character? Possible error messages would be something like: "decoding
> >> process can't find /path/to/charset.file" or "decoding process doesn't
> >> have requisite permission to read /path/to/charset.file" or "invalid
> >> character: [hex/decimal value]" or other.
>
> > some thought process in the above is not correct.
>
> Yet emacs puts a little box in the place of a character it cannot find
> (or, per your explanation) possibly confused about. The fact remains
> that the little box is not a correct rendering of the code. It is an
> error... at least it is for me, because that's not what I typed in. So
> it is an error. As an error, there should be a corresponding error
> message, hopefully one (or more) which would help diagnose the problem.
> It seems obvious that, given the long thread on this issue with no
> resolution, we could use some help-- like an error message-- which would
> help in diagnosis.
>
> Thanks for the information and the links though.

I think displaying an error for each char that emacs cannot find a font
for is just not feasible. The app can't know whether it used the right
encoding. And even if the encoding used is correct, it can't deal with
possibly missing fonts for some of the characters in the char set.

I don't have experience in this, but imagine an app gets a byte stream
together with a given charset/encoding. With that, it can work out how
many bytes make up each character and map them to code points in the
char set (e.g. utf-8 and utf-16 both don't have a fixed byte length per
char). After that is done, you get a sequence of code points (i.e. a
sequence of integers). At this point, given an integer, you need to map
this integer to a character in a font. There are many issues here... a
font, I guess, is a set of glyphs... ultimately a set of integers. I'm
not sure what sort of spec or standard specifies what each integer means
(i.e. suppose your app now has an integer that represents B. Now suppose
your app is set to use font Arial. Now, Arial is a set of integers, but
what standard says which integer is B?)... Part of this step is what
happens when Arial doesn't have that character. (I'm guessing a font
also has data about what character set it contains...) But in any case,
finally we'll have a B from font Arial. Then it goes thru the whole
display process...

Overall I think the technology we have today that actually displays
fonts and unicode text etc. is extremely complex, not to mention
vector-based fonts, anti-aliasing, font substitution and similar techs.

Some interesting reads here:

http://en.wikipedia.org/wiki/Computer_font
http://en.wikipedia.org/wiki/Anti-aliasing
http://en.wikipedia.org/wiki/Font_rasterization
http://en.wikipedia.org/wiki/Subpixel_rendering
http://en.wikipedia.org/wiki/Font-substitution

For most modern apps, like browsers, I think they all call the OS's APIs
to handle it.
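
The decode-then-look-up steps described above can be sketched in a few
lines of Python. This is only an illustration under stated assumptions:
TOY_FONT, MISSING_GLYPH and the three helper functions are made-up names,
and a real font format and a real display engine (emacs's included) are
far more involved than a dictionary lookup.

```python
# Bytes as they might arrive from a file: "B", "é" and "∑" encoded as UTF-8.
RAW = b"B\xc3\xa9\xe2\x88\x91"

# Hypothetical stand-in for a font's character-to-glyph table.
# This toy font covers only two characters and has nothing for "∑".
TOY_FONT = {0x42: "glyph-B", 0xE9: "glyph-eacute"}
MISSING_GLYPH = "box"  # what gets drawn when the font has no glyph


def decode_to_code_points(data, encoding):
    """Step 1: bytes plus a claimed encoding give a list of integers."""
    return [ord(ch) for ch in data.decode(encoding)]


def utf8_lengths(data):
    """Show that UTF-8 uses a different number of bytes per character."""
    return [len(ch.encode("utf-8")) for ch in data.decode("utf-8")]


def to_glyphs(code_points, font):
    """Step 2: look each code point up in the font, falling back to a box."""
    return [font.get(cp, MISSING_GLYPH) for cp in code_points]


code_points = decode_to_code_points(RAW, "utf-8")
print(code_points)
print(utf8_lengths(RAW))
print(to_glyphs(code_points, TOY_FONT))
```

The decode step succeeds for all three characters; the box only shows up
in the second step, where the toy font has no entry for the last code
point. Nothing in that step says the encoding was wrong.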

Some glimpses over the emacs dev list seem to suggest that emacs
implements its own display system... on one hand that's bad, because
emacs misses out on all the modern techs developed over 2 decades by
Apple or Adobe or Microsoft, or by some Open Source projects; on the
other hand it is admirable in that it does it on its own...

Sorry, am rambling a bit. You are right that the bottom line is that
some things just get rendered as squares, and that is a problem. Though,
my point was that it is infeasible to issue an error for missing fonts
or misinterpretation of the encoding. Part of this is because
theoretically there's no way to know that the encoding chosen is
correct. Part is because in practice a missing font or a badly chosen
encoding is very common. If we all stick with ascii, everything is
pretty good. If we stick to western langs, things are still not too bad.
But once you have Chinese, Japanese or Korean characters, or the
occasional use of the many math symbols and Greek letters, or add the
Cyrillic/Russian or Arabic alphabets... the chance of a missing font or
missing encoding info is very high.

I think a large part of the problem is that char set and encoding info
is not part of the file. Things have been getting better in the past
decade with mime types and the unicode standard. But given a byte
stream, even after being lucky enough to know it is text, there's still
little way to know how to interpret it. The char set and encoding meta
data often gets lost, implementations are often not robust, fonts for
multi-lang text usually are not there, and font-substitution tech has
only just started. (According to Wikipedia, IE before 7 does not even
have font substitution, which means you really need such a beast as a
“unicode font”, namely a font that contains some tens or hundreds of
thousands of glyphs.)

I think all these issues only started to get addressed in the past
decade, since the globalization partly due to the internet.
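
To make the "no way to know the encoding is correct" point concrete,
here is a small Python sketch. The helper name try_decodings is made up
for illustration. It shows that the same bytes decode without any error
under two different encodings, so a successful decode proves little;
only some wrong guesses (such as bytes that are not valid UTF-8) can be
detected at all.

```python
def try_decodings(data, encodings=("utf-8", "latin-1")):
    """Decode the same bytes under several claimed encodings.

    Returns a dict mapping each encoding name to the decoded text,
    or to None when the bytes are not valid in that encoding.
    """
    results = {}
    for name in encodings:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


# Three bytes that are "∑" in UTF-8 are also three valid latin-1 characters.
ambiguous = try_decodings("∑".encode("utf-8"))
print(ambiguous)

# A lone 0xE9 byte is "é" in latin-1 but is not valid UTF-8,
# so this particular wrong guess is detectable.
detectable = try_decodings("é".encode("latin-1"))
print(detectable)
```

In the first case neither decode complains, and the program has no way
to tell which of the two results the author meant without outside
information about the file.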

Before, English speakers just stuck with ascii, and that was pretty
sufficient. Each western lang region stuck with its particular encoding
for the few special chars in its alphabet. Only when things started to
mix did they get more complex, and now there's Chinese & Japanese etc.
With unicode, the use of math symbols also becomes more common. Before
that, it was just ascii markup... speaking of which, Emacs and FSF docs
still stick with the 1980s `quote hack', and arrows like this -> => ...
very extremely stupid. Of course I filed polite bug reports, and have
argued here too, heatedly, but it has basically fallen on deaf ears.
Some things are just impossible to progress in the FSF world.

  Xah
∑ http://xahlee.org/

☄