* Re: utf8 char display in buffer
       [not found] <mailman.227.1244485995.2239.help-gnu-emacs@gnu.org>
@ 2009-06-08 19:10 ` Teemu Likonen
  2009-06-08 19:52 ` Xah Lee
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 32+ messages in thread
From: Teemu Likonen @ 2009-06-08 19:10 UTC (permalink / raw)
  To: gebser; +Cc: GNU Emacs List

On 2009-06-08 14:33 (-0400), ken wrote:

> I already use a few utf8 characters in emacs (and in web pages), but
> recently needed to use a couple more. One is an 'a' with a horizontal
> line above it, the other an 'i' with a vertical line above it. How do
> I input these into a buffer?

Some keyboards (Finnish, for example) can produce those characters
(semi-)directly, but through Emacs's input methods it's possible with
just basic ASCII keys. For example, turn on the "TeX" input method
(C-x RET C-\ TeX RET) and type \=a for "ā" and \=i for "ī".

You can also use the "ucs" input method and type Unicode code points
directly: type u0101 for "ā" and u012b for "ī".

There are probably some language-specific input methods too which may
have even easier ways of inputting these characters.

^ permalink raw reply	[flat|nested] 32+ messages in thread
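The two approaches described in this message can also be driven from Lisp. What follows is a minimal sketch, not something posted in the thread. It assumes Emacs 23 or later, where character codes are Unicode code points; on Emacs 22 the second form may not produce the same characters.

```elisp
;; Activate the TeX input method in the current buffer.
;; This is the Lisp counterpart of C-x RET C-\ TeX RET.
(set-input-method "TeX")

;; Insert the two characters at point by Unicode code point.
;; Assumption: Emacs 23+, where #x0101 is U+0101 and #x012B is U+012B.
(insert (string #x0101 #x012B))
```

Evaluating the forms in a scratch buffer (C-x C-e after each) is enough to try them; neither changes any saved configuration.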
* Re: utf8 char display in buffer
       [not found] <mailman.227.1244485995.2239.help-gnu-emacs@gnu.org>
  2009-06-08 19:10 ` utf8 char display in buffer Teemu Likonen
@ 2009-06-08 19:52 ` Xah Lee
  2009-06-09 10:52   ` ken
  2009-06-08 20:43 ` B. T. Raven
  2009-06-11 12:03 ` Teemu Likonen
  3 siblings, 1 reply; 32+ messages in thread
From: Xah Lee @ 2009-06-08 19:52 UTC (permalink / raw)
  To: help-gnu-emacs

On Jun 8, 11:33 am, ken <geb...@mousecar.com> wrote:
> Hey, group,
>
> I already use a few utf8 characters in emacs (and in web pages), but
> recently needed to use a couple more. One is an 'a' with a horizontal
> line above it, the other an 'i' with a vertical line above it. How do I
> input these into a buffer?

i define keys to insert unicode chars that i frequently use. e.g.

(global-set-key (kbd "<kp-6>") "→")
(global-set-key (kbd "M-i a") "α")
(global-set-key (kbd "M-i b") "β")
(global-set-key (kbd "M-i t") "θ")

you can also insert unicode by its hex value: Alt+x ucs-insert.

There are a few other ways... some more tips here

• Emacs and Unicode Tips
  http://xahlee.org/emacs/emacs_n_unicode.html

  Xah
∑ http://xahlee.org/

☄

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: utf8 char display in buffer
  2009-06-08 19:52 ` Xah Lee
@ 2009-06-09 10:52   ` ken
  0 siblings, 0 replies; 32+ messages in thread
From: ken @ 2009-06-09 10:52 UTC (permalink / raw)
  Cc: help-gnu-emacs

On 06/08/2009 03:52 PM Xah Lee wrote:
>> ....
>
> i define keys to insert unicode chars that i frequently use. e.g.
>
> (global-set-key (kbd "<kp-6>") "→")
> (global-set-key (kbd "M-i a") "α")
> (global-set-key (kbd "M-i b") "β")
> (global-set-key (kbd "M-i t") "θ")

It's probably just me, but with so many foreign characters in use,
remembering all the key mappings becomes more than my little brain can
manage. So I prefer to create a menu of character entities.
html-helper-mode (i.e., not html-mode) already has such a menu, which
I've added to using "(mapcar 'html-helper-add-tag ...". This menu
allows me to look up a 'character' which I can't remember *and* gives
me a reminder of what its key combo is.

My (too old) version of emacs, however, doesn't have a "character
entities" menu for regular (non-html) buffers. I've already got too
much on my plate for the moment, so this isn't a project for me right
now. But later....

> ....
>
> some more tips here
>
> • Emacs and Unicode Tips
>   http://xahlee.org/emacs/emacs_n_unicode.html
>
> ....

Nice web page. (Bookmarked.) Thanks.

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: utf8 char display in buffer
       [not found] <mailman.227.1244485995.2239.help-gnu-emacs@gnu.org>
  2009-06-08 19:10 ` utf8 char display in buffer Teemu Likonen
  2009-06-08 19:52 ` Xah Lee
@ 2009-06-08 20:43 ` B. T. Raven
  2009-06-08 20:49   ` B. T. Raven
                     ` (2 more replies)
  2009-06-11 12:03 ` Teemu Likonen
  3 siblings, 3 replies; 32+ messages in thread
From: B. T. Raven @ 2009-06-08 20:43 UTC (permalink / raw)
  To: help-gnu-emacs

ken wrote:
> Hey, group,
>
> I already use a few utf8 characters in emacs (and in web pages), but
> recently needed to use a couple more. One is an 'a' with a horizontal
> line above it, the other an 'i' with a vertical line above it. How do I
> input these into a buffer?
>
>
> tia,
> ken
>

C-x RET C-\ latin-4-postfix RET

then a, e, i, o, u followed by a hyphen generate vowels with a macron.

If you don't want all these then you could just put something like this
in .emacs

(global-set-key "\C-ca" (lambda () (interactive) (insert ?ā )))
(global-set-key "\C-ci" (lambda () (interactive) (insert ?ī )))

assuming you have these C-c combos free.

Ed

^ permalink raw reply	[flat|nested] 32+ messages in thread
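If more than two such bindings are wanted, the same idea can be written as a loop. This is only a sketch under stated assumptions: `my-macron-bindings` is a hypothetical name, the extra vowels are illustrative, and the C-c letter keys are assumed to be free in your setup. `read-kbd-macro` is used rather than `kbd` because `kbd` was a macro in the Emacs versions current at the time and does not accept a computed argument there.

```elisp
;; Hypothetical alist: key suffix -> character to insert.
(defvar my-macron-bindings
  '(("a" . "ā") ("e" . "ē") ("i" . "ī") ("o" . "ō") ("u" . "ū"))
  "Alist mapping a key typed after C-c to the string it inserts.")

;; Bind C-c a, C-c e, ... to commands that insert the matching character.
;; The backquote splices the string into each lambda, so this works
;; without relying on lexical binding.
(dolist (pair my-macron-bindings)
  (global-set-key (read-kbd-macro (concat "C-c " (car pair)))
                  `(lambda () (interactive) (insert ,(cdr pair)))))
```

Adding another character then only needs a new entry in the alist.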
* Re: utf8 char display in buffer
  2009-06-08 20:43 ` B. T. Raven
@ 2009-06-08 20:49   ` B. T. Raven
  2009-06-08 22:49     ` ken
  2009-06-09 10:24   ` ken
       [not found]   ` <mailman.289.1244543082.2239.help-gnu-emacs@gnu.org>
  2 siblings, 1 reply; 32+ messages in thread
From: B. T. Raven @ 2009-06-08 20:49 UTC (permalink / raw)
  To: help-gnu-emacs

B. T. Raven wrote:
> ken wrote:
>> Hey, group,
>>
>> I already use a few utf8 characters in emacs (and in web pages), but
>> recently needed to use a couple more. One is an 'a' with a horizontal
>> line above it, the other an 'i' with a vertical line above it. How do I
>> input these into a buffer?
>>
>>
>> tia,
>> ken
>>

Oops, I see you said i with VERTICAL line. What is that character?
Any of these? í ï î ì  If so substitute for i with macron below.

>
> C-x ret C-\ latin-4-postfix
>
> then a,e,i,o,u followed by hyphen generate macroned vowels
>
> If you don't want all these then you could just put something like this
> in .emacs
>
> (global-set-key "\C-ca" (lambda () (interactive) (insert ?ā )))
> (global-set-key "\C-ci" (lambda () (interactive) (insert ?ī )))
>
> assuming you have these C-c combos free.
>
> Ed

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: utf8 char display in buffer
  2009-06-08 20:49   ` B. T. Raven
@ 2009-06-08 22:49     ` ken
  0 siblings, 0 replies; 32+ messages in thread
From: ken @ 2009-06-08 22:49 UTC (permalink / raw)
  To: GNU Emacs List

On 06/08/2009 04:49 PM B. T. Raven wrote:
> B. T. Raven wrote:
>> ken wrote:
>>> Hey, group,
>>>
>>> I already use a few utf8 characters in emacs (and in web pages), but
>>> recently needed to use a couple more. One is an 'a' with a horizontal
>>> line above it, the other an 'i' with a vertical line above it. How do I
>>> input these into a buffer?
>>>
>>>
>>> tia,
>>> ken
>>>
>
> Oops, I see you said i with VERTICAL line. What is that character?
> Any of these? í ï î ì  If so substitute for i with macron below.
>
>>....

The Oops is mine. I meant to say "horizontal" for both. So your
previous email did it all for me.

Thanks.

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: utf8 char display in buffer
  2009-06-08 20:43 ` B. T. Raven
  2009-06-08 20:49   ` B. T. Raven
@ 2009-06-09 10:24   ` ken
       [not found]   ` <mailman.289.1244543082.2239.help-gnu-emacs@gnu.org>
  2 siblings, 0 replies; 32+ messages in thread
From: ken @ 2009-06-09 10:24 UTC (permalink / raw)
  To: GNU Emacs List

On 06/08/2009 04:43 PM B. T. Raven wrote:
> ken wrote:
>> Hey, group,
>>
>> I already use a few utf8 characters in emacs (and in web pages), but
>> recently needed to use a couple more. One is an 'a' with a horizontal
>> line above it, the other an 'i' with a horizontal line above it. How do I
>> input these into a buffer?
>>
>>
>> tia,
>> ken
>>
>
> C-x ret C-\ latin-4-postfix
>
> then a,e,i,o,u followed by hyphen generate macroned vowels
>
> If you don't want all these then you could just put something like this
> in .emacs
>
> (global-set-key "\C-ca" (lambda () (interactive) (insert ?ā )))
> (global-set-key "\C-ci" (lambda () (interactive) (insert ?ī )))
>
> assuming you have these C-c combos free.
>
> Ed

Fantastic! But... when I save and close the buffer and then open it up
again, in place of the beautiful and correct characters, there are
little boxes.

I tried using ‘C-x C-m c utf-8 RET’ prior to 'C-x C-f filename'... but
no joy. Same no-go with 'C-x C-m c mule-utf-8 RET'.

The fact that these non-English characters display properly in the
buffer initially tells me that I have the requisite fonts installed. So
what little connection is emacs not making (and how do I tell it to make
that connection)?

Thanks, all.

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: utf8 char display in buffer
       [not found]   ` <mailman.289.1244543082.2239.help-gnu-emacs@gnu.org>
@ 2009-06-09 13:03     ` B. T. Raven
  2009-06-09 14:51       ` ken
       [not found]       ` <mailman.297.1244559110.2239.help-gnu-emacs@gnu.org>
  0 siblings, 2 replies; 32+ messages in thread
From: B. T. Raven @ 2009-06-09 13:03 UTC (permalink / raw)
  To: help-gnu-emacs

ken wrote:
> On 06/08/2009 04:43 PM B. T. Raven wrote:
>> ken wrote:
>>> Hey, group,
>>>
>>> I already use a few utf8 characters in emacs (and in web pages), but
>>> recently needed to use a couple more. One is an 'a' with a horizontal
>>> line above it, the other an 'i' with a horizontal line above it. How do I
>>> input these into a buffer?
>>>
>>>
>>> tia,
>>> ken
>>>
>> C-x ret C-\ latin-4-postfix
>>
>> then a,e,i,o,u followed by hyphen generate macroned vowels
>>
>> If you don't want all these then you could just put something like this
>> in .emacs
>>
>> (global-set-key "\C-ca" (lambda () (interactive) (insert ?ā )))
>> (global-set-key "\C-ci" (lambda () (interactive) (insert ?ī )))
>>
>> assuming you have these C-c combos free.
>>
>> Ed
>
> Fantastic! But... when I save and close the buffer and then open it up
> again, in place of the beautiful and correct characters, there are
> little boxes.

After you see them correctly in the buffer do:

C-x RET c utf-8 RET

then

C-x C-s

Now next time you load that file it should appear correctly.
ā and ī are not in iso-8859-1 and so you must use a more comprehensive
coding system.

>
> I tried using ‘C-x C-m c utf-8 RET’ prior to 'C-x C-f filename'... but
> no joy. Same no-go with 'C-x C-m c mule-utf-8 RET'.
>
> The fact that these non-English characters display properly in the
> buffer initially tells me that I have the requisite fonts installed. So
> what little connection is emacs not making (and how do I tell it to make
> that connection)?

If you use utf-8 a lot you can put ;; -*- coding: utf-8[;] -*- into the
first line of the file. I don't know whether that semicolon in brackets
is needed or not.

>
> Thanks, all.
>
>

^ permalink raw reply	[flat|nested] 32+ messages in thread
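The save-as-UTF-8 step can also be expressed in Lisp. This is a rough sketch of an alternative route, not the recipe given in the message: it uses `set-buffer-file-coding-system` (the command behind C-x RET f), whereas C-x RET c is `universal-coding-system-argument`, which applies a coding system to the next command only.

```elisp
;; Mark the current buffer to be written as UTF-8, then save it.
;; Changing the coding system also marks the buffer as modified,
;; so the save actually rewrites the file.
(set-buffer-file-coding-system 'utf-8)
(save-buffer)

;; After revisiting the file, report which coding system Emacs is
;; using for this buffer, to confirm the file was decoded as UTF-8.
(message "Buffer coding system: %s" buffer-file-coding-system)
```

If the reported coding system is already a UTF-8 variant and the boxes remain, the encoding of the file is probably not the cause, and the display side is the next thing to check.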
* Re: utf8 char display in buffer 2009-06-09 13:03 ` B. T. Raven @ 2009-06-09 14:51 ` ken [not found] ` <mailman.297.1244559110.2239.help-gnu-emacs@gnu.org> 1 sibling, 0 replies; 32+ messages in thread From: ken @ 2009-06-09 14:51 UTC (permalink / raw) To: GNU Emacs List On 06/09/2009 09:03 AM B. T. Raven wrote: > ken wrote: >> On 06/08/2009 04:43 PM B. T. Raven wrote: >>> ken wrote: >>>> .... >>>> >>> C-x ret C-\ latin-4-postfix >>> >>> then a,e,i,o,u followed by hyphen generate macroned vowels >>> >>> .... >> >> Fantastic! But... when I save and close the buffer and then open it up >> again, in place of the beautiful and correct characters, there are >> little boxes. > > After you see then correctly in the buffer do: > > C-x ret c utf-8 > > then > > C-x C-s > > Now next time you load that file it should appear correctly. > ā and ī are not in iso-8859-1 and so you must use a more comprehensive > coding system. Hmmm... it doesn't. Doing everything just as you say above, I still get the little boxes in place of the non-English characters. When after reloading the buffer, I run "describe-coding-system" on this buffer, I get: ============================================= Coding system for saving this buffer: u -- mule-utf-8-unix Default coding system (for new files): u -- mule-utf-8 (alias: utf-8) Coding system for keyboard input: nil Coding system for terminal output: 0 -- iso-latin-9 (alias: iso-8859-15 latin-9 latin-0) Defaults for subprocess I/O: decoding: u -- mule-utf-8 (alias: utf-8) encoding: u -- mule-utf-8 (alias: utf-8) Priority order for recognizing coding systems when reading files: 1. mule-utf-8 (alias: utf-8) 2. iso-latin-1 (alias: iso-8859-1 latin-1) 3. iso-2022-jp (alias: junet) 4. iso-2022-7bit 5. iso-2022-7bit-lock (alias: iso-2022-int-1) 6. iso-2022-8bit-ss2 7. emacs-mule 8. raw-text 9. japanese-shift-jis (alias: shift_jis sjis) 10. chinese-big5 (alias: big5 cn-big5) 11. 
no-conversion (alias: binary) Other coding systems cannot be distinguished automatically from these, and therefore cannot be recognized automatically with the present coding system priorities. The followings are decoded correctly but recognized as iso-2022-7bit-lock: iso-2022-7bit-ss2 iso-2022-7bit-lock-ss2 iso-2022-cn iso-2022-cn-ext iso-2022-jp-2 iso-2022-kr .... ================================================================== I don't know... does utf-8 or mule-utf-8 contain latin-4, greek, and/or German characters? (This file has some of each.) >> >> I tried using ‘C-x C-m c utf-8 RET’ prior to 'C-x C-f filename'... but >> no joy. Same no-go with 'C-x C-m c mule-utf-8 RET'. >> >> The fact that these non-English characters display properly in the >> buffer initially tells me that I have the requisite fonts installed. So >> what little connection is emacs not making (and how do I tell it to make >> that connection)? > > If you use utf-8 a lot you can put ;; -*- coding: utf-8[;] -*- into the > first line of the file. I don't know whether that sem in brackets is > needed or not. Sorry, I should have mentioned that I have this (with the semi-colon) at the top of the file. Let me also say that, though the little boxes appear in the emacs buffer, the proper non-English characters appear when the file is loaded into firefox. (Yeah, this emacs file is an HTML page.) > >> >> Thanks, all. >> >> ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: utf8 char display in buffer [not found] ` <mailman.297.1244559110.2239.help-gnu-emacs@gnu.org> @ 2009-06-10 1:34 ` B. T. Raven 2009-06-10 14:03 ` Lewis Perin 0 siblings, 1 reply; 32+ messages in thread From: B. T. Raven @ 2009-06-10 1:34 UTC (permalink / raw) To: help-gnu-emacs ken wrote: > On 06/09/2009 09:03 AM B. T. Raven wrote: >> ken wrote: >>> On 06/08/2009 04:43 PM B. T. Raven wrote: >>>> ken wrote: >>>>> .... >>>>> >>>> C-x ret C-\ latin-4-postfix >>>> >>>> then a,e,i,o,u followed by hyphen generate macroned vowels >>>> >>>> .... >>> Fantastic! But... when I save and close the buffer and then open it up >>> again, in place of the beautiful and correct characters, there are >>> little boxes. >> After you see then correctly in the buffer do: >> >> C-x ret c utf-8 >> >> then >> >> C-x C-s >> >> Now next time you load that file it should appear correctly. >> ā and ī are not in iso-8859-1 and so you must use a more comprehensive >> coding system. > > Hmmm... it doesn't. Doing everything just as you say above, I still get > the little boxes in place of the non-English characters. > > When after reloading the buffer, I run "describe-coding-system" on this > buffer, I get: > > ============================================= > Coding system for saving this buffer: > u -- mule-utf-8-unix > Default coding system (for new files): > u -- mule-utf-8 (alias: utf-8) > Coding system for keyboard input: > nil > Coding system for terminal output: > 0 -- iso-latin-9 (alias: iso-8859-15 latin-9 latin-0) > Defaults for subprocess I/O: > decoding: u -- mule-utf-8 (alias: utf-8) > encoding: u -- mule-utf-8 (alias: utf-8) > > Priority order for recognizing coding systems when reading files: > 1. mule-utf-8 (alias: utf-8) > 2. iso-latin-1 (alias: iso-8859-1 latin-1) > 3. iso-2022-jp (alias: junet) > 4. iso-2022-7bit > 5. iso-2022-7bit-lock (alias: iso-2022-int-1) > 6. iso-2022-8bit-ss2 > 7. emacs-mule > 8. raw-text > 9. japanese-shift-jis (alias: shift_jis sjis) > 10. 
chinese-big5 (alias: big5 cn-big5) > 11. no-conversion (alias: binary) > > Other coding systems cannot be distinguished automatically > from these, and therefore cannot be recognized automatically > with the present coding system priorities. > > The followings are decoded correctly but recognized as iso-2022-7bit-lock: > iso-2022-7bit-ss2 iso-2022-7bit-lock-ss2 iso-2022-cn iso-2022-cn-ext > iso-2022-jp-2 iso-2022-kr > > .... > ================================================================== > > I don't know... does utf-8 or mule-utf-8 contain latin-4, greek, and/or > German characters? (This file has some of each.) > > >>> I tried using ‘C-x C-m c utf-8 RET’ prior to 'C-x C-f filename'... but >>> no joy. Same no-go with 'C-x C-m c mule-utf-8 RET'. >>> >>> The fact that these non-English characters display properly in the >>> buffer initially tells me that I have the requisite fonts installed. So >>> what little connection is emacs not making (and how do I tell it to make >>> that connection)? >> If you use utf-8 a lot you can put ;; -*- coding: utf-8[;] -*- into the >> first line of the file. I don't know whether that sem in brackets is >> needed or not. > > Sorry, I should have mentioned that I have this (with the semi-colon) at > the top of the file. > > Let me also say that, though the little boxes appear in the emacs > buffer, the proper non-English characters appear when the file is loaded > into firefox. (Yeah, this emacs file is an HTML page.) > > > >>> Thanks, all. Don't know. Your problem has just escalated above my pay grade. I don't know what it means that the files display okay in FF. I just loaded my .emacs into the browser and it looks fine (has many exotic non Latin-1 characters in it). You are using GUI Emacs and not terminal, right. 
You could try these settings from my ver 22 .emacs, just for fun:

(set-language-environment 'UTF-8)
(set-default-coding-systems 'utf-8)
(setq file-name-coding-system 'utf-8)
(setq default-buffer-file-coding-system 'utf-8)
(setq coding-system-for-write 'utf-8)
(set-keyboard-coding-system 'utf-8)
(set-terminal-coding-system 'utf-8)
(set-clipboard-coding-system 'utf-8)
(set-selection-coding-system 'utf-8)
(prefer-coding-system 'utf-8)
(modify-coding-system-alist 'process
  "[cC][mM][dD][pP][rR][oO][xX][yY]" 'utf-8-dos)


and try C-x ret c utf-8
C-x C-f

to open the file.

or install version 23.x w32 binary into a different directory from here

http://alpha.gnu.org/gnu/emacs/pretest/windows/

I don't think you need a .emacs with ver 23 in dealing with utf-8 since
its internal representation is unicode.

^ permalink raw reply	[flat|nested] 32+ messages in thread
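One caution about the list of settings in this message: the documentation for `coding-system-for-write` says the variable is meant to be bound with `let` around a single write, not set globally. A minimal sketch of that narrower use follows; the function name `my-write-region-as-utf-8` is hypothetical and not part of Emacs or of the settings posted here.

```elisp
(defun my-write-region-as-utf-8 (start end file)
  "Write the text between START and END to FILE, encoded as UTF-8.
Hypothetical helper for illustration only."
  ;; The binding is in effect only for this one call to `write-region',
  ;; so other buffers keep their own coding systems.
  (let ((coding-system-for-write 'utf-8))
    (write-region start end file)))
```

For example, `(my-write-region-as-utf-8 (point-min) (point-max) "/tmp/out.txt")` would write the whole buffer to a scratch file; the path is only an example.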
* Re: utf8 char display in buffer 2009-06-10 1:34 ` B. T. Raven @ 2009-06-10 14:03 ` Lewis Perin 2009-06-11 3:21 ` B. T. Raven 0 siblings, 1 reply; 32+ messages in thread From: Lewis Perin @ 2009-06-10 14:03 UTC (permalink / raw) To: help-gnu-emacs I've been following this thread closely because I have the original poster's problem, only the characters that give me trouble are some - not many, actually - Chinese characters, e.g. ni3, the normal second person pronoun. And, as with the original poster, the troublesome characters, when copied and pasted to other applications from Emacs, display perfectly. "B. T. Raven" <nihil@nihilo.net> writes: > [...] > (set-language-environment 'UTF-8) > (set-default-coding-systems 'utf-8) > (setq file-name-coding-system 'utf-8) > (setq default-buffer-file-coding-system 'utf-8) > (setq coding-system-for-write 'utf-8) > (set-keyboard-coding-system 'utf-8) > (set-terminal-coding-system 'utf-8) > (set-clipboard-coding-system 'utf-8) > (set-selection-coding-system 'utf-8) > (prefer-coding-system 'utf-8) > (modify-coding-system-alist 'process > "[cC][mM][dD][pP][rR][oO][xX][yY]" 'utf-8-dos) > > > and try C-x ret c utf-8 > C-x C-f > > to open the file. I tried this, but it didn't help. Emacs 22.3 / Win32. /Lew --- Lew Perin / perin@acm.org http://www.panix.com/~perin/babelcarp.html ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: utf8 char display in buffer 2009-06-10 14:03 ` Lewis Perin @ 2009-06-11 3:21 ` B. T. Raven 2009-06-12 14:54 ` ken [not found] ` <mailman.522.1244818530.2239.help-gnu-emacs@gnu.org> 0 siblings, 2 replies; 32+ messages in thread From: B. T. Raven @ 2009-06-11 3:21 UTC (permalink / raw) To: help-gnu-emacs Lewis Perin wrote: > I've been following this thread closely because I have the original > poster's problem, only the characters that give me trouble are some - > not many, actually - Chinese characters, e.g. ni3, the normal second > person pronoun. And, as with the original poster, the troublesome > characters, when copied and pasted to other applications from Emacs, > display perfectly. > > "B. T. Raven" <nihil@nihilo.net> writes: > >> [...] >> (set-language-environment 'UTF-8) >> (set-default-coding-systems 'utf-8) >> (setq file-name-coding-system 'utf-8) >> (setq default-buffer-file-coding-system 'utf-8) >> (setq coding-system-for-write 'utf-8) >> (set-keyboard-coding-system 'utf-8) >> (set-terminal-coding-system 'utf-8) >> (set-clipboard-coding-system 'utf-8) >> (set-selection-coding-system 'utf-8) >> (prefer-coding-system 'utf-8) >> (modify-coding-system-alist 'process >> "[cC][mM][dD][pP][rR][oO][xX][yY]" 'utf-8-dos) >> >> >> and try C-x ret c utf-8 >> C-x C-f >> >> to open the file. > > I tried this, but it didn't help. Emacs 22.3 / Win32. Even on Emacs 23 although I see the characters in the buffer, I can't save the following as utf-8: nǐ hǎo 你 好 u+4f60 and u+597d Or at least not so as to be readable with 22.3. Both versions are using Arial Unicode MS. Why is that? > > /Lew > --- > Lew Perin / perin@acm.org > http://www.panix.com/~perin/babelcarp.html ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: utf8 char display in buffer
  2009-06-11  3:21 ` B. T. Raven
@ 2009-06-12 14:54   ` ken
  2009-06-13  3:30     ` Eli Zaretskii
       [not found]   ` <mailman.522.1244818530.2239.help-gnu-emacs@gnu.org>
  1 sibling, 1 reply; 32+ messages in thread
From: ken @ 2009-06-12 14:54 UTC (permalink / raw)
  To: GNU Emacs List

Ed,

Thanks for distributing.

Everyone responding to this thread,

Please either CC me when posting about this issue or else edit the "To"
field so that your response comes to the whole list. I'd like to get
everyone's input. Thanks.

Lewis,

Thanks for posting. It's lonely out there when you're the only one with
a particular problem.

To make sure we're suffering the same cyber-indignity, here's the
scenario as I see it (from an older version of emacs running on Linux):

0) Some others and myself want to include some non-English characters in
a file being edited in emacs. Problems arise, however:

1) In a buffer which is already utf-8 encoded, I set the appropriate
input method, type in the desired characters. They display just peachy
and there is happiness in EmacsLand.

2) I save the buffer to a file, then close the buffer.

3) I visit the same file (i.e., load it again into emacs). Because it
has <!-- -*- coding: utf-8; -*- --> as the first line, it opens
utf-8 encoded. This is confirmed by the presence of a 'u' as the second
character in the status bar.

4) The text in the buffer displays fine, except that in place of each of
those non-English characters is a little empty box. With the cursor on
one of those boxes, an 'a' with a horizontal bar above it, doing "C-x
=", emacs returns "Char: ā (01210041, 331809, 0x51021, file ...)".
(While, in emacs the character after "Char:" is a little box, if I load
this same file into Firefox, that same character appears as it should,
as an 'a' with a horizontal bar above it. How it appears in your email
client will depend upon your email client.)

A) The fact that, as described in (4), the characters display correctly
in Firefox but not in emacs indicates that emacs is not drawing on the
needed character set. Yet the fact that in (1) the characters initially
display correctly (when first input) indicates that the needed
character set is present on the system and that emacs can find it and
has permission to access it. Further, we would think that emacs would
throw out an error message if either of these conditions were not
met... and it doesn't. We can only assume that, when visiting and then
decoding a file and pulling it into a buffer for display, emacs is not
even asking for the proper character set when encountering a
non-English character. This is where I would start to look for the
error.

B) It would be helpful if the code which decodes a file and renders it
into the buffer display would throw an error message when it encounters
a character it doesn't know how to display, i.e., when a little box
character is displayed. After all, isn't it an error when a little box
is displayed in lieu of the correct character? Possible error messages
would be something like: "decoding process can't find
/path/to/charset.file" or "decoding process doesn't have requisite
permission to read /path/to/charset.file" or "invalid character:
[hex/decimal value]" or other.


On 06/10/2009 11:21 PM B. T. Raven wrote:
> Lewis Perin wrote:
>> I've been following this thread closely because I have the original
>> poster's problem, only the characters that give me trouble are some -
>> not many, actually - Chinese characters, e.g. ni3, the normal second
>> person pronoun. And, as with the original poster, the troublesome
>> characters, when copied and pasted to other applications from Emacs,
>> display perfectly.
>>
>> "B. T. Raven" <nihil@nihilo.net> writes:
>>
>>> [...]
>>> (set-language-environment 'UTF-8) >>> (set-default-coding-systems 'utf-8) >>> (setq file-name-coding-system 'utf-8) >>> (setq default-buffer-file-coding-system 'utf-8) >>> (setq coding-system-for-write 'utf-8) >>> (set-keyboard-coding-system 'utf-8) >>> (set-terminal-coding-system 'utf-8) >>> (set-clipboard-coding-system 'utf-8) >>> (set-selection-coding-system 'utf-8) >>> (prefer-coding-system 'utf-8) >>> (modify-coding-system-alist 'process >>> "[cC][mM][dD][pP][rR][oO][xX][yY]" 'utf-8-dos) >>> >>> >>> and try C-x ret c utf-8 >>> C-x C-f >>> >>> to open the file. >> >> I tried this, but it didn't help. Emacs 22.3 / Win32. > > Even on Emacs 23 although I see the characters in the buffer, I can't > save the following as utf-8: > > nǐ hǎo 你 好 > u+4f60 and u+597d > > Or at least not so as to be readable with 22.3. Both versions are using > Arial Unicode MS. > > Why is that? > > >> >> /Lew >> --- >> Lew Perin / perin@acm.org >> http://www.panix.com/~perin/babelcarp.html ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: utf8 char display in buffer
  2009-06-12 14:54   ` ken
@ 2009-06-13  3:30     ` Eli Zaretskii
  0 siblings, 0 replies; 32+ messages in thread
From: Eli Zaretskii @ 2009-06-13 3:30 UTC (permalink / raw)
  To: help-gnu-emacs

> Date: Fri, 12 Jun 2009 10:54:23 -0400
> From: ken <gebser@mousecar.com>
>
> 1) In a buffer which is already utf-8 encoded, I set the appropriate
> input method, type in the desired characters. They display just peachy
> and there is happiness in EmacsLand.
>
> 2) I save the buffer to a file, then close the buffer.
>
> 3) I visit the same file (i.e., load it again into emacs). Because it
> has <!-- -*- coding: utf-8; -*- --> as the first line, it opens
> utf-8 encoded. This is confirmed by the presence of a 'u' as the second
> character in the status bar.
>
> 4) The text in the buffer displays fine, except that in place of each of
> those non-English characters is a little empty box. With the cursor on
> one of those boxes, an 'a' with a horizontal bar above it, doing "C-x
> =", emacs returns "Char: ā (01210041, 331809, 0x51021, file ...)".

Please post here the full output of "C-u C-x =" (from a buffer popped
up by Emacs) for these characters, both when you type them using the
appropriate input method and they are displayed correctly (as in 1)
above), and when you see them as empty boxes after revisiting the file.
The differences between these two cases should give you a hint what is
wrong; if not, someone else here might have ideas.

^ permalink raw reply	[flat|nested] 32+ messages in thread
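The comparison requested in this message can also be made from Lisp. This is a small sketch, not part of the original reply: `describe-char` is the function behind C-u C-x =, and `char-charset` reports the charset Emacs assigns to a character, which is one of the fields worth comparing between the freshly typed character and the one read back from the file.

```elisp
;; Show the full description of the character at point,
;; i.e. the same buffer that C-u C-x = pops up.
(describe-char (point))

;; Print just the character code and charset in the echo area,
;; which makes the two cases easy to compare side by side.
(let ((ch (char-after (point))))
  (when ch                              ; nil at end of buffer
    (message "code: #x%x  charset: %s" ch (char-charset ch))))
```

Run it once with point on a correctly displayed character and once on a box. If the charsets differ, that would suggest the two characters are being looked up in different fonts, though that is only one possible explanation.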
* Re: utf8 char display in buffer [not found] ` <mailman.522.1244818530.2239.help-gnu-emacs@gnu.org> @ 2009-06-12 15:39 ` Lewis Perin 2009-06-12 16:48 ` B. T. Raven 2009-06-12 17:27 ` Xah Lee 1 sibling, 1 reply; 32+ messages in thread From: Lewis Perin @ 2009-06-12 15:39 UTC (permalink / raw) To: help-gnu-emacs ken <gebser@mousecar.com> writes: > [...] > Lewis, > > Thanks for posting. It's lonely out there when you're the only one with > a particular problem. The few, the proud... > To make sure we're suffering the same cyber-indignity, here's the > scenario as I see it (from an older version of emacs running on > Linux): > > 0) Some others and myself want to include some non-English characters in > a file being edited in emacs. Problems arise, however: > > 1) In a buffer which is already utf-8 encoded, I set the appropriate > input method, type in the desired characters. They display just peachy > and there is happiness in EmacsLand. > > 2) I save the buffer to a file, then close the buffer. > > 3) I visit the same file (i.e., load it again into emacs). Because it > has <!-- -*- coding: utf-8; -*- --> as the first line, it opens > utf-8 encoded. This is confirmed by the presence of a 'u' as the second > character in the status bar. I haven't been inserting that special first line. > 4) The text in the buffer displays fine, except that in place of each of > those non-English characters is a little empty box. With the cursor on > one of those boxes, an 'a' with a horizontal bar above it, doing "C-x > =", emacs returns "Char: ā (01210041, 331809, 0x51021, file ...)". > (While, in emacs the character after "Char:" is a little box, if I load > this same file into Firefox, that same character appears as it should, > as an 'a' with a horizontal bar above it. How it appears in your email > client will depend upon your email client.) My situation differs in that most of the non-ASCII characters (Chinese in my case) come through just fine. 
But the ones that don't have those irritating boxes in place of the correct glyphs. /Lew --- Lew Perin / perin@acm.org http://www.panix.com/~perin/babelcarp.html ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: utf8 char display in buffer 2009-06-12 15:39 ` Lewis Perin @ 2009-06-12 16:48 ` B. T. Raven 2009-06-12 17:45 ` Lewis Perin 2009-06-12 17:53 ` Xah Lee 0 siblings, 2 replies; 32+ messages in thread From: B. T. Raven @ 2009-06-12 16:48 UTC (permalink / raw) To: help-gnu-emacs Lewis Perin wrote: > ken <gebser@mousecar.com> writes: > >> [...] >> Lewis, >> >> Thanks for posting. It's lonely out there when you're the only one with >> a particular problem. > > The few, the proud... > >> To make sure we're suffering the same cyber-indignity, here's the >> scenario as I see it (from an older version of emacs running on >> Linux): >> >> 0) Some others and myself want to include some non-English characters in >> a file being edited in emacs. Problems arise, however: >> >> 1) In a buffer which is already utf-8 encoded, I set the appropriate >> input method, type in the desired characters. They display just peachy >> and there is happiness in EmacsLand. >> >> 2) I save the buffer to a file, then close the buffer. >> >> 3) I visit the same file (i.e., load it again into emacs). Because it >> has <!-- -*- coding: utf-8; -*- --> as the first line, it opens >> utf-8 encoded. This is confirmed by the presence of a 'u' as the second >> character in the status bar. > > I haven't been inserting that special first line. > >> 4) The text in the buffer displays fine, except that in place of each of >> those non-English characters is a little empty box. With the cursor on >> one of those boxes, an 'a' with a horizontal bar above it, doing "C-x >> =", emacs returns "Char: ā (01210041, 331809, 0x51021, file ...)". >> (While, in emacs the character after "Char:" is a little box, if I load >> this same file into Firefox, that same character appears as it should, >> as an 'a' with a horizontal bar above it. How it appears in your email >> client will depend upon your email client.) 
> > My situation differs in that most of the non-ASCII characters (Chinese > in my case) come through just fine. But the ones that don't have > those irritating boxes in place of the correct glyphs. > > /Lew > --- > Lew Perin / perin@acm.org > http://www.panix.com/~perin/babelcarp.html I wouldn't be surprised if the gaps and overlaps in the CJK ranges of glyphs weren't so complicated that many characters from the following encodings may not be included in utf-8, especially if they are not precomposed. Try some of these encodings to see if some of the empty boxes are resolved into characters: chinese-big5 chinese-hz chinese-iso-7bit chinese-iso-8bit chinese-iso-8bit-with-esc cn-big5 cn-gb cn-gb-2312 iso-2022-cjk iso-2022-cn iso-2022-cn-ext Also it might help to install a fontset rather than depending on a single font to represent all these characters. Unfortunately I can't help with that. I am on w32 and I don't even know whether fontsets can be used in Emacs on that build. Ed ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: utf8 char display in buffer 2009-06-12 16:48 ` B. T. Raven @ 2009-06-12 17:45 ` Lewis Perin 2009-06-12 17:53 ` Xah Lee 1 sibling, 0 replies; 32+ messages in thread From: Lewis Perin @ 2009-06-12 17:45 UTC (permalink / raw) To: help-gnu-emacs "B. T. Raven" <nihil@nihilo.net> writes: > Lewis Perin wrote: > > ken <gebser@mousecar.com> writes: > > > >> [...] > >> Lewis, > >> > >> Thanks for posting. It's lonely out there when you're the only one with > >> a particular problem. > > The few, the proud... > > > >> To make sure we're suffering the same cyber-indignity, here's the > >> scenario as I see it (from an older version of emacs running on > >> Linux): > >> > >> 0) Some others and myself want to include some non-English characters in > >> a file being edited in emacs. Problems arise, however: > >> > >> 1) In a buffer which is already utf-8 encoded, I set the appropriate > >> input method, type in the desired characters. They display just peachy > >> and there is happiness in EmacsLand. > >> > >> 2) I save the buffer to a file, then close the buffer. > >> > >> 3) I visit the same file (i.e., load it again into emacs). Because it > >> has <!-- -*- coding: utf-8; -*- --> as the first line, it opens > >> utf-8 encoded. This is confirmed by the presence of a 'u' as the second > >> character in the status bar. > > I haven't been inserting that special first line. > > > >> 4) The text in the buffer displays fine, except that in place of each of > >> those non-English characters is a little empty box. With the cursor on > >> one of those boxes, an 'a' with a horizontal bar above it, doing "C-x > >> =", emacs returns "Char: ā (01210041, 331809, 0x51021, file ...)". > >> (While, in emacs the character after "Char:" is a little box, if I load > >> this same file into Firefox, that same character appears as it should, > >> as an 'a' with a horizontal bar above it. How it appears in your email > >> client will depend upon your email client.) 
> > My situation differs in that most of the non-ASCII characters > > (Chinese > > in my case) come through just fine. But the ones that don't have > > those irritating boxes in place of the correct glyphs. > > I wouldn't be surprised if the gaps and overlaps in the CJK ranges of > glyphs weren't so complicated that many characters from the following > encodings may not be included in utf-8, Sorry, I'm not sure what you mean by "may not be included in utf-8": do you mean utf-8 the standard, or do you mean Emacs's implementation of it? The characters I'm talking about are definitely in Unicode. > especially if they are not precomposed. This I don't really understand, either, I'm afraid. Might this explain why I can see the glyph for ni3 when I'm composing Chinese in Emacs using the chinese-tonepy-punct input method but can't see it when the saved file is read by Emacs? > Try some of these encodings to see if some of the empty boxes are > resolved into characters: > [...] > cn-gb-2312 I created a little file with my bête noire character using that encoding and saved it. Reverting the file with that encoding, I did see all the characters. > Also it might help to install a fontset rather than depending on a > single font to represent all these characters. Unfortunately I can't > help with that. I am on w32 and I don't even know whether fontsets can > be used in Emacs on that build. Windows R Us, too. /Lew --- Lew Perin / perin@acm.org http://www.panix.com/~perin/babelcarp.html ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: utf8 char display in buffer 2009-06-12 16:48 ` B. T. Raven 2009-06-12 17:45 ` Lewis Perin @ 2009-06-12 17:53 ` Xah Lee 2009-06-12 20:59 ` Lennart Borgman ` (2 more replies) 1 sibling, 3 replies; 32+ messages in thread From: Xah Lee @ 2009-06-12 17:53 UTC (permalink / raw) To: help-gnu-emacs On Jun 12, 7:54 am, ken <geb...@mousecar.com> wrote: > B) It would be helpful if the code which does the decoding of a file and > renders it into the buffer display, if that part of it would throw an > error message when it encounters a character it doesn't know how to > display, i.e., when a little box character is displayed. After all, > isn't it an error when a little box is displayed in lieu of the correct > character? Possible error messages would be something like: "decoding > process can't find /path/to/charset.file" or "decoding process doesn't > have requisite permission to read /path/to/charset.file" or "invalid > character: [hex/decimal value]" or other. some thought process in the above is not correct. In general, a program just read a text file as a byte stream, and using a encoding scheme to interpret it, the program has little way to determine if the encoding is correct. Theoretically, it could check with common phrases but that is generally not done by the software we use daily. (some program does scan text guess a encoding, but not always correct) here's some general technical issues and experiences about using foreign chars: • the software needs to know what encoding & char set is used in order to interpret the binary stream. If you don't specifically set it, typically it assumes ascii or some iso latin char set. (of software in USA anyway) • today's software generally don't contain any extra heuristics to check if the encoding used is actually correct. There is no technical way to check that in general. It can be only heuristics, i.e. guesses. e.g. browsers will often guess when reading a page that doesn't have encoding info. 
• even when the encoding is correct, the software needs all the proper fonts to display it. Or it relies on some font-replacement technology, e.g. when it finds a char which the current font doesn't have, it uses another font for that char. (in the case of Chinese, this often results in ugly text of mixed char style, some appear thin, some thick, some squarish (like sans-serif), some calligraphic, some bitmapped) Windows OS and OS X both have font-replacement technology, as do all the major browsers for both os x and windows. This font replacement technology, however, is not perfect. So, sometimes you'll see squares or question marks here or there, especially on some chars that are not widely used (e.g. math symbols in unicode, double right arrow, tech symbols such as Apple's command key and option key, triple asterisk, etc.). • when writing a file, the software needs to use an encoding to write it. Just like reading, if you haven't explicitly set it, typically it uses ascii or some iso latin char set, in most western lang countries. • when you use a software to open a text but with wrong encoding info, the result is gibberish. the above applies not just to emacs, but applies to all apps. Some of this commentary is based on my experiences with browsers, web pages, word processors, online forums, mailing lists, email apps, instant messaging chat apps, etc, on both mac and windows. technically, the issues involved are char set, encoding, font. (the concepts of char set and encoding are independent but are often mixed together in a spec, esp earlier ones). i use mixed chinese & english in single file often and in both mac os x and windows. They work well. On the mac, my emacs is version 22.x. On win, it is emacs23. My encoding in emacs is set to utf-8. I've written a lot about these issues, the following docs might be helpful. 
• Emacs and Unicode Tips http://xahlee.org/emacs/emacs_n_unicode.html • Unicode Characters Example http://xahlee.org/Periodic_dosage_dir/t1/20040505_unicode.html • the Journey of a Foreign Character thru Internet http://xahlee.org/Periodic_dosage_dir/t2/non-ascii_journey.html • Converting a File's Encoding with Python http://xahlee.org/perl-python/charset_encoding.html • Character Sets and Encoding in HTML http://xahlee.org/js/html_chars.html • The Complexity And Tedium of Software Engineering (parts about unicode problem with unison and emacs) http://xahlee.org/UnixResource_dir/writ/programer_frustration.html • Mac and Windows File Conversion (parts about unicode filename issues) http://xahlee.org/mswin/mac_windows_file_conv.html • Windows Font and Unicode http://xahlee.org/mswin/windows_font_unicode.html the above article contain tens of links to Wikipedia in appropriate places. Wikipedia has massive info in digestible form about these issues, one can spend a month on the above foreign char issues ... for some examples of mixed chinese & english text i work with, see: • Chinese Core Simplified Chars http://xahlee.org/lojban/simplified_chars.html • Ethology, Ethnology, and Lyrics http://xahlee.org/Periodic_dosage_dir/sanga_pemci/sanga_pemci.html Xah ∑ http://xahlee.org/ ☄ On Jun 12, 9:48 am, "B. T. Raven" <ni...@nihilo.net> wrote: > I wouldn't be surprised if the gaps and overlaps in the CJK ranges of > glyphs weren't so complicated that many characters from the following > encodings may not be included in utf-8, especially if they are not > precomposed. Try some of these encodings to see if some of the empty > boxes are resolved into characters: > > chinese-big5 > chinese-hz > chinese-iso-7bit > chinese-iso-8bit > chinese-iso-8bit-with-esc > cn-big5 > cn-gb > cn-gb-2312 > iso-2022-cjk > iso-2022-cn > iso-2022-cn-ext most chinese encodings are subset or identical to unicode's charset. 
In particular, the current, most widely used chinese charset, GB 18030, actually covers all of unicode (it can encode every unicode code point, just with its own byte layout). see http://en.wikipedia.org/wiki/GB_18030 Note also, that means china's GB 18030 contains the entirety of the traditional chars in unicode too. (though, i don't know how big5 relates to unicode) the list you gave above is from emacs? emacs's list always seems strange to me... haven't really looked into it. maybe emacs's list really encompasses all the encodings that have existed, but it also could be just screwed up like many open source things. For example, it invents its own names by mixing up char set encoding with concepts of EOL convention. btw, who actually coded the low level parts of char encoding in emacs? e.g. especially unicode, since it came after the time when richard stallman was still doing the bulk of emacs. That person should be admired. lol. Xah ∑ http://xahlee.org/ ☄ ^ permalink raw reply [flat|nested] 32+ messages in thread
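As a rough illustration of the charset coverage discussed above, here is a minimal Python sketch. It assumes Python's own codec names (`gb18030`, `gb2312`, `big5`) as stand-ins for the Emacs coding systems listed in the thread; the helper names `can_encode` and `round_trips` are made up for this example and are not part of Emacs or of any message here.

```python
# Sketch only: Python codecs stand in for the Emacs coding systems
# named in the thread; results say nothing about Emacs fonts.

def can_encode(text: str, codec: str) -> bool:
    """Return True if every character of `text` is representable in `codec`."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


def round_trips(text: str, codec: str) -> bool:
    """Return True if encoding then decoding gives back the same text."""
    return text.encode(codec).decode(codec) == text


# Chinese text plus the two macron letters from the original question.
sample = "你好 ā ī"

for codec in ("utf-8", "gb18030", "gb2312", "big5", "ascii"):
    print(f"{codec:8} can encode the sample: {can_encode(sample, codec)}")
```

UTF-8 and GB 18030 are both designed to cover every Unicode code point, so the sample should survive a round trip through either one. The older, smaller charsets may or may not cover a given character, which is what the loop lets you check for yourself.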
* Re: utf8 char display in buffer 2009-06-12 17:53 ` Xah Lee @ 2009-06-12 20:59 ` Lennart Borgman 2009-06-12 22:23 ` ken [not found] ` <mailman.536.1244845400.2239.help-gnu-emacs@gnu.org> 2 siblings, 0 replies; 32+ messages in thread From: Lennart Borgman @ 2009-06-12 20:59 UTC (permalink / raw) To: Xah Lee; +Cc: help-gnu-emacs On Fri, Jun 12, 2009 at 7:53 PM, Xah Lee<xahlee@gmail.com> wrote: > the list you gave above is from emacs? emacs's list always seems > strange to me... haven't really looked into it. maybe emacs's list is > really encompassing of all encoding that've existed, but it also could > be just screwed up like many open source things. I do not know these things, but from the discussions on Emacs devel it looks like those coding it in Emacs knows it very well. > For example, it > invents its own names by mixing up char set encoding with concepts of > EOL convention. It is a technical consideration. Hopefully it does not confuse anyone. > btw, who actually coded the low down levels of char encoding in emacs? > e.g. especially unicode, since it came after richard stallman still > doing the bulk of emacs. That person should be admirable. lol. Please look in the change log files. (I think you need to check out the sources to see those. Or look in the web interface for dito of course.) ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: utf8 char display in buffer 2009-06-12 17:53 ` Xah Lee 2009-06-12 20:59 ` Lennart Borgman @ 2009-06-12 22:23 ` ken [not found] ` <e01d8a50906121527k5e77f5abj8c2c44f62f85e537@mail.gmail.com> [not found] ` <mailman.536.1244845400.2239.help-gnu-emacs@gnu.org> 2 siblings, 1 reply; 32+ messages in thread From: ken @ 2009-06-12 22:23 UTC (permalink / raw) To: GNU Emacs List On 06/12/2009 01:53 PM Xah Lee wrote: > On Jun 12, 7:54 am, ken <geb...@mousecar.com> wrote: >> B) It would be helpful if the code which does the decoding of a file and >> renders it into the buffer display, if that part of it would throw an >> error message when it encounters a character it doesn't know how to >> display, i.e., when a little box character is displayed. After all, >> isn't it an error when a little box is displayed in lieu of the correct >> character? Possible error messages would be something like: "decoding >> process can't find /path/to/charset.file" or "decoding process doesn't >> have requisite permission to read /path/to/charset.file" or "invalid >> character: [hex/decimal value]" or other. > > some thought process in the above is not correct. Yet emacs puts a little box in the place of a character it cannot find (or, per your explanation) possibly confused about. The fact remains that the little box is not a correct rendering of the code. It is an error... at least it is for me, because that's not what I typed in. So it is an error. As an error, there should be a corresponding error message, hopefully one (or more) which would help diagnose the problem. It seems obvious that, given the long thread on this issue with no resolution, we could use some help-- like an error message-- which would help in diagnosis. Thanks for the information and the links though. > > .... ^ permalink raw reply [flat|nested] 32+ messages in thread
[parent not found: <e01d8a50906121527k5e77f5abj8c2c44f62f85e537@mail.gmail.com>]
[parent not found: <4A32E6F6.5080501@mousecar.com>]
[parent not found: <E1MFKab-0000GU-Dg@fencepost.gnu.org>]
* Re: utf8 char display in buffer [not found] ` <E1MFKab-0000GU-Dg@fencepost.gnu.org> @ 2009-06-13 12:30 ` ken 0 siblings, 0 replies; 32+ messages in thread From: ken @ 2009-06-13 12:30 UTC (permalink / raw) To: Eli Zaretskii, GNU Emacs List, emacs-devel On 06/13/2009 12:11 AM Eli Zaretskii wrote: >> .... > > Please provide the output of "C-u C-x =" on these characters, both > when they are displayed correctly and when they are displayed as empty > boxes. In a similar post on the same thread Eli Zaretskii wrote: > Please post here the full output of "C-u C-x =" (from a buffer popped > up by Emacs) for these characters, both when you type them using the > appropriate input method and they are displayed correctly (as in 1) > above), and when you see them as empty boxes after revisiting the > file. The differences between these two cases should give you a hint > what is wrong; if not, someone else here might have ideas. Eli, thanks for your response. Here it is: ^[$-1 ¡ is 'a' with a horizontal bar over it. On first inputting it (after doing "set-input-method latin-4-postfix" and before changing the input method to anything else), it appears correctly and "C-u C-x =" yields:
=============================================
character: ^[$-1 ¡ (05140, 2656, 0xa60)
charset: latin-iso8859-4 (Right-Hand Part of Latin Alphabet 4 (ISO/IEC 8859-4): ISO-IR-110)
code point: 96
syntax: word
category: l:Latin
buffer code: 0x84 0xE0
file code: 0xC4 0x81 (encoded by coding system mule-utf-8-unix)
font: -ETL-Fixed-Medium-R-Normal--16-160-72-72-C-80-ISO8859-4
=============================================
When I reload the file (revisit the file), the same character is replaced with a little box. Doing "C-u C-x =" here yields:
=============================================
character: ^[$-1 ¡ (01210041, 331809, 0x51021)
charset: mule-unicode-0100-24ff (Unicode characters of the range U+0100..U+24FF.)
code point: 32 33
syntax: word
category: l:Latin
buffer code: 0x9C 0xF4 0xA0 0xA1
file code: 0xC4 0x81 (encoded by coding system mule-utf-8-unix)
font: -- none --
=============================================
Note: For some reason, possibly related, had difficulty copying the above text from emacs into clipboard (i.e., "M-w" didn't do anything), so had to use a workaround. It seems that this workaround altered the character in question, the one above following each of the two instances of "character:". As for the meaning of the two outputs above, all that I can confidently glean is that, if I want to use non-English characters in emacs, I have to be an expert emacs developer. :) ^ permalink raw reply [flat|nested] 32+ messages in thread
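The two "file code" values reported above can be cross-checked outside Emacs. The following is a small Python sketch, assuming the character in question is U+0101 (LATIN SMALL LETTER A WITH MACRON, the "ā" suggested earlier in the thread); the variable names are made up for the example.

```python
import unicodedata

# Assumption: the character ken typed is U+0101, "a with macron".
ch = "\u0101"

# The bytes a UTF-8 file should contain for this character.
utf8_bytes = ch.encode("utf-8")

# The single byte the same character occupies in ISO 8859-4 (Latin-4),
# the charset named in the first report.
latin4_bytes = ch.encode("iso8859-4")

print(unicodedata.name(ch))
print("utf-8 bytes:    ", utf8_bytes.hex())
print("iso8859-4 byte: ", latin4_bytes.hex())
# 0xE0 minus 0x80 is 96, the "code point" shown in the first report.
print("latin-4 right-hand position:", latin4_bytes[0] - 0x80)
```

Both reports show the same file bytes, 0xC4 0x81, which is the UTF-8 form of U+0101. That suggests the saved file itself is intact. The visible difference between the two reports is the charset Emacs assigns internally and the "font: -- none --" line, so a missing font for the `mule-unicode-0100-24ff` charset looks like the more likely cause, though this sketch cannot confirm that.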
[parent not found: <mailman.536.1244845400.2239.help-gnu-emacs@gnu.org>]
* Re: utf8 char display in buffer [not found] ` <mailman.536.1244845400.2239.help-gnu-emacs@gnu.org> @ 2009-06-13 0:35 ` Xah Lee 0 siblings, 0 replies; 32+ messages in thread From: Xah Lee @ 2009-06-13 0:35 UTC (permalink / raw) To: help-gnu-emacs On Jun 12, 3:23 pm, ken <geb...@mousecar.com> wrote: > On 06/12/2009 01:53 PM Xah Lee wrote: > > > On Jun 12, 7:54 am, ken <geb...@mousecar.com> wrote: > >> B) It would be helpful if the code which does the decoding of a file and > >> renders it into the buffer display, if that part of it would throw an > >> error message when it encounters a character it doesn't know how to > >> display, i.e., when a little box character is displayed. After all, > >> isn't it an error when a little box is displayed in lieu of the correct > >> character? Possible error messages would be something like: "decoding > >> process can't find /path/to/charset.file" or "decoding process doesn't > >> have requisite permission to read /path/to/charset.file" or "invalid > >> character: [hex/decimal value]" or other. > > > some thought process in the above is not correct. > > Yet emacs puts a little box in the place of a character it cannot find > (or, per your explanation) possibly confused about. The fact remains > that the little box is not a correct rendering of the code. It is an > error... at least it is for me, because that's not what I typed in. So > it is an error. As an error, there should be a corresponding error > message, hopefully one (or more) which would help diagnose the problem. > It seems obvious that, given the long thread on this issue with no > resolution, we could use some help-- like an error message-- which would > help in diagnosis. > > Thanks for the information and the links though. i think displaying a error for each char that emacs cannot find a font for is just not feasible. The app can't know whether it used the right encoding. 
And even if the encoding used is correct, it can't deal with possible missing fonts for some of the characters in the char set. i don't have experience in this, but imagine an app gets a byte stream, with a given charset/encoding. With that, it can decode the byte sequences and map them to the code points in the char set. (e.g. utf-8 and utf-16 both don't have a fixed byte-length for chars) After that is done, you get a sequence of code points (i.e. a sequence of integers). At this point, given an integer, you need to map this integer to a character in a font. There are many issues here... a font i guess is a set of glyphs... ultimately a set of integers. I'm not sure what sort of spec or standard specifies what each integer means (i.e. suppose your app now has an integer that represents B. Now suppose your app is set to use font Arial. Now, Arial is a set of integers, but what standard says which integer is B?)... Part of this step is what happens when Arial doesn't have that character. (i'm guessing a font also has data about what character set it contains...) But in any case, finally we'll have a B from font Arial. Then it goes thru the whole display process... overall i think the technology we have today that actually displays fonts and unicode text etc is extremely complex, not to mention vector based fonts and anti-aliasing and font-substitution etc techs. some interesting reading here: http://en.wikipedia.org/wiki/Computer_font http://en.wikipedia.org/wiki/Anti-aliasing http://en.wikipedia.org/wiki/Font_rasterization http://en.wikipedia.org/wiki/Subpixel_rendering http://en.wikipedia.org/wiki/Font-substitution for most modern apps, like browsers, i think they all call OS's APIs to handle it. A few glimpses at the emacs dev list seem to suggest that emacs implements its own display system... 
on one hand it's bad because emacs misses out on all the modern techs developed over 2 decades by Apple or Adobe or Microsoft, or some Open Source work, on the other hand it is admirable in that it does it on its own... sorry am rambling a bit. You are right that the bottom line is that some things just get rendered as squares and that is a problem. Though, my point was that it is unfeasible to issue an error for missing fonts or misinterpretation of the encodings. Part of this is because theoretically there's no way to know that the encoding chosen is correct. Part is because in practice a missing font or badly chosen encoding is very common. If we all stick with ascii, everything is pretty good. If we stick to western langs, things are still not too bad. But once you have chinese, japanese, korean characters, or the occasional use of the many math symbols and greek letters, or add cyrillic/russian alphabets or arabic alphabets ... the chances of a missing font or missing encoding info are very high. i think a large part of the problem is that char set and encoding info is not part of the file. Things have been getting better in the past decade with mime types and the unicode standard. But given a byte stream, after being lucky enough to know it is text, there's still little way to know how to interpret it. The char set and encoding meta data often gets lost, implementations are often not robust, fonts for multi-lang text usually are not there, and font-substitution tech just started. (according to Wikipedia, IE before 7 does not even have font substitution (which means, you really need such a beast as a “unicode font”, namely a font that contains some tens or hundreds of thousands of glyphs)) i think all these issues only started to get addressed in the past decade, with the globalization partly due to the internet. Before, English speakers just stuck with ascii and that was pretty sufficient. 
Each western lang region stuck with its particular encoding for a few special chars in its alphabet. Only when things started to mix did they get more complex, and now with Chinese & japanese etc. With unicode, the use of math symbols also becomes more common. Before that, it was just ascii markup... speaking of this. Emacs and FSF docs still stick with the 1980s `quote hack', and arrows like this -> => ... extremely stupid. Of course i filed polite bug reports, and have argued here heatedly too, but it has basically fallen on deaf ears. Some things are just impossible to progress in the FSF world. Xah ∑ http://xahlee.org/ ☄ ^ permalink raw reply [flat|nested] 32+ messages in thread
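The point made above, that a program often cannot tell whether it picked the right encoding, can be tried out with a short Python sketch. It assumes Python's standard codecs rather than anything Emacs does internally, and `try_decode` is a made-up helper name for this example.

```python
def try_decode(data: bytes, codec: str):
    """Decode `data` with `codec`; return None if the bytes are invalid for it."""
    try:
        return data.decode(codec)
    except UnicodeDecodeError:
        return None


original = "你好 ā"
data = original.encode("utf-8")

# Latin-1 assigns a character to every possible byte, so decoding with the
# wrong encoding succeeds silently and produces gibberish, not an error.
mojibake = try_decode(data, "latin-1")

# ASCII and UTF-8 are stricter: some byte sequences are simply invalid,
# so a wrong guess can sometimes be detected.
as_ascii = try_decode(data, "ascii")
truncated = try_decode(data[:2], "utf-8")  # cuts a 3-byte character short

print("utf-8   :", try_decode(data, "utf-8"))
print("latin-1 :", repr(mojibake))
print("ascii   :", as_ascii)
print("cut off :", truncated)
```

So whether a wrong encoding can be reported at all depends on the encoding being tried: a single-byte charset accepts any byte stream, while a structured encoding such as UTF-8 can reject some input. Neither case says anything about whether a font exists for the decoded characters, which is a separate check.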
* Re: utf8 char display in buffer [not found] ` <mailman.522.1244818530.2239.help-gnu-emacs@gnu.org> 2009-06-12 15:39 ` Lewis Perin @ 2009-06-12 17:27 ` Xah Lee 2009-06-12 19:30 ` Lewis Perin ` (2 more replies) 1 sibling, 3 replies; 32+ messages in thread From: Xah Lee @ 2009-06-12 17:27 UTC (permalink / raw) To: help-gnu-emacs On Jun 12, 7:54 am, ken <geb...@mousecar.com> wrote: > B) It would be helpful if the code which does the decoding of a file and > renders it into the buffer display, if that part of it would throw an > error message when it encounters a character it doesn't know how to > display, i.e., when a little box character is displayed. After all, > isn't it an error when a little box is displayed in lieu of the correct > character? Possible error messages would be something like: "decoding > process can't find /path/to/charset.file" or "decoding process doesn't > have requisite permission to read /path/to/charset.file" or "invalid > character: [hex/decimal value]" or other. some thought process in the above is not correct. In general, a program just reads a text file as a byte stream, and uses an encoding scheme to interpret it; the program has little way to determine if the encoding is correct. Theoretically, it could check with common phrases but that is generally not done by the software we use daily. (some programs do scan the text to guess an encoding, but the guess is not always correct) here's some general technical issues and experiences about using foreign chars: • the software needs to know what encoding & char set is used in order to interpret the binary stream. If you don't specifically set it, typically it assumes ascii or some iso latin char set. (of software in USA anyway) • today's software generally doesn't contain any extra heuristics to check if the encoding used is actually correct. There is no technical way to check that in general. It can be only heuristics, i.e. guesses. e.g. 
browsers will often guess when reading a page that doesn't have encoding info. • even when the encoding is correct, the software needs all the proper fonts to display it. Or it relies on some font-replacement technology, e.g. when it finds a char which the current font doesn't have, it uses another font for that char. (in the case of Chinese, this often results in ugly text of mixed char style, some appear thin, some thick, some squarish (like sans-serif), some calligraphic, some bitmapped) Windows OS and OS X both have font-replacement technology, as do all the major browsers for both os x and windows. This font replacement technology, however, is not perfect. So, sometimes you'll see squares or question marks here or there, especially on some chars that are not widely used (e.g. math symbols in unicode, double right arrow, tech symbols such as Apple's command key and option key, triple asterisk, etc.). • when writing a file, the software needs to use an encoding to write it. Just like reading, if you haven't explicitly set it, typically it uses ascii or some iso latin char set, in most western lang countries. • when you use a software to open a text but with wrong encoding info, the result is gibberish. the above applies not just to emacs, but applies to all apps. Some of this commentary is based on my experiences with browsers, web pages, word processors, online forums, mailing lists, email apps, instant messaging chat apps, etc, on both mac and windows. technically, the issues involved are char set, encoding, font. (the concepts of char set and encoding are independent but are often mixed together in a spec, esp earlier ones). i use mixed chinese & english in single file often and in both mac os x and windows. They work well. On the mac, my emacs is version 22.x. On win, it is emacs23. My encoding in emacs is set to utf-8. I've written a lot about these issues, the following docs might be helpful. 
• Emacs and Unicode Tips http://xahlee.org/emacs/emacs_n_unicode.html • Unicode Characters Example http://xahlee.org/Periodic_dosage_dir/t1/20040505_unicode.html • the Journey of a Foreign Character thru Internet http://xahlee.org/Periodic_dosage_dir/t2/non-ascii_journey.html • Converting a File's Encoding with Python http://xahlee.org/perl-python/charset_encoding.html • Character Sets and Encoding in HTML http://xahlee.org/js/html_chars.html • The Complexity And Tedium of Software Engineering (parts about unicode problem with unison and emacs) http://xahlee.org/UnixResource_dir/writ/programer_frustration.html • Mac and Windows File Conversion (parts about unicode filename issues) http://xahlee.org/mswin/mac_windows_file_conv.html • Windows Font and Unicode http://xahlee.org/mswin/windows_font_unicode.html the above articles contain tens of links to Wikipedia in appropriate places. Wikipedia has massive info in digestible form about these issues, one can spend a month on the above foreign char issues ... for some examples of mixed chinese & english text i work with, see: • Chinese Core Simplified Chars http://xahlee.org/lojban/simplified_chars.html • Ethology, Ethnology, and Lyrics http://xahlee.org/Periodic_dosage_dir/sanga_pemci/sanga_pemci.html Xah ∑ http://xahlee.org/ ☄ ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: utf8 char display in buffer 2009-06-12 17:27 ` Xah Lee @ 2009-06-12 19:30 ` Lewis Perin 2009-06-12 19:43 ` Xah Lee 2009-06-12 20:56 ` B. T. Raven 2009-06-13 20:35 ` Lewis Perin 2 siblings, 1 reply; 32+ messages in thread From: Lewis Perin @ 2009-06-12 19:30 UTC (permalink / raw) To: help-gnu-emacs Xah Lee <xahlee@gmail.com> writes: > [...] > i use mixed chinese & english in single file often and in both mac os > x and windows. They work well. On the mac, my emacs is version 22.x. > On win, it is emacs23. My encoding in emacs is set to utf-8. > > I've wrote a lot about these issues, the following docs might be > helpful. > [...] I'll assume you have no trouble with ni3, the normal second person pronoun, and have a look at your collected works. Thanks! /Lew --- Lew Perin / perin@acm.org http://www.panix.com/~perin/babelcarp.html ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: utf8 char display in buffer 2009-06-12 19:30 ` Lewis Perin @ 2009-06-12 19:43 ` Xah Lee 0 siblings, 0 replies; 32+ messages in thread From: Xah Lee @ 2009-06-12 19:43 UTC (permalink / raw) To: help-gnu-emacs On Jun 12, 12:30 pm, Lewis Perin <pe...@panix.com> wrote: > I'll assume you have no trouble with ni3, the normal second person > pronoun, and have a look at your collected works. Thanks! yeah, no prob with ni3 hao3 你好. This is written in emacs then pasted to google groups. Xah ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: utf8 char display in buffer
From: B. T. Raven @ 2009-06-12 20:56 UTC
To: help-gnu-emacs

Xah Lee wrote:
> On Jun 12, 7:54 am, ken <geb...@mousecar.com> wrote:
>> B) It would be helpful if the code which does the decoding of a file and
>> renders it into the buffer display, if that part of it would throw an
>> error message when it encounters a character it doesn't know how to
>> display, i.e., when a little box character is displayed. After all,
>> isn't it an error when a little box is displayed in lieu of the correct
>> character? Possible error messages would be something like: "decoding
>> process can't find /path/to/charset.file" or "decoding process doesn't
>> have requisite permission to read /path/to/charset.file" or "invalid
>> character: [hex/decimal value]" or other.
>
> some thought process in the above is not correct.
>
> In general, a program just reads a text file as a byte stream and
> uses an encoding scheme to interpret it; the program has little way
> to determine if the encoding is correct. Theoretically, it could check
> with command phrases but that is generally not done by the software we
> use daily. (some programs do scan the text to guess an encoding, but
> not always correctly)
>
> here's some general technical issues and experiences about using
> foreign chars:
>
> • the software needs to know what encoding & char set is used in order
> to interpret the binary stream. If you don't specifically set it,
> typically it assumes ascii or some iso latin char set. (of software in
> USA anyway)
>
> • today's software generally doesn't contain any extra heuristics to
> check if the encoding used is actually correct. There is no technical
> way to check that in general. It can be only heuristics, i.e. guesses.
> e.g. browsers will often guess when reading a page that doesn't have
> encoding info.
>
> • even when the encoding is correct, the software needs all the proper
> fonts to display it. Or, rely on some font-replacement technology,
> e.g. when it finds a char which the current font doesn't have, it uses
> another font for that char. (in the case of Chinese, this often
> results in ugly text of mixed char style, some appear thin, some
> thick, some squarish (like sans-serif), some calligraphic, some
> bitmapped) Windows OS and OS X both have font-replacement technology,
> as well as all the major browsers for both os x and windows. This font
> replacement technology, however, is not perfect. So, sometimes you'll
> see squares or question marks here or there, especially on some chars
> that are not widely used (e.g. math symbols in unicode, double right
> arrow, tech symbols such as Apple's command key and option key, triple
> asterisk, etc.).
>
> • when writing a file, the software needs to use an encoding to write
> it. Just like reading, if you haven't explicitly set it, typically it
> uses ascii or some iso latin char set, in most western lang countries.
>
> • when you use a software to open a text but with wrong encoding info,
> the result is gibberish.
>
> the above applies not just to emacs, but applies to all apps. Some
> of the commentary is based on my experiences with browsers, web pages,
> word processors, online forums, mailing lists, email apps, instant
> messaging chat apps, etc, on both mac and windows.
>
> technically, the issues involved are char set, encoding, font. ( the
> concepts of char set and encoding are independent but are often mixed
> together in a spec, esp earlier ones).
>
> i use mixed chinese & english in a single file often and in both mac os
> x and windows. They work well. On the mac, my emacs is version 22.x.
> On win, it is emacs23. My encoding in emacs is set to utf-8.
>
> I've written a lot about these issues, the following docs might be
> helpful.
>
> • Emacs and Unicode Tips
> http://xahlee.org/emacs/emacs_n_unicode.html
>
> • Unicode Characters Example
> http://xahlee.org/Periodic_dosage_dir/t1/20040505_unicode.html
>
> • the Journey of a Foreign Character thru Internet
> http://xahlee.org/Periodic_dosage_dir/t2/non-ascii_journey.html
>
> • Converting a File's Encoding with Python
> http://xahlee.org/perl-python/charset_encoding.html
>
> • Character Sets and Encoding in HTML
> http://xahlee.org/js/html_chars.html
>
> • The Complexity And Tedium of Software Engineering (parts about
> unicode problem with unison and emacs)
> http://xahlee.org/UnixResource_dir/writ/programer_frustration.html
>
> • Mac and Windows File Conversion (parts about unicode filename
> issues)
> http://xahlee.org/mswin/mac_windows_file_conv.html
>
> • Windows Font and Unicode
> http://xahlee.org/mswin/windows_font_unicode.html
>
> the above articles contain tens of links to Wikipedia in appropriate
> places. Wikipedia has massive info in digestible form about these
> issues, one can spend a month on the above foreign char issues ...
>
> for some examples of mixed chinese & english text i work with, see:
>
> • Chinese Core Simplified Chars
> http://xahlee.org/lojban/simplified_chars.html
>
> • Ethology, Ethnology, and Lyrics
> http://xahlee.org/Periodic_dosage_dir/sanga_pemci/sanga_pemci.html
>
> Xah
> ∑ http://xahlee.org/
>
> ☄

Totally OT but prima facie the most interesting title is the last.
Unfortunately I couldn't grok what ethology (the "anthropology" of
animals) had to do with it unless the critters that emit "The Masochistic
Cries of Lovelorn Females" are to be considered as less than human. I
notice that Salt-n-Pepa's sweet little ditty (Don't want no S.D.M.) is
missing from the list, but maybe that's more sadistic than masochistic;
maybe it belongs in the Quagmire. ;-) Sexology is a bona fide area of
inquiry pioneered by Kinsey et al. but sexualogy is not an English word
nor (I keep my fingers crossed) will it ever become one.

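The quoted points about reading a file as a byte stream (the reader must be told the encoding, and a wrong guess gives gibberish rather than an error) can be sketched in a few lines of Python. This is only an illustration and was not posted in the thread: the sample string, the variable names, and the choice of Latin-1 as the "wrong" encoding are all assumptions.

```python
# Illustration only: sample text and codec choices are assumptions.
text = "ni3 hao3 你好"

# Writing: the characters become bytes under an explicitly chosen encoding.
data = text.encode("utf-8")

# Reading with the same encoding recovers the original text.
round_trip = data.decode("utf-8")

# Reading with the wrong encoding still "succeeds", because every byte
# value is a valid Latin-1 character. The result is gibberish, and the
# decoder has no way to know that.
wrong = data.decode("latin-1")

# A stricter codec such as ASCII can at least reject bytes it cannot map.
try:
    data.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

print(repr(round_trip))
print(repr(wrong))
print("ascii decodes cleanly:", ascii_ok)
```

Only the stricter codec raises an error here. The Latin-1 decode returns a string without complaint, which fits the remark in the quoted message that checking an encoding can only be a heuristic.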
* Re: utf8 char display in buffer
From: Xah Lee @ 2009-06-13 16:16 UTC
To: help-gnu-emacs

On Jun 12, 1:56 pm, "B. T. Raven" <ni...@nihilo.net> wrote:
> > • Ethology, Ethnology, and Lyrics
> > http://xahlee.org/Periodic_dosage_dir/sanga_pemci/sanga_pemci.html
> Totally OT but prima facie the most interesting title is the last.
> Unfortunately I couldn't grok what ethology (the "anthropology" of
> animals) had to do with it unless the critters that emit "The Masochistic
> Cries of Lovelorn Females" are to be considered as less than human. I
> notice that Salt-n-Pepa's sweet little ditty (Don't want no S.D.M.) is
> missing from the list, but maybe that's more sadistic than masochistic;
> maybe it belongs in the Quagmire. ;-) Sexology is a bona fide area of
> inquiry pioneered by Kinsey et al. but sexualogy is not an English word
> nor (I keep my fingers crossed) will it ever become one.

sexualogy = sexology + sexuality. ^_^

ok, now it reads “... respect to ethology and sexuality.”. Simpler and
more fitting.

Xah
∑ http://xahlee.org/

☄

* Re: utf8 char display in buffer
From: Lewis Perin @ 2009-06-13 20:35 UTC
To: help-gnu-emacs

Xah Lee <xahlee@gmail.com> writes:

> [...]
> i use mixed chinese & english in single file often and in both mac os
> x and windows. They work well. On the mac, my emacs is version 22.x.
> On win, it is emacs23. My encoding in emacs is set to utf-8.

Thanks for mentioning v. 23. I just downloaded it, despite my
misgivings about life on the bleeding edge, and my problem with some
Chinese UTF-8 characters' glyphs turning to boxes when the file is
reverted seems to have vanished.

/Lew
---
Lew Perin / perin@acm.org
http://www.panix.com/~perin/babelcarp.html

* Re: utf8 char display in buffer
From: ken @ 2009-06-14 11:47 UTC
To: Lewis Perin
Cc: help-gnu-emacs

On 06/13/2009 04:35 PM Lewis Perin wrote:
> Xah Lee <xahlee@gmail.com> writes:
>
>> [...]
>> i use mixed chinese & english in single file often and in both mac os
>> x and windows. They work well. On the mac, my emacs is version 22.x.
>> On win, it is emacs23. My encoding in emacs is set to utf-8.
>
> Thanks for mentioning v. 23. I just downloaded it, despite my
> misgivings about life on the bleeding edge, and my problem with some
> Chinese UTF-8 characters' glyphs turning to boxes when the file is
> reverted seems to have vanished.
>
> /Lew
> ---
> Lew Perin / perin@acm.org
> http://www.panix.com/~perin/babelcarp.html

Lew (or anyone),

Where did you find v.23? The only place I'm seeing is cvs,
<http://cvs.savannah.gnu.org/viewvc/emacs/?root=emacs>.

* Re: utf8 char display in buffer
From: Bernardo @ 2009-06-15 7:28 UTC
To: help-gnu-emacs

http://alpha.gnu.org/gnu/emacs/pretest/

ken said the following on 14/06/09 21:47:
>
> Lew (or anyone),
>
> Where did you find v.23? The only place I'm seeing is cvs,
> <http://cvs.savannah.gnu.org/viewvc/emacs/?root=emacs>.

* Re: utf8 char display in buffer
From: Teemu Likonen @ 2009-06-11 12:03 UTC
To: help-gnu-emacs

On 2009-06-08 14:33 (-0400), ken wrote:

> I already use a few utf8 characters in emacs (and in web pages), but
> recently needed to use a couple more. One is an 'a' with a horizontal
> line above it, the other an 'i' with a [horizontal] line above it. How
> do I input these into a buffer?

Let’s add one more nice way to insert Unicode chars: “rfc1345” input
method. It’s an input method for Unicode characters using mnemonics.
Examples:

    &a- = ā
    &i- = ī
    &W* = Ω
    &"6 = “
    &"9 = ”

For more info: C-h I rfc1345 RET

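To double-check outside Emacs which characters the mnemonics above correspond to, a small Python sketch can print each one's code point and official Unicode name. This is not from the thread: the dictionary name `examples` is hypothetical, and the characters are written as escapes so the listing does not depend on how this page itself is encoded.

```python
import unicodedata

# Mnemonics copied from the message above; the mapping itself is only an
# illustration and is not read from Emacs's rfc1345 input method.
examples = {
    "&a-": "\u0101",
    "&i-": "\u012b",
    "&W*": "\u03a9",
    '&"6': "\u201c",
    '&"9': "\u201d",
}

for mnemonic, char in examples.items():
    # ord() gives the code point; unicodedata.name() gives the standard name.
    print(f"{mnemonic}  {char}  U+{ord(char):04X}  {unicodedata.name(char)}")
```

The first two code points, 0101 and 012b, are the ones an earlier reply in the thread gave for the “ucs” input method.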
* utf8 char display in buffer
From: ken @ 2009-06-08 18:33 UTC
To: GNU Emacs List

Hey, group,

I already use a few utf8 characters in emacs (and in web pages), but
recently needed to use a couple more. One is an 'a' with a horizontal
line above it, the other an 'i' with a vertical line above it. How do I
input these into a buffer?

tia,
ken

--
"To make an apple pie from scratch, first create the universe."
-- Carl Sagan
