unofficial mirror of help-gnu-emacs@gnu.org
* RE: those funny non-ASCII characters
@ 2012-05-30 17:15 Buchs, Kevin
  2012-05-31  7:17 ` Thien-Thi Nguyen
  2012-05-31 15:59 ` PJ Weisberg
  0 siblings, 2 replies; 25+ messages in thread
From: Buchs, Kevin @ 2012-05-30 17:15 UTC (permalink / raw)
  To: help-gnu-emacs

I am reposting some of my questions from last Friday (plus a few more),
as I am still seeking assistance and there has been a lot of water over
the dam on this list.

Xah suggested I embrace Unicode. So I could use (prefer-coding-system
'utf-8) or the file variable: -*- coding: utf-8 -*-. Are there drawbacks
to the former? What about opening an ASCII-encoded file? Can Emacs
properly detect it, or does it come up as UTF-8? Or is there another way
to go Unicode automatically? If I embrace Unicode, should my Org-mode
files then no longer be plain text?
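
As background for the questions above: ASCII is a strict subset of UTF-8,
so a file containing only ASCII characters is byte-for-byte the same file
whether it is labelled ASCII or UTF-8, and a UTF-8 file is still plain
text. The sketch below sets the preference mentioned above and checks the
subset property; my-same-bytes-in-utf-8-p is a hypothetical helper name,
not an Emacs function.

```elisp
;; Prefer UTF-8 whenever Emacs has to guess or choose a coding system.
(prefer-coding-system 'utf-8)

(defun my-same-bytes-in-utf-8-p (string)
  "Return non-nil if STRING's UTF-8 encoding equals its character codes.
This holds exactly when STRING contains only ASCII characters."
  (equal (append (encode-coding-string string 'utf-8) nil)
         (append string nil)))

(my-same-bytes-in-utf-8-p "plain ASCII text")  ; => t
(my-same-bytes-in-utf-8-p "en dash: \u2013")   ; => nil
```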

I assume that if my Lisp library files are encoded as UTF-8, then I can
paste that UTF-8 character from the web page into my call to
(replace-string ...) in order to replace the longer dash, Unicode
U+2013, with an ASCII hyphen or double hyphen. But how does that really
work? If the Lisp file is encoded as UTF-8, then how can I put an ASCII
character in the replacement string? Or do I need to encode the hex
value of the ASCII character(s)?
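
A minimal sketch of such a replacement, assuming a small hand-picked
table: ASCII characters are encoded identically in UTF-8, so they can be
typed directly into the replacement string, and the non-ASCII characters
can be written as \uXXXX escapes so the code does not depend on how the
Lisp file itself is saved. The names my-ascii-replacements and
my-asciify-buffer are hypothetical, and the table is only an example.

```elisp
(defvar my-ascii-replacements
  '(("\u2013" . "-")    ; EN DASH
    ("\u2014" . "--")   ; EM DASH
    ("\u2018" . "'")    ; LEFT SINGLE QUOTATION MARK
    ("\u2019" . "'")    ; RIGHT SINGLE QUOTATION MARK
    ("\u201C" . "\"")   ; LEFT DOUBLE QUOTATION MARK
    ("\u201D" . "\""))  ; RIGHT DOUBLE QUOTATION MARK
  "Alist mapping non-ASCII strings to their ASCII replacements.")

(defun my-asciify-buffer ()
  "Replace the characters in `my-ascii-replacements' in the current buffer."
  (interactive)
  (save-excursion
    (dolist (pair my-ascii-replacements)
      (goto-char (point-min))
      ;; Literal (non-regexp) search; replace-match with LITERAL set.
      (while (search-forward (car pair) nil t)
        (replace-match (cdr pair) t t)))))
```

In Lisp code, search-forward with replace-match is used here instead of
replace-string, which is meant for interactive use.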

Kevin Buchs | Senior Engineer | SPPDG | 507-538-5459 |
buchs.kevin@mayo.edu
Mayo Clinic | 200 First Street SW | Rochester, MN 55905 |
http://www.mayo.edu/sppdg 



* Re: those funny non-ASCII characters
@ 2012-05-25 13:40 Buchs, Kevin
  2012-05-25 14:04 ` Eli Zaretskii
  2012-05-25 14:42 ` Jambunathan K
  0 siblings, 2 replies; 25+ messages in thread
From: Buchs, Kevin @ 2012-05-25 13:40 UTC (permalink / raw)
  To: help-gnu-emacs

Thanks, Xah and Eli, for contributing to my further understanding. I
went to the specific website where I got the content I copied and
pasted, and I can see from the HTML that it declares charset=UTF-8, so I
understand that is the 8-bit encoding of Unicode. Using C-u C-x =, I see
that the particular character I pasted has a code point of 0x2013
(U+2013). I didn't see, however, what the UTF-8 encoding of that code
point is. Should I be able to read that somewhere in the buffer of
information I get with C-u C-x =? I was poking around the
www.unicode.org website, trying to understand how this U+2013 code point
is encoded into UTF-8, but I haven't determined that yet.
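
For reference, code points from U+0800 to U+FFFF use the three-byte
UTF-8 form 1110xxxx 10xxxxxx 10xxxxxx, with the 16 bits of the code
point filling the x positions. The sketch below asks Emacs for the
encoded bytes directly; my-utf-8-bytes is a hypothetical helper name.
The C-u C-x = output normally also carries "buffer code" and "file code"
lines showing encoded bytes, though the exact labels should be treated
as an assumption for any given Emacs version.

```elisp
(defun my-utf-8-bytes (char)
  "Return the UTF-8 encoding of CHAR as a list of byte values."
  ;; encode-coding-string returns a unibyte string; append turns it
  ;; into a list of integers in the range 0..255.
  (append (encode-coding-string (string char) 'utf-8) nil))

;; U+2013 = 0010 0000 0001 0011
;;       -> 1110-0010 10-000000 10-010011
(mapcar (lambda (b) (format "%02X" b)) (my-utf-8-bytes #x2013))
;; => ("E2" "80" "93")
(mapcar (lambda (b) (format "%o" b)) (my-utf-8-bytes #x2013))
;; => ("342" "200" "223")
```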

A fresh buffer in Emacs on my Win-7 box has a coding system of
iso-latin-1-dos. The coding system used to open and save files is the
same.

So, help me piece together what happens as I paste the UTF-8 text into a
buffer. First, the paste buffer must declare that its content is UTF-8.
Emacs reads this information and inserts the text into the byte string
that makes up the buffer. Now, how does Emacs record that it was a
UTF-8 encoded character? Does it translate it into a different internal
encoding instead of just recording the 8-bit bytes transferred? Is this
internal encoding a superset of all the coding systems that Emacs
supports?
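
One way to poke at these questions from Lisp, as a sketch: decoding
turns raw bytes (a unibyte string) into characters (a multibyte string),
and the result is a single character whose value is the Unicode code
point, independent of the coding system it arrived in. The function
name my-decode-demo is hypothetical, and the string-bytes value of 3
assumes Emacs 23 or later, whose internal representation is an extension
of UTF-8.

```elisp
(defun my-decode-demo ()
  "Decode the three UTF-8 bytes of U+2013 and describe the result."
  (let* ((bytes "\342\200\223")                       ; three raw bytes
         (text  (decode-coding-string bytes 'utf-8))) ; one character
    (list (multibyte-string-p bytes)   ; nil: raw bytes, not yet decoded
          (multibyte-string-p text)    ; t: decoded characters
          (length text)                ; 1 character
          (string-bytes text)          ; 3 bytes in the internal form
          (string-to-char text))))     ; 8211, i.e. #x2013

(my-decode-demo)
;; => (nil t 1 3 8211)
```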

Now, Xah, you suggest I embrace Unicode. What does that mean? Would it
involve marking all my Lisp library files and my Org-mode files with the
file variable -*- coding: utf-8 -*-? Or is there another way to go
Unicode automatically?

I assume that if my Lisp library files are encoded as UTF-8, then I can
paste that character from the web page into my call to replace-string in
order to replace the longer dash, Unicode U+2013, with an ASCII
hyphen or double hyphen. But how does that really work? If the Lisp
file is encoded as UTF-8, then how can I put an ASCII character in the
replacement string?

I would appreciate it if someone could help me open this new door in my
brain a bit further.


-----Original Message-----
With cursor on that character, type "C-u C-x =", and Emacs will show
everything it knows about that character, including its canonical
name.



* those funny non-ASCII characters
@ 2012-05-24 23:49 Buchs, Kevin
  2012-05-25  6:36 ` Eli Zaretskii
  0 siblings, 1 reply; 25+ messages in thread
From: Buchs, Kevin @ 2012-05-24 23:49 UTC (permalink / raw)
  To: help-gnu-emacs

I often paste content from web pages into an Emacs Org-mode buffer and I
get the odd quote characters or dashes that are not ASCII. I created a
Lisp function to remove the Unicode ones that are just 8 bits. Lately I
am seeing characters that are not being caught. They show up in Emacs as
the expected character, but when I kill/yank them into Lisp code, they
are not found. When I save the buffer, I am asked for a coding system
and choose raw-text. When the file is opened again, these characters
show up as some sort of special symbol (dashed circle with a flag off
the top) followed by pairs or triples of \2xx. For example, the dash
character I just stored was this sequence: circle-flag \200 \231. Using
GNU/Linux od to dump them, I get octal strings such as: 340 245 206 340
244 206 210 200 and, for the dash mentioned above, 342 200 231.

I am very naive in regard to coding, so please excuse my ignorance. I
would guess these are 16-bit (Unicode16) characters. Can someone
enlighten me as to how I can determine what these characters are (after
they are pasted into a buffer) and how I can code a function to replace
them with ASCII equivalents? The only thing I could think of was hexl
mode, but that didn't turn out well. Thanks.
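
A sketch for identifying such a dump, assuming the values printed by od
are octal bytes of UTF-8 text: rebuild the raw bytes and decode them.
Under that assumption the sequence 342 200 231 decodes to U+2019 (RIGHT
SINGLE QUOTATION MARK), so the sample may have been a curly quote
rather than a dash; U+2013 (EN DASH) would dump as 342 200 223. The
function name my-decode-octal-dump is hypothetical.

```elisp
(defun my-decode-octal-dump (octal-string)
  "Decode OCTAL-STRING, space-separated octal byte values, as UTF-8."
  (decode-coding-string
   (apply #'unibyte-string
          (mapcar (lambda (s) (string-to-number s 8))
                  (split-string octal-string)))
   'utf-8))

(format "U+%04X" (string-to-char (my-decode-octal-dump "342 200 231")))
;; => "U+2019"
(format "U+%04X" (string-to-char (my-decode-octal-dump "342 200 223")))
;; => "U+2013"
```

Inside a buffer, C-u C-x = on the character gives the same code point
along with its name.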



