those funny non-ASCII characters


* those funny non-ASCII characters
@ 2012-05-24 23:49 Buchs, Kevin
  2012-05-25  6:36 ` Eli Zaretskii
  0 siblings, 1 reply; 25+ messages in thread
From: Buchs, Kevin @ 2012-05-24 23:49 UTC (permalink / raw)
  To: help-gnu-emacs

I often paste content from web pages into an emacs org-mode buffer and I
get the odd quote characters or dashes that are not ASCII. I created a
lisp function to remove the unicode ones that are just 8 bits. Lately I
am seeing that there are characters that are not being caught. They show
up in emacs as the expected character. When I kill/yank them into lisp
code, they are not being found. When I save the buffer, I am asked for
coding and chose raw text. When the file is opened again, these
characters are showing up as some sort of special symbol (dashed circle
with flag off the top) followed by doubles/triples of \2xx. For example,
the dash character I just stored was this sequence: circle-flag \200
\231. Using GNU/Linux od to dump them I get octal byte values such as: 340 245
206 340 244 206 210 200 and for the dash mentioned above 342 200 231. 

I am very naive in regard to coding, so please excuse my ignorance. I
would guess these are 16-bit (Unicode16) characters. Can someone
enlighten me as to how I can determine what these characters are (after
pasted into a buffer) and how I can code a function to replace them with
ASCII equivalents? The only thing I could think of was hexl mode, but
that didn't turn out well. Thanks.

Kevin Buchs | Senior Engineer | SPPDG | 507-538-5459 |
buchs.kevin@mayo.edu
Mayo Clinic | 200 First Street SW | Rochester, MN 55905 |
http://www.mayo.edu/sppdg 


* Re: those funny non-ASCII characters
       [not found] <mailman.1638.1337903381.855.help-gnu-emacs@gnu.org>
@ 2012-05-25  0:56 ` Xah Lee
  0 siblings, 0 replies; 25+ messages in thread
From: Xah Lee @ 2012-05-25  0:56 UTC (permalink / raw)
  To: help-gnu-emacs

On May 24, 4:49 pm, "Buchs, Kevin" <buchs.ke...@mayo.edu> wrote:
> I often paste content from web pages into an emacs org-mode buffer and I
> get the odd quote characters or dashes that are not ASCII. I created a
> lisp function to remove the unicode ones that are just 8 bits. Lately I
> am seeing that there are characters that are not being caught. They show
> up in emacs as the expected character. When I kill/yank them into lisp
> code, they are not being found. When I save the buffer, I am asked for
> coding and chose raw text. When the file is opened again, these
> characters are showing up as some sort of special symbol (dashed circle
> with flag off the top) followed by doubles/triples of \2xx. For example,
> the dash character I just stored was this sequence: circle-flag \200
> \231. Using GNU/Linux od to dump them I get octal byte values such as: 340 245
> 206 340 244 206 210 200 and for the dash mentioned above 342 200 231.
>
> I am very naive in regard to coding, so please excuse my ignorance. I
> would guess these are 16-bit (Unicode16) characters. Can someone
> enlighten me as to how I can determine what these characters are (after
> pasted into a buffer) and how I can code a function to replace them with
> ASCII equivalents? The only thing I could think of was hexl mode, but
> that didn't turn out well. Thanks.


better to embrace unicode than fight it.

what encoding you get when you paste is rather complex. I guess it
depends on the sources you copy from, as each web page can use a
different charset and encoding, and i'm not sure whether your OS does
some translation in the pasteboard.

maybe this will help.

〈Emacs File/Character Encoding/Decoding FAQ〉
http://xahlee.org/emacs/emacs_encoding_decoding_faq.html

〈Xah's Unicode Tutorial〉
http://xahlee.org/Periodic_dosage_dir/unicode.html

to replace non-ascii, you can use the regex

[[:nonascii:]]+

〈Char Classes - GNU Emacs Lisp Reference Manual〉
http://xahlee.org/emacs_manual/elisp/Char-Classes.html

〈Emacs Lisp: Convert Unicode String to ASCII (Zap Gremlins)〉
http://xahlee.org/emacs/emacs_zap_gremlins.html
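
As a minimal sketch of how the [[:nonascii:]] class could be used from
Lisp, the following lists which non-ASCII characters a buffer actually
contains before anything is replaced. The function name
my-list-nonascii is made up for this illustration; it is not an Emacs
built-in.

```elisp
;; Sketch: list the distinct non-ASCII characters in the current buffer.
;; `my-list-nonascii' is a hypothetical helper name, not part of Emacs.
(defun my-list-nonascii ()
  "Return a sorted list of the distinct non-ASCII characters in the buffer."
  (let (chars)
    (save-excursion
      (goto-char (point-min))
      ;; Each match is exactly one non-ASCII character.
      (while (re-search-forward "[[:nonascii:]]" nil t)
        (let ((ch (char-before)))
          (unless (memq ch chars)
            (push ch chars)))))
    (sort chars #'<)))
```

Evaluating (my-list-nonascii) with M-: returns the code points as
integers; (mapcar #'char-to-string (my-list-nonascii)) shows them as
characters instead.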

 Xah



* Re: those funny non-ASCII characters
  2012-05-24 23:49 Buchs, Kevin
@ 2012-05-25  6:36 ` Eli Zaretskii
  0 siblings, 0 replies; 25+ messages in thread
From: Eli Zaretskii @ 2012-05-25  6:36 UTC (permalink / raw)
  To: help-gnu-emacs

> Date: Thu, 24 May 2012 18:49:29 -0500
> From: "Buchs, Kevin" <buchs.kevin@mayo.edu>
> 
> I am very naive in regard to coding, so please excuse my ignorance. I
> would guess these are 16-bit (Unicode16) characters. Can someone
> enlighten me as to how I can determine what these characters are (after
> pasted into a buffer)

With cursor on that character, type "C-u C-x =", and Emacs will show
everything it knows about that character, including its canonical
name.
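
The same lookup can be sketched in Lisp for bytes that were already
saved as raw text. This assumes the three octal values from the od dump
in the first message (342 200 231) are UTF-8; under that assumption
they should decode to U+2019, a right single quotation mark rather than
a dash.

```elisp
;; The three bytes that od printed in octal: 342 200 231.
(let* ((raw "\342\200\231")                      ; unibyte string of raw bytes
       (text (decode-coding-string raw 'utf-8))  ; decodes to one character
       (ch (aref text 0)))
  (list (format "U+%04X" ch)
        (get-char-code-property ch 'name)))
;; => ("U+2019" "RIGHT SINGLE QUOTATION MARK")
```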




* Re: those funny non-ASCII characters
@ 2012-05-25 13:40 Buchs, Kevin
  2012-05-25 14:04 ` Eli Zaretskii
  2012-05-25 14:42 ` Jambunathan K
  0 siblings, 2 replies; 25+ messages in thread
From: Buchs, Kevin @ 2012-05-25 13:40 UTC (permalink / raw)
  To: help-gnu-emacs

Thanks, Xah and Eli, for contributing to my further understanding. I
went to a specific website where I got the content I copied and pasted
and I can see from the HTML that it has a charset=UTF-8, so I understand
that is Unicode 8-bit. Using the C-u C-x =, I see that the particular
character I pasted has a code point of 0x2013 (U+2013). I didn't see,
however, what the UTF-8 encoding of that code point was. Should I be
able to read that somewhere on the buffer of information I get with C-u
C-x = ? I was poking around the www.unicode.org website, trying to
understand how this U+2013 code point is encoded into UTF-8, but I
haven't determined that yet.

A fresh buffer in emacs for me on my Win-7 box has an encoding system of
iso-latin-1-dos. The coding system used to open and save files is the
same.

So, help me piece together what happens as I paste the UTF-8 text into a
buffer. First, the paste buffer must define that it is in UTF-8. Emacs
reads this information and inserts it into the byte string that defines
the buffer. Now, how does emacs record that it was a UTF-8 encoded
character? Does it translate it into a different internal encoding
instead of just recording the 8 bits transferred? Is this encoding used
as a superset of all possible encoding systems that emacs supports?

Now,  Xah, you suggest I embrace Unicode. What does that mean? Would it
involve marking all my lisp library files and my org-mode files with the
file variable -*- coding: utf-8 -*- ? Or is there another way to go
Unicode automatically? 

I assume that if my lisp library files are encoded utf-8, then I can
paste that character from the web page into my call to replace-string in
order to substitute the longer dash of Unicode U+2013 with an ascii
hyphen or double hyphen. But, how does that really work? If the lisp
file is encoded utf-8, then how can I put an ascii character in the
replacement string?

I would appreciate it if someone could help me open this new door in my
brain a bit further.

Kevin Buchs | Senior Engineer | SPPDG | 507-538-5459 |
buchs.kevin@mayo.edu
Mayo Clinic | 200 First Street SW | Rochester, MN 55905 |
http://www.mayo.edu/sppdg 


* Re: those funny non-ASCII characters
  2012-05-25 13:40 Buchs, Kevin
@ 2012-05-25 14:04 ` Eli Zaretskii
  2012-05-25 14:42 ` Jambunathan K
  1 sibling, 0 replies; 25+ messages in thread
From: Eli Zaretskii @ 2012-05-25 14:04 UTC (permalink / raw)
  To: help-gnu-emacs

> Date: Fri, 25 May 2012 08:40:25 -0500
> From: "Buchs, Kevin" <buchs.kevin@mayo.edu>
> 
> Thanks, Xah and Eli, for contributing to my further understanding. I
> went to a specific website where I got the content I copied and pasted
> and I can see from the HTML that it has a charset=UTF-8, so I understand
> that is Unicode 8-bit. Using the C-u C-x =, I see that the particular
> character I pasted has a code point of 0x2013 (U+2013). I didn't see,
> however, what the UTF-8 encoding of that code point was. Should I be
> able to read that somewhere on the buffer of information I get with C-u
> C-x = ?

Yes, this part of "C-u C-x ="'s display:

            file code: #xE2 #x80 #x93 (encoded by coding system utf-8-dos)

shows you how it would be encoded in UTF-8.  If you see something like
"not encodable by ...", then you need to set the buffer's encoding
using "C-x RET f".  Under "file code", Emacs shows how the character
would be encoded if the buffer is saved to a disk file or sent to
another program or as an email message.
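
The "file code" bytes can also be obtained from Lisp. A small sketch,
using the code point U+2013 mentioned in the question:

```elisp
;; Encode the single character U+2013 as UTF-8 and print each byte in hex.
(let ((bytes (encode-coding-string (string #x2013) 'utf-8)))
  (mapconcat (lambda (b) (format "#x%02X" b)) bytes " "))
;; => "#xE2 #x80 #x93"
```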

> I was poking around the www.unicode.org website, trying to
> understand how this U+2013 code point is encoded into UTF-8, but I
> haven't determined that yet.

See above: Emacs shows this under the right circumstances.
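
For the arithmetic behind those three bytes, here is a sketch of the
UTF-8 rule for code points in the range U+0800..U+FFFF, which is where
U+2013 falls: the 16 bits are split 4/6/6 and placed into the pattern
1110xxxx 10xxxxxx 10xxxxxx. The function name my-utf8-bytes-3 is
invented for this example, and it deliberately covers only the
three-byte case.

```elisp
;; Hypothetical helper: UTF-8 bytes for a code point in U+0800..U+FFFF
;; (surrogates excluded).  Other ranges use 1, 2 or 4 bytes.
(defun my-utf8-bytes-3 (cp)
  "Return the three UTF-8 bytes encoding code point CP as a list."
  (list (logior #xE0 (ash cp -12))                 ; 1110xxxx: top 4 bits
        (logior #x80 (logand (ash cp -6) #x3F))    ; 10xxxxxx: middle 6 bits
        (logior #x80 (logand cp #x3F))))           ; 10xxxxxx: low 6 bits

(my-utf8-bytes-3 #x2013)
;; => (226 128 147), that is #xE2 #x80 #x93
```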

> So, help me piece together what happens as I paste the UTF-8 text into a
> buffer. First, the paste buffer must define that it is in UTF-8.

On Windows, Emacs always uses UTF-16 to pass text via the clipboard,
because doing so lets Emacs copy and paste any character from any
character set on Earth.

> Emacs reads this information and inserts it into the byte string
> that defines the buffer. Now, how does emacs record that it was a
> UTF-8 encoded character?

It doesn't.  What it records is the encoding to be used for the
current buffer if it is saved to disk or sent to some program.  That
encoding is a property of the buffer, not of the characters.
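
A small sketch of that distinction: the coding system lives in the
buffer-local variable buffer-file-coding-system, and changing it leaves
the characters in the buffer untouched. The temporary buffer here is
only for illustration.

```elisp
(with-temp-buffer
  (insert "dash: " (string #x2013))
  (let ((before (buffer-string)))
    ;; Change only how the buffer would be written out; the second
    ;; argument forces exactly this coding system.
    (set-buffer-file-coding-system 'utf-8-unix t)
    (list buffer-file-coding-system
          (equal before (buffer-string)))))
;; => (utf-8-unix t)
```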

> Does it translate it into a different internal encoding

Yes, it does.

> Is this encoding used
> as a superset of all possible encoding systems that emacs supports?

Yes.  See the section "Text Representations" in the ELisp manual that
comes with Emacs, you will find the details there.
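
One way to see the internal representation at work, as a rough sketch:
a string holds characters, and the number of bytes Emacs uses for them
internally is a separate quantity.

```elisp
(let ((s (string ?a #x2013)))
  (list (length s)               ; number of characters
        (string-bytes s)         ; bytes in the internal representation
        (multibyte-string-p s)))
;; => (2 4 t)
```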


* Re: those funny non-ASCII characters
  2012-05-25 13:40 Buchs, Kevin
  2012-05-25 14:04 ` Eli Zaretskii
@ 2012-05-25 14:42 ` Jambunathan K
  1 sibling, 0 replies; 25+ messages in thread
From: Jambunathan K @ 2012-05-25 14:42 UTC (permalink / raw)
  To: Buchs, Kevin; +Cc: help-gnu-emacs


I think this will help.

  (prefer-coding-system 'utf-8)

-- 




* Re: those funny non-ASCII characters
       [not found] <mailman.1665.1337953237.855.help-gnu-emacs@gnu.org>
@ 2012-05-25 18:33 ` Xah Lee
  0 siblings, 0 replies; 25+ messages in thread
From: Xah Lee @ 2012-05-25 18:33 UTC (permalink / raw)
  To: help-gnu-emacs


hope Eli answered all your questions.

here's some addition.

• embrace unicode, because it's only going to become more common.
Many programming languages and formats default to unicode by spec (e.g.
html/css/JavaScript, Java, Haskell, …). Most OSes (Windows, Mac) and
file systems default to unicode encodings now (not sure about linux).
Even emacs, starting with emacs 23, uses unicode as its default
internal encoding.

〈Unicode Popularity on Web by Google〉
http://xahlee.org/comp/unicode_on_web.html

• Unicode is about 2 things: ① a char set with an integer ID for each
char. ② several encodings for the char set, the most popular being
utf-8 and utf-16 (the latter is the default on Mac and Windows). (an
encoding is a standard that maps each char of a char set to a byte
sequence)

• in emacs, just put this in your init:
(set-language-environment "UTF-8")

that should set all encodings to utf-8, and shouldn't cause you any
problem if all your current files and elisp files are ascii, because
ascii is a compatible subset of utf-8/unicode.

• in emacs, call describe-char. That'll show the current char's
encoding as well as byte sequence used for that particular encoding.
(this is emacs 24. Emacs 23 may not show the byte sequence... i don't
recall.)
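
The init-file suggestions made in this thread could be collected as in
the fragment below. This is only a sketch; whether both lines are
needed depends on the setup, and the file name (~/.emacs or init.el) is
an assumption.

```elisp
;; Possible init-file lines; adjust to taste.
(set-language-environment "UTF-8")   ; suggested in the message above
(prefer-coding-system 'utf-8)        ; suggested earlier in the thread

;; A per-file alternative asked about in the thread is a cookie on the
;; first line of the file itself, for example in a Lisp file:
;;   ;; -*- coding: utf-8 -*-
```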

my unicode tutorial covers all these… feel free to ask me, or here, of
course.

 Xah



* RE: those funny non-ASCII characters
@ 2012-05-30 17:15 Buchs, Kevin
  2012-05-31  7:17 ` Thien-Thi Nguyen
  2012-05-31 15:59 ` PJ Weisberg
  0 siblings, 2 replies; 25+ messages in thread
From: Buchs, Kevin @ 2012-05-30 17:15 UTC (permalink / raw)
  To: help-gnu-emacs

I am reposting some of my questions from last Friday (plus a few more),
as I am still seeking assistance and there has been a lot of water over
the dam on this list.

Xah suggested I embrace Unicode. So I could use (prefer-coding-system
'utf-8) or the file variable: -*- coding: utf-8 -*-. Are there drawbacks
to the former? What about opening an ASCII coded file? Can emacs
properly detect it or does it come up as UTF-8? Or is there another way
to go Unicode automatically? If I embrace Unicode, then should I make my
Org-mode files no longer plain text?

I assume that if my lisp library files are encoded utf-8, then I can
paste that UTF-8 character from the web page into my call to
(replace-string ...) in order to substitute the longer dash of Unicode
U+2013 with an ASCII hyphen or double hyphen. But, how does that really
work? If the lisp file is encoded utf-8, then how can I put an ASCII
character in the replacement string? Or do I need to encode the hex
value of the ASCII character(s)?

Kevin Buchs | Senior Engineer | SPPDG | 507-538-5459 |
buchs.kevin@mayo.edu
Mayo Clinic | 200 First Street SW | Rochester, MN 55905 |
http://www.mayo.edu/sppdg 


* Re: those funny non-ASCII characters
  2012-05-30 17:15 those funny non-ASCII characters Buchs, Kevin
@ 2012-05-31  7:17 ` Thien-Thi Nguyen
  2012-05-31 14:57   ` Buchs, Kevin
  2012-05-31 15:59 ` PJ Weisberg
  1 sibling, 1 reply; 25+ messages in thread
From: Thien-Thi Nguyen @ 2012-05-31  7:17 UTC (permalink / raw)
  To: Buchs, Kevin; +Cc: help-gnu-emacs

() "Buchs, Kevin" <buchs.kevin@mayo.edu>
() Wed, 30 May 2012 12:15:11 -0500

   I am reposting some of my questions from last Friday (plus a few more),
   as I am still seeking assistance and there has been a lot of water over
   the dam on this list.

Does this mean you are ignoring the previous responses?


* RE: those funny non-ASCII characters
  2012-05-31  7:17 ` Thien-Thi Nguyen
@ 2012-05-31 14:57   ` Buchs, Kevin
  2012-05-31 16:40     ` Thien-Thi Nguyen
  0 siblings, 1 reply; 25+ messages in thread
From: Buchs, Kevin @ 2012-05-31 14:57 UTC (permalink / raw)
  To: Thien-Thi Nguyen; +Cc: help-gnu-emacs

> Does this mean you are ignoring the previous responses?

Thien-Thi,

I did not intend to ignore any prior responses. I apologize if I have
missed some. I noted responses from Xah Lee, Eli Zaretskii and
Jambunathan. There was one other, for which I did not record the name.
Have I missed more? Please let me know if I have. I note that I get the
digests of this list.

My reason for reposting is that I did not have the answers to all the
questions I originally asked AND I had some additional questions. Did
you feel like the questions I reposted were in fact answered? If so,
perhaps I misunderstood.

Kevin Buchs | Senior Engineer | SPPDG | 507-538-5459 |
buchs.kevin@mayo.edu
Mayo Clinic | 200 First Street SW | Rochester, MN 55905 |
http://www.mayo.edu/sppdg 


* Re: those funny non-ASCII characters
  2012-05-30 17:15 those funny non-ASCII characters Buchs, Kevin
  2012-05-31  7:17 ` Thien-Thi Nguyen
@ 2012-05-31 15:59 ` PJ Weisberg
  1 sibling, 0 replies; 25+ messages in thread
From: PJ Weisberg @ 2012-05-31 15:59 UTC (permalink / raw)
  To: Buchs, Kevin; +Cc: help-gnu-emacs@gnu.org


On Wednesday, May 30, 2012, Buchs, Kevin <buchs.kevin@mayo.edu> wrote:

> What about opening an ASCII coded file? Can emacs
> properly detect it or does it come up as UTF-8?

Emacs attempts to determine the correct coding system when it opens a file,
so you shouldn't have to worry about this.

The 128 characters that make up ASCII have the exact same representation in
UTF-8.  "Converting" an ASCII file to UTF-8 is a no-op.  Therefore,
treating an ASCII file as UTF-8 should cause no problems.
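
This can be checked from Lisp. A minimal sketch comparing the bytes
produced for an ASCII-only string under both coding systems:

```elisp
(let ((s "plain ASCII - text"))
  (equal (encode-coding-string s 'us-ascii)
         (encode-coding-string s 'utf-8)))
;; => t
```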

> I assume that if my lisp library files are encoded utf-8, then I can
> paste that UTF-8 character from the web page into my call to
> (replace-string ...) in order to substitute the longer dash of Unicode
> U+2013 with an ASCII hyphen or double hyphen. But, how does that really
> work? If the lisp file is encoded utf-8, then how can I put an ASCII
> character in the replacement string? Or do I need to encode the hex
> value of the ASCII character(s)?

A = A.  The hyphen-minus is a hyphen-minus whether it's in an ASCII file as
00101101 or a UTF-16 file as 0000000000101101.  So, just type it with your
keyboard.
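
A sketch of what such a replacement function might look like. The name
my-asciify-region and the choice of characters in the table are
assumptions made for this example, not something posted in the thread.
The non-ASCII characters are written as \u escapes, so the Lisp file
itself can stay pure ASCII, and the replacements are typed as ordinary
ASCII.

```elisp
;; Hypothetical helper; extend the table as needed.
(defun my-asciify-region (beg end)
  "Replace some common non-ASCII punctuation between BEG and END with ASCII."
  (interactive "r")
  (let ((table '(("\u2013" . "-")     ; EN DASH
                 ("\u2014" . "--")    ; EM DASH
                 ("\u2018" . "'")     ; LEFT SINGLE QUOTATION MARK
                 ("\u2019" . "'")     ; RIGHT SINGLE QUOTATION MARK
                 ("\u201C" . "\"")    ; LEFT DOUBLE QUOTATION MARK
                 ("\u201D" . "\"")))  ; RIGHT DOUBLE QUOTATION MARK
        ;; A marker keeps track of END while replacements change the
        ;; length of the text.
        (end-marker (copy-marker end)))
    (save-excursion
      (dolist (pair table)
        (goto-char beg)
        (while (search-forward (car pair) end-marker t)
          (replace-match (cdr pair) t t))))
    (set-marker end-marker nil)))
```

After selecting the pasted text, M-x my-asciify-region applies the
table to the region.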

BTW, I don't know how Xah intended it, but when he said to "embrace
unicode," I interpreted it to mean, "Why don't you just leave em-dashes as
em-dashes instead of replacing them with two hyphen-minuses?"

-- 
-PJ

Gehm's Corollary to Clark's Law: Any technology distinguishable from
magic is insufficiently advanced.


* Re: those funny non-ASCII characters
  2012-05-31 14:57   ` Buchs, Kevin
@ 2012-05-31 16:40     ` Thien-Thi Nguyen
  2012-05-31 16:56       ` Buchs, Kevin
  0 siblings, 1 reply; 25+ messages in thread
From: Thien-Thi Nguyen @ 2012-05-31 16:40 UTC (permalink / raw)
  To: Buchs, Kevin; +Cc: help-gnu-emacs

() "Buchs, Kevin" <buchs.kevin@mayo.edu>
() Thu, 31 May 2012 09:57:37 -0500

   Did you feel like the questions I reposted were in fact answered?
   If so, perhaps I misunderstood.

I am simply ignorant of "water over the dam".
My mistake was not asking it directly:
What do you mean by "water over the dam"?


* RE: those funny non-ASCII characters
  2012-05-31 16:40     ` Thien-Thi Nguyen
@ 2012-05-31 16:56       ` Buchs, Kevin
  2012-05-31 21:46         ` Thien-Thi Nguyen
       [not found]         ` <mailman.2041.1338500734.855.help-gnu-emacs@gnu.org>
  0 siblings, 2 replies; 25+ messages in thread
From: Buchs, Kevin @ 2012-05-31 16:56 UTC (permalink / raw)
  To: Thien-Thi Nguyen, help-gnu-emacs

> I am simply ignorant of "water over the dam".
> My mistake was not asking it directly:
> What do you mean by "water over the dam"?

Thien-Thi,

No problem. It is an expression meaning, in general, that lots of events
have come and gone, and are now passed. It is an analogy to the flowing
of water over a dam in a river, in the sense that once water flows up
over a dam, it is going downstream and has passed the reservoir behind
the dam and presumably passed your field of view. 

In this specific instance I was referring to a large number of messages
having been posted to the email list by several people discussing a
topic that Xah Lee brought up. So, the busy-ness of the list made me
think that perhaps there were some people who were going to reply, but
the number of messages coming from the list got to be so large that they
just deleted them including my message or lost my message in inbox
clutter. 

I could have applied the analogy of "water over the dam" even further to
say that: "though there were many messages posted on this list that are
now water over the dam, I would like to bring my message back to allow
it to float to the top again."

Kevin Buchs | Senior Engineer | SPPDG | 507-538-5459 |
buchs.kevin@mayo.edu
Mayo Clinic | 200 First Street SW | Rochester, MN 55905 |
http://www.mayo.edu/sppdg 


* Re: those funny non-ASCII characters
  2012-05-31 16:56       ` Buchs, Kevin
@ 2012-05-31 21:46         ` Thien-Thi Nguyen
  2012-06-01 13:36           ` Doug Lewan
       [not found]         ` <mailman.2041.1338500734.855.help-gnu-emacs@gnu.org>
  1 sibling, 1 reply; 25+ messages in thread
From: Thien-Thi Nguyen @ 2012-05-31 21:46 UTC (permalink / raw)
  To: Buchs, Kevin; +Cc: help-gnu-emacs

() "Buchs, Kevin" <buchs.kevin@mayo.edu>
() Thu, 31 May 2012 11:56:45 -0500

   [...] once water flows up over a dam, it is going downstream
   and has passed the reservoir behind the dam and presumably
   passed your field of view.

   In this specific instance [...]

OK, thanks.  Now i understand.

   I could have applied the analogy of "water over the dam" even
   further to say that: "though there were many messages posted on
   this list that are now water over the dam, I would like to
   bring my message back to allow it to float to the top again."

The flow of messages is indeed like water.

I suppose everyone relates to this in their own way.

Using GNUS (now Gnus) to read these, i imagine myself an insect
buzzing around an upward turned flow (a geiser), first in summer
when the molecules dissociate quickly, then (later, always later)
in winter when they crystalize shard-like and treed, sometimes
under a brilliant sun refracted as rainbows, sometimes under a
brilliant moon that ghostly glows, sometimes in darkness lit only
by lucky grep rows.  A drip gleaned here and there for sustenance,
a drop left there and here for assonance, the rest left to what
entropy can penetrate the disks of gmane.

Anyway, Unicode is ASCII-compatible, so probably if you wrangle
your environment to Unicode by default, Emacs will also DTRT.
Check out <http://www.utf8everywhere.org>.  Yes, it does touch
upon topics best avoided in polite company, but oh well...


* Re: those funny non-ASCII characters
       [not found]         ` <mailman.2041.1338500734.855.help-gnu-emacs@gnu.org>
@ 2012-06-01  2:42           ` rusi
  0 siblings, 0 replies; 25+ messages in thread
From: rusi @ 2012-06-01  2:42 UTC (permalink / raw)
  To: help-gnu-emacs

On Jun 1, 2:46 am, Thien-Thi Nguyen <t...@gnuvola.org> wrote:
> Anyway, Unicode is ASCII-compatible, so probably if you wrangle
> your environment to Unicode by default, Emacs will also DTRT.
> Check out <http://www.utf8everywhere.org>.

Thanks very useful

> Yes, it does touch
> upon topics best avoided in polite company, but oh well...

??



* Re: those funny non-ASCII characters
       [not found] <mailman.1961.1338398127.855.help-gnu-emacs@gnu.org>
@ 2012-06-01  4:23 ` Jason Rumney
  2012-06-01  5:43   ` rusi
  0 siblings, 1 reply; 25+ messages in thread
From: Jason Rumney @ 2012-06-01  4:23 UTC (permalink / raw)
  To: help-gnu-emacs; +Cc: help-gnu-emacs

On Thursday, 31 May 2012 01:15:11 UTC+8, Buchs, Kevin  wrote:

> Xah suggested I embrace Unicode. So I could use (prefer-coding-system
> 'utf-8) or the file variable: -*- coding: utf-8 -*-. Are there drawbacks
> to the former? What about opening an ASCII coded file? Can emacs
> properly detect it or does it come up as UTF-8?

ASCII is a subset of UTF-8, so the problem you are imagining does not exist.




* Re: those funny non-ASCII characters
  2012-06-01  4:23 ` Jason Rumney
@ 2012-06-01  5:43   ` rusi
  2012-06-01  6:12     ` Eli Zaretskii
  2012-06-01  7:03     ` Xah Lee
  0 siblings, 2 replies; 25+ messages in thread
From: rusi @ 2012-06-01  5:43 UTC (permalink / raw)
  To: help-gnu-emacs

On Jun 1, 9:23 am, Jason Rumney <jasonrum...@gmail.com> wrote:
> On Thursday, 31 May 2012 01:15:11 UTC+8, Buchs, Kevin  wrote:
> > Xah suggested I embrace Unicode. So I could use (prefer-coding-system
> > 'utf-8) or the file variable: -*- coding: utf-8 -*-. Are there drawbacks
> > to the former? What about opening an ASCII coded file? Can emacs
> > properly detect it or does it come up as UTF-8?
>
> ASCII is a subset of UTF-8, so the problem you are imagining does not exist.

This does not exactly work that way on windows.
eg recently saw a description of how notepad put a BOM mark in a
haskell-script which made the haskell scripts unrunnable



* Re: those funny non-ASCII characters
  2012-06-01  5:43   ` rusi
@ 2012-06-01  6:12     ` Eli Zaretskii
  2012-06-01  7:03     ` Xah Lee
  1 sibling, 0 replies; 25+ messages in thread
From: Eli Zaretskii @ 2012-06-01  6:12 UTC (permalink / raw)
  To: help-gnu-emacs

> From: rusi <rustompmody@gmail.com>
> Date: Thu, 31 May 2012 22:43:07 -0700 (PDT)
> 
> On Jun 1, 9:23 am, Jason Rumney <jasonrum...@gmail.com> wrote:
> > On Thursday, 31 May 2012 01:15:11 UTC+8, Buchs, Kevin  wrote:
> > > Xah suggested I embrace Unicode. So I could use (prefer-coding-system
> > > 'utf-8) or the file variable: -*- coding: utf-8 -*-. Are there drawbacks
> > > to the former? What about opening an ASCII coded file? Can emacs
> > > properly detect it or does it come up as UTF-8?
> >
> > ASCII is a subset of UTF-8, so the problem you are imagining does not exist.
> 
> This does not exactly work that way on windows.
> eg recently saw a description of how notepad put a BOM mark in a
> haskell-script which made the haskell scripts unrunnable

We are talking about Emacs, not about Notepad, so it's unclear to me
how what Notepad does is relevant to the OP's question.





* Re: those funny non-ASCII characters
  2012-06-01  5:43   ` rusi
  2012-06-01  6:12     ` Eli Zaretskii
@ 2012-06-01  7:03     ` Xah Lee
  2012-06-01 16:26       ` rusi
  1 sibling, 1 reply; 25+ messages in thread
From: Xah Lee @ 2012-06-01  7:03 UTC (permalink / raw)
  To: help-gnu-emacs

On May 31, 10:43 pm, rusi <rustompm...@gmail.com> wrote:
> On Jun 1, 9:23 am, Jason Rumney <jasonrum...@gmail.com> wrote:
>
> > On Thursday, 31 May 2012 01:15:11 UTC+8, Buchs, Kevin  wrote:
> > > Xah suggested I embrace Unicode. So I could use (prefer-coding-system
> > > 'utf-8) or the file variable: -*- coding: utf-8 -*-. Are there drawbacks
> > > to the former? What about opening an ASCII coded file? Can emacs
> > > properly detect it or does it come up as UTF-8?
>
> > ASCII is a subset of UTF-8, so the problem you are imagining does not exist.
>
> This does not exactly work that way on windows.
> eg recently saw a description of how notepad put a BOM mark in a
> haskell-script which made the haskell scripts unrunnable

haskell compiler probably should bear the blame. Last i read (~4 years
ago), the lang spec says source code should be unicode (i forgot if it
specified an encoding), however, no haskell compiler at the time
supported it. If your lang spec says unicode, you have to support BOM
mark.

〈Unicode BOM Byte Order Mark Hack〉
http://xahlee.org/comp/unicode_BOM_byte_orde_mark.html

http://www.unicode.org/faq/utf_bom.html#bom1

 Xah



* RE: those funny non-ASCII characters
  2012-05-31 21:46         ` Thien-Thi Nguyen
@ 2012-06-01 13:36           ` Doug Lewan
  0 siblings, 0 replies; 25+ messages in thread
From: Doug Lewan @ 2012-06-01 13:36 UTC (permalink / raw)
  To: Thien-Thi Nguyen, help-gnu-emacs@gnu.org

Thanks for the UTF-8 pointer. I never appreciated just how complex this is.

When you get people involved in software it just sucks and it shouldn't.

> -----Original Message-----
> From: help-gnu-emacs-bounces+dougl=shubertticketing.com@gnu.org
> [mailto:help-gnu-emacs-bounces+dougl=shubertticketing.com@gnu.org] On
> Behalf Of Thien-Thi Nguyen
> Sent: Thursday, 2012 May 31 17:46
> To: Buchs, Kevin
> Cc: help-gnu-emacs@gnu.org
> Subject: Re: those funny non-ASCII characters
> Anyway, Unicode is ASCII-compatible, so probably if you wrangle
> your environment to Unicode by default, Emacs will also DTRT.
> Check out <http://www.utf8everywhere.org>.  Yes, it does touch
> upon topics best avoided in polite company, but oh well...
> 





* Re: those funny non-ASCII characters
  2012-06-01  7:03     ` Xah Lee
@ 2012-06-01 16:26       ` rusi
  2012-06-01 21:06         ` Xah Lee
  0 siblings, 1 reply; 25+ messages in thread
From: rusi @ 2012-06-01 16:26 UTC (permalink / raw)
  To: help-gnu-emacs

On Jun 1, 12:03 pm, Xah Lee <xah...@gmail.com> wrote:
> On May 31, 10:43 pm, rusi <rustompm...@gmail.com> wrote:
>
> > On Jun 1, 9:23 am, Jason Rumney <jasonrum...@gmail.com> wrote:
>
> > > On Thursday, 31 May 2012 01:15:11 UTC+8, Buchs, Kevin  wrote:
> > > > Xah suggested I embrace Unicode. So I could use (prefer-coding-system
> > > > 'utf-8) or the file variable: -*- coding: utf-8 -*-. Are there drawbacks
> > > > to the former? What about opening an ASCII coded file? Can emacs
> > > > properly detect it or does it come up as UTF-8?
>
> > > ASCII is a subset of UTF-8, so the problem you are imagining does not exist.
>
> > This does not exactly work that way on windows.
> > eg recently saw a description of how notepad put a BOM mark in a
> > haskell-script which made the haskell scripts unrunnable
>
> haskell compiler probably should bear the blame. Last i read (~4 years
> ago), the lang spec says source code should be unicode (i forgot if it
> specified a encoding), however, no haskell compiler at the time
> supports it. If your lang spec says unicode, you have to support BOM
> mark.
>
> 〈Unicode BOM Byte Order Mark Hack〉 http://xahlee.org/comp/unicode_BOM_byte_orde_mark.html
>
> http://www.unicode.org/faq/utf_bom.html#bom1
>
>  Xah

See http://www.unicode.org/versions/Unicode5.0.0/ch02.pdf
(pg 36) "Use of a BOM is neither required nor recommended for UTF-8,
but may
be encountered in contexts where UTF-8 data is converted from other
encoding forms..."

More specifically the non-recommendation of bom: http://www.unicode.org/faq/utf_bom.html
"Note that some recipients of UTF-8 encoded data do not expect a BOM.
Where UTF-8 is used transparently in 8-bit environments, the use of a
BOM will interfere with any protocol or file format that expects
specific ASCII characters at the beginning, such as the use of "#!" of
at the beginning of Unix shell scripts. "
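
To make the "#!" point concrete, here is a minimal sketch in Python (an
arbitrary choice of language; the helper names starts_with_shebang and
strip_utf8_bom are made up for illustration and come from no library).
It assumes a tiny shell script saved once without and once with a UTF-8
BOM, and only inspects the leading bytes the way a kernel looking for
"#!" would.

```python
import codecs

SCRIPT = "#!/bin/sh\necho hello\n"

# Saved without a BOM: pure ASCII text is byte-for-byte identical in UTF-8.
plain = SCRIPT.encode("utf-8")

# Saved "with signature": Python's utf-8-sig codec prepends the BOM bytes.
with_bom = SCRIPT.encode("utf-8-sig")


def starts_with_shebang(data: bytes) -> bool:
    """Return True when the first two bytes are the ASCII characters '#!'."""
    return data[:2] == b"#!"


def strip_utf8_bom(data: bytes) -> bytes:
    """Drop a leading UTF-8 BOM (EF BB BF) if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


print(plain == SCRIPT.encode("ascii"))               # ASCII is a subset of UTF-8
print(with_bom[:3].hex(), starts_with_shebang(with_bom))
print(starts_with_shebang(strip_utf8_bom(with_bom)))
```

The BOM-prefixed copy no longer begins with "#!", which is the
interference the FAQ describes; stripping the three bytes restores it.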



* Re: those funny non-ASCII characters
  2012-06-01 16:26       ` rusi
@ 2012-06-01 21:06         ` Xah Lee
  2012-06-02  3:17           ` rusi
  0 siblings, 1 reply; 25+ messages in thread
From: Xah Lee @ 2012-06-01 21:06 UTC (permalink / raw)
  To: help-gnu-emacs

Xah wrote
> > 〈Unicode BOM Byte Order Mark Hack〉 http://xahlee.org/comp/unicode_BOM_byte_orde_mark.html
>
> > http://www.unicode.org/faq/utf_bom.html#bom1

On Jun 1, 9:26 am, rusi <rustompm...@gmail.com> wrote:

> See http://www.unicode.org/versions/Unicode5.0.0/ch02.pdf
> (pg 36) "Use of a BOM is neither required nor recommended for UTF-8,
> but may
> be encountered in contexts where UTF-8 data is converted from other
> encoding forms..."
>
> More specifically the non-recommendation of bom: http://www.unicode.org/faq/utf_bom.html
> "Note that some recipients of UTF-8 encoded data do not expect a BOM.
> Where UTF-8 is used transparently in 8-bit environments, the use of a
> BOM will interfere with any protocol or file format that expects
> specific ASCII characters at the beginning, such as the use of "#!" of
> at the beginning of Unix shell scripts. "

didn't i mention these 2 points exactly in the link i gave??

 Xah



* Re: those funny non-ASCII characters
  2012-06-01 21:06         ` Xah Lee
@ 2012-06-02  3:17           ` rusi
  2012-06-02 11:54             ` Xah Lee
  0 siblings, 1 reply; 25+ messages in thread
From: rusi @ 2012-06-02  3:17 UTC (permalink / raw)
  To: help-gnu-emacs

On Jun 2, 2:06 am, Xah Lee <xah...@gmail.com> wrote:
> Xah wrote
>
> > > 〈Unicode BOM Byte Order Mark Hack〉 http://xahlee.org/comp/unicode_BOM_byte_orde_mark.html
>
> > > http://www.unicode.org/faq/utf_bom.html#bom1
>
> On Jun 1, 9:26 am, rusi <rustompm...@gmail.com> wrote:
>
> > See http://www.unicode.org/versions/Unicode5.0.0/ch02.pdf
> > (pg 36) "Use of a BOM is neither required nor recommended for UTF-8,
> > but may
> > be encountered in contexts where UTF-8 data is converted from other
> > encoding forms..."
>
> > More specifically the non-recommendation of bom: http://www.unicode.org/faq/utf_bom.html
> > "Note that some recipients of UTF-8 encoded data do not expect a BOM.
> > Where UTF-8 is used transparently in 8-bit environments, the use of a
> > BOM will interfere with any protocol or file format that expects
> > specific ASCII characters at the beginning, such as the use of "#!" of
> > at the beginning of Unix shell scripts. "
>
> didn't i mention these 2 points exactly in the link i gave??

Yeah your own link says this: (as you know I often use and quote your
unicode pages :-) )

- In unix-like OSes, BOM for utf-8 conflicts with the Shebang (Unix)
hack.
- Many Window software add BOM to utf-8 files, e.g. Notepad.

But you also say

> If your lang spec says unicode, you have to support BOM mark

So I am not clear what your stand is...

Let me make my own position clear:
The de jure unicode standard is set by the unicode consortium (or
whatever it's called)
The de facto standard is set by microsoft and java
The two conflict



* Re: those funny non-ASCII characters
  2012-06-02  3:17           ` rusi
@ 2012-06-02 11:54             ` Xah Lee
  2012-06-02 14:10               ` Xah Lee
  0 siblings, 1 reply; 25+ messages in thread
From: Xah Lee @ 2012-06-02 11:54 UTC (permalink / raw)
  To: help-gnu-emacs

On Jun 1, 8:17 pm, rusi <rustompm...@gmail.com> wrote:
> On Jun 2, 2:06 am, Xah Lee <xah...@gmail.com> wrote:
>
>
> > Xah wrote
>
> > > > 〈Unicode BOM Byte Order Mark Hack〉 http://xahlee.org/comp/unicode_BOM_byte_orde_mark.html
>
> > > > http://www.unicode.org/faq/utf_bom.html#bom1
>
> > On Jun 1, 9:26 am, rusi <rustompm...@gmail.com> wrote:
>
> > > See http://www.unicode.org/versions/Unicode5.0.0/ch02.pdf
> > > (pg 36) "Use of a BOM is neither required nor recommended for UTF-8,
> > > but may
> > > be encountered in contexts where UTF-8 data is converted from other
> > > encoding forms..."
>
> > > More specifically the non-recommendation of bom: http://www.unicode.org/faq/utf_bom.html
> > > "Note that some recipients of UTF-8 encoded data do not expect a BOM.
> > > Where UTF-8 is used transparently in 8-bit environments, the use of a
> > > BOM will interfere with any protocol or file format that expects
> > > specific ASCII characters at the beginning, such as the use of "#!" of
> > > at the beginning of Unix shell scripts. "
>
> > didn't i mention these 2 points exactly in the link i gave??
>
> Yeah your own link says this: (as you know I often use and quote your
> unicode pages :-) )
>
> - In unix-like OSes, BOM for utf-8 conflicts with the Shebang (Unix)
> hack.
> - Many Window software add BOM to utf-8 files, e.g. Notepad.
>
> But you also say
>
> > If your lang spec says unicode, you have to support BOM mark
>
> So I am not clear what your stand is...
>
> Let me make my own position clear:
> The de jure unicode standard is set by the unicode consortium (or
> whatever it's called)
> The de facto standard is set by microsoft and java
> The two conflict

BOM mark is part of the unicode standard. If a tech declares full
support for unicode, support for BOM mark is necessary.

BOM mark is a hack, but so is the unix shebang mark. With BOM mark being
a given, there wouldn't be any problem if utf-8 hadn't been invented.
utf-8 was invented by unix fanatic Rob Pike largely to help the unix
world move forward to unicode. As it is, BOM mark conflicts with the
spirit of utf-8 (because utf-8 is meant to be ASCII compatible as is,
yet the BOM mark byte sequence isn't in ASCII.)

i read the link Thien-Thi Nguyen posted 〔http://www.utf8everywhere.org/〕.
At first i found it very informative, but in the end i wasn't convinced
by its opinion that we should all adopt utf-8 instead of utf-16. I think
if one switches attitude, and sees utf-8 as the hack that introduced all
these problems, then many of their arguments for utf-8 don't stand.

side note... about that site, it's Windows oriented. As such, they
didn't explain many of the terms and Windows tech they use, e.g. i have
little idea what they mean by narrowchar or widechar, nor do i know the
many Windows libraries they mention.

also, the site is decidedly western-mind oriented. They forgot that in
china, the encoding used is GB 18030, which has the same char set as
unicode but a different encoding, and is also compatible with ascii. No
utf-8 nor utf-anything whatsoever. Chinese web traffic is like half
of the world's or something.
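
To see "same char set, different encoding" in bytes, here is a small
sketch in Python (an arbitrary language choice; encodings_of and the
sample characters are made up for illustration). It assumes Python's
bundled gb18030, utf-8 and utf-16-le codecs and just prints the bytes
each one produces for the same character.

```python
# Sample characters: ASCII letter, right single quote (U+2019), CJK ideograph.
SAMPLES = ["A", "\u2019", "\u4e2d"]
CODECS = ("utf-8", "utf-16-le", "gb18030")


def encodings_of(ch: str) -> dict:
    """Map each codec name to the bytes it produces for one character."""
    return {codec: ch.encode(codec) for codec in CODECS}


if __name__ == "__main__":
    for ch in SAMPLES:
        for codec, data in encodings_of(ch).items():
            # Show the code point, the codec, and the encoded bytes in hex.
            print(f"U+{ord(ch):04X}  {codec:<9}  {data.hex(' ')}")
```

The letter "A" comes out as the single byte 41 in both utf-8 and
gb18030 (the ascii compatibility), while utf-16 pads it to two bytes.
U+2019 in utf-8 is e2 80 99, i.e. octal 342 200 231, the same three
bytes reported at the start of this thread.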

the site wishes utf-16 to go away. Windows, Mac, NTFS, HFS+ file
systems, all utf-16, plus java, C# etc. Though the web (html, xml, css)
is all utf-8. Neither is likely to go away. If Java and C# and NTFS
disappeared from the face of this earth, then maybe. lol. :D

 Xah


* Re: those funny non-ASCII characters
  2012-06-02 11:54             ` Xah Lee
@ 2012-06-02 14:10               ` Xah Lee
  0 siblings, 0 replies; 25+ messages in thread
From: Xah Lee @ 2012-06-02 14:10 UTC (permalink / raw)
  To: help-gnu-emacs

Correction: utf-8 was invented by (both) Ken Thompson and Rob Pike.
Only the latter is a unix fanatic.

 Xah



end of thread, other threads:[~2012-06-02 14:10 UTC | newest]

Thread overview: 25+ messages
2012-05-30 17:15 those funny non-ASCII characters Buchs, Kevin
2012-05-31  7:17 ` Thien-Thi Nguyen
2012-05-31 14:57   ` Buchs, Kevin
2012-05-31 16:40     ` Thien-Thi Nguyen
2012-05-31 16:56       ` Buchs, Kevin
2012-05-31 21:46         ` Thien-Thi Nguyen
2012-06-01 13:36           ` Doug Lewan
     [not found]         ` <mailman.2041.1338500734.855.help-gnu-emacs@gnu.org>
2012-06-01  2:42           ` rusi
2012-05-31 15:59 ` PJ Weisberg
     [not found] <mailman.1961.1338398127.855.help-gnu-emacs@gnu.org>
2012-06-01  4:23 ` Jason Rumney
2012-06-01  5:43   ` rusi
2012-06-01  6:12     ` Eli Zaretskii
2012-06-01  7:03     ` Xah Lee
2012-06-01 16:26       ` rusi
2012-06-01 21:06         ` Xah Lee
2012-06-02  3:17           ` rusi
2012-06-02 11:54             ` Xah Lee
2012-06-02 14:10               ` Xah Lee
     [not found] <mailman.1665.1337953237.855.help-gnu-emacs@gnu.org>
2012-05-25 18:33 ` Xah Lee
  -- strict thread matches above, loose matches on Subject: below --
2012-05-25 13:40 Buchs, Kevin
2012-05-25 14:04 ` Eli Zaretskii
2012-05-25 14:42 ` Jambunathan K
     [not found] <mailman.1638.1337903381.855.help-gnu-emacs@gnu.org>
2012-05-25  0:56 ` Xah Lee
2012-05-24 23:49 Buchs, Kevin
2012-05-25  6:36 ` Eli Zaretskii
