From: Xah Lee
Newsgroups: gmane.emacs.help
Subject: Re: those funny non-ASCII characters
Date: Fri, 25 May 2012 11:33:51 -0700 (PDT)
To: help-gnu-emacs@gnu.org
hope Eli answered all your questions. here are some additions.

• embrace unicode, because there's only going to be more and more of it. Many programming languages and formats default to unicode by spec (e.g. html/css/JavaScript, Java, Haskell, …). Most OSes (Windows, Mac) and file systems default to a unicode encoding now (not sure about linux). Even emacs, starting with emacs 23, uses unicode as its default internal encoding.

〈Unicode Popularity on Web by Google〉
http://xahlee.org/comp/unicode_on_web.html

• Unicode is about 2 things: ① a char set with an integer ID for each char. ② several encodings for the char set, the most popular being utf-8 and utf-16 (the latter is the default on Mac, Windows). (an encoding is a standard that turns a char from a char set into a byte sequence)

• in emacs, just put this in your init:

 (set-language-environment "UTF-8")

that should set all the encodings to utf-8, and shouldn't cause you any problem if all your current files and elisp files are ascii, because ascii is a compatible subset of utf-8/unicode.

• in emacs, call describe-char. That'll show the current char's encoding as well as the byte sequence used for that particular encoding. (this is emacs 24. Emacs 23 may not show the byte sequence... i don't recall.)

my unicode tutorial covers all these… feel free to ask me, or here, of course.

 Xah

On May 25, 6:40 am, "Buchs, Kevin" wrote:
> Thanks, Xah and Eli, for contributing to my further understanding.
> I went to a specific website where I got the content I copied and pasted
> and I can see from the HTML that it has a charset=UTF-8, so I understand
> that is Unicode 8-bit. Using the C-u C-x =, I see that the particular
> character I pasted has a code point of 0x2013 (U+2013). I didn't see,
> however, what the UTF-8 encoding of that code point was. Should I be
> able to read that somewhere on the buffer of information I get with C-u
> C-x = ? I was poking around the www.unicode.org website, trying to
> understand how this U+2013 code point is encoded into UTF-8, but I
> haven't determined that yet.
>
> A fresh buffer in emacs for me on my Win-7 box has an encoding system of
> iso-latin-1-dos. The coding system used to open and save files is the
> same.
>
> So, help me piece together what happens as I paste the UTF-8 text into a
> buffer. First, the paste buffer must define that it is in UTF-8. Emacs
> reads this information and inserts it into the byte string that defines
> the buffer. Now, how does emacs record that it was a UTF-8 encoded
> character? Does it translate it into a different internal encoding
> instead of just recording the 8 bits transferred? Is this encoding used
> as a superset of all possible encoding systems that emacs supports?
>
> Now, Xah, you suggest I embrace Unicode. What does that mean? Would it
> involve marking all my lisp library files and my org-mode files with the
> file variable -*- coding: utf-8 -*- ? Or is there another way to go
> Unicode automatically?
>
> I assume that if my lisp library files are encoded utf-8, then I can
> paste that character from the web page into my call to replace-string in
> order to substitute the longer dash of Unicode U+2013 with an ascii
> hyphen or double hyphen. But, how does that really work? If the lisp
> file is encoded utf-8, then how can I put an ascii character in the
> replacement string?
>
> I would appreciate it if someone could help me open this new door in my
> brain a bit further.
>
> Kevin Buchs | Senior Engineer | SPPDG | 507-538-5459 |
> buchs.ke...@mayo.edu
> Mayo Clinic | 200 First Street SW | Rochester, MN 55905 | http://www.mayo.edu/sppdg
>
> -----Original Message-----
>
> With cursor on that character, type "C-u C-x =", and Emacs will show
> everything it knows about that character, including its canonical
> name.
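
On the quoted question of how the U+2013 code point turns into UTF-8 bytes, and why an ascii hyphen is fine inside a utf-8 file: here is a minimal sketch. It is in Python rather than elisp, only because it is easy to run outside emacs. The function name utf8_encode is made up for illustration and is not an emacs or Unicode API. It hand-encodes a code point using the standard UTF-8 bit patterns (1, 2, 3 or 4 bytes depending on the size of the number).

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point as UTF-8 by hand (illustrative helper)."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a valid Unicode scalar value")
    if code_point < 0x80:
        # 1 byte: 0xxxxxxx. This is plain ASCII, byte for byte.
        return bytes([code_point])
    if code_point < 0x800:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


en_dash_bytes = utf8_encode(0x2013)   # the long dash from the web page
hyphen_bytes = utf8_encode(0x2D)      # the ascii hyphen

print(en_dash_bytes.hex(" "))         # three bytes for U+2013
print(hyphen_bytes.hex(" "))          # one byte, same as in ascii
```

So U+2013 falls in the 3-byte range and comes out as the bytes E2 80 93, while the ascii hyphen U+002D stays the single byte 2D. That is the sense in which ascii is a subset of utf-8: an ascii character in a utf-8 encoded file is stored exactly as it would be in an ascii file.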