From mboxrd@z Thu Jan 1 00:00:00 1970
Path: news.gmane.org!not-for-mail
From: Eli Zaretskii
Newsgroups: gmane.emacs.help
Subject: Re: those funny non-ASCII characters
Date: Fri, 25 May 2012 17:04:00 +0300
Message-ID: <83r4u8we5r.fsf@gnu.org>
To: help-gnu-emacs@gnu.org
List-Id: Users list for the GNU Emacs text editor
Xref: news.gmane.org gmane.emacs.help:84971

> Date: Fri, 25 May 2012 08:40:25 -0500
> From: "Buchs, Kevin"
>
> Thanks, Xah and Eli, for contributing to my further understanding. I
> went to a specific website where I got the content I copied and pasted
> and I can see from the HTML that it has a charset=UTF-8, so I understand
> that is Unicode 8-bit. Using the C-u C-x =, I see that the particular
> character I pasted has a code point of 0x2013 (U+2013). I didn't see,
> however, what the UTF-8 encoding of that code point was. Should I be
> able to read that somewhere on the buffer of information I get with C-u
> C-x = ?

Yes, this part of "C-u C-x ="'s display:

  file code: #xE2 #x80 #x93 (encoded by coding system utf-8-dos)

shows you how it would be encoded in UTF-8. If you see something like
"not encodable by ...", then you need to set the buffer's encoding
using "C-x RET f".

Under "file code", Emacs shows how the character would be encoded if
the buffer is saved to a disk file or sent to another program or as an
email message.

> I was poking around the www.unicode.org website, trying to
> understand how this U+2013 code point is encoded into UTF-8, but I
> haven't determined that yet.

See above: Emacs shows this under the right circumstances.

> So, help me piece together what happens as I paste the UTF-8 text into a
> buffer.
> First, the paste buffer must define that it is in UTF-8.

On Windows, Emacs always uses UTF-16 to pass text via the clipboard,
because doing so lets Emacs copy and paste any character from any
character set on Earth.

> Emacs reads this information and inserts it into the byte string
> that defines the buffer. Now, how does emacs record that it was a
> UTF-8 encoded character?

It doesn't. What it records is the encoding to be used for the
current buffer if it is saved to disk or sent to some program. That
encoding is a property of the buffer, not of the characters.

> Does it translate it into a different internal encoding

Yes, it does.

> Is this encoding used
> as a superset of all possible encoding systems that emacs supports?

Yes. See the section "Text Representations" in the ELisp manual that
comes with Emacs; you will find the details there.
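
To illustrate where the "file code" bytes #xE2 #x80 #x93 come from,
here is a minimal sketch in Python (not anything Emacs itself runs) of
the three-byte UTF-8 pattern that applies to code points from U+0800
through U+FFFF. The function name utf8_encode_3byte is hypothetical
and exists only for this example; it deliberately ignores the 1-, 2-
and 4-byte forms.

```python
def utf8_encode_3byte(code_point: int) -> bytes:
    """Encode a code point in U+0800..U+FFFF as three UTF-8 bytes.

    Illustrative helper only: it covers just the three-byte form.
    """
    if not 0x0800 <= code_point <= 0xFFFF:
        raise ValueError("this sketch only handles U+0800..U+FFFF")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded in UTF-8")
    byte1 = 0xE0 | (code_point >> 12)           # 1110xxxx: top 4 bits
    byte2 = 0x80 | ((code_point >> 6) & 0x3F)   # 10xxxxxx: middle 6 bits
    byte3 = 0x80 | (code_point & 0x3F)          # 10xxxxxx: low 6 bits
    return bytes([byte1, byte2, byte3])


encoded = utf8_encode_3byte(0x2013)
# Print the bytes in the same style as the "file code" line.
print(" ".join(f"#x{b:02X}" for b in encoded))  # prints "#xE2 #x80 #x93"

# Cross-check against Python's built-in UTF-8 encoder.
assert encoded == "\u2013".encode("utf-8")
```

The 16 bits of U+2013 (0010 0000 0001 0011) are split 4 + 6 + 6 and
placed after the fixed prefixes 1110, 10 and 10, which gives E2, 80
and 93.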