From: Stefan Monnier
Newsgroups: gmane.emacs.help
Subject: Re: character encoding question
Date: Wed, 20 Feb 2013 09:42:49 -0500
References: <87mwuzze5s.fsf@ericabrahamsen.net>
To: help-gnu-emacs@gnu.org

> So the character 中 has a codepoint of #o47055 in octal notation.

Internally, in your Emacs, yes.  This value actually depends on the
internal representation chosen by Emacs, which happens to be Unicode
since Emacs-23 (and was something else before).

> Meanwhile:
> (string-as-unibyte "中") --> \344\270\255

This again shows the internal byte representation of this char inside
a buffer, which is utf-8 since Emacs-23 and was something else before.

Strong recommendation: stay far away from string-as-* because that will
mess you up.  You want instead to use encode-coding-string.  E.g.

   (encode-coding-string "中" 'utf-8) ==> "\344\270\255"

> What's the correspondence between these bytes and the multibyte
> character's octal codepoint?

#o47055 is not "multibyte".  It's just its "name" aka "codepoint".
"\344\270\255" is one of its multibyte encodings.

> Are there any functions that will get from one to the other?

   (encode-coding-string (string #o47055) 'utf-8) ==> "\344\270\255"

> Given a series of mystery bytes, can I test them against different
> charsets, and see what gibberish Emacs comes up with?

   (decode-coding-string "\344\270\255" 'utf-8) ==> "中"

-- 
Stefan
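
A rough sketch of how the two calls above could be wrapped up, for
anyone who wants to experiment.  This is not part of the original reply:
the helper names my-codepoint-to-bytes and my-try-decodings are made up
for illustration, and using latin-1 and gbk as extra candidates assumes
those coding systems are available in your Emacs.

```elisp
;; Hypothetical helpers; only encode-coding-string and
;; decode-coding-string come from the message above.

(defun my-codepoint-to-bytes (codepoint coding)
  "Return the list of byte values that encode CODEPOINT under CODING."
  ;; encode-coding-string returns a unibyte string; appending nil
  ;; turns it into a plain list of integers in the range 0..255.
  (append (encode-coding-string (string codepoint) coding) nil))

(defun my-try-decodings (bytes codings)
  "Decode the unibyte string BYTES under each coding system in CODINGS.
Return an alist of (CODING . DECODED-TEXT)."
  (mapcar (lambda (coding)
            (cons coding (decode-coding-string bytes coding)))
          codings))

;; Code point -> bytes: (228 184 173), i.e. \344\270\255 in octal.
(my-codepoint-to-bytes #o47055 'utf-8)

;; Mystery bytes -> one guess per candidate coding system.
;; Only the utf-8 entry is expected to be sensible here; the other
;; entries are the "gibberish" to compare against.
(my-try-decodings "\344\270\255" '(utf-8 latin-1 gbk))
```

Evaluating the last form (e.g. with M-: or in *scratch*) shows the
candidates side by side, which is usually enough to spot the right
coding system by eye.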