From: Eric Abrahamsen
Newsgroups: gmane.emacs.help
Subject: character encoding question
Date: Wed, 20 Feb 2013 14:34:55 +0800
Message-ID: <87mwuzze5s.fsf@ericabrahamsen.net>
To: help-gnu-emacs@gnu.org
User-Agent: Gnus/5.130006 (Ma Gnus v0.6) Emacs/24.2 (gnu/linux)

I'm trying to get a better understanding of character encodings, as I often have to deal with mis-encoded or mystery-encoded files. I've read the Non-ASCII Characters section of the elisp manual, and have a fair sense of what's going on, with a couple of remaining questions.

So the character 中 has a codepoint of #o47055 in octal notation. Meanwhile:

(string-as-unibyte "中") --> \344\270\255

I understand that each of these three sections is a byte, also in octal. What's the correspondence between these bytes and the multibyte character's octal codepoint? Are there any functions that will get from one to the other?

Second question: If emacs can't guess the encoding of a file, it gives you an error message showing the bytes it can't decode, plus the charsets it tried to use. How do I replicate that process manually? Given a series of mystery bytes, can I test them against different charsets, and see what gibberish emacs comes up with? I guess I'm imagining something like "decode-char", except being able to feed it bytes instead of a character...

Thanks!
Eric
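
P.S. To make both questions concrete, here is a rough sketch of the kind of thing I have in mind. It assumes the three bytes above follow the UTF-8 three-byte layout (1110xxxx 10xxxxxx 10xxxxxx), which is what the numbers suggest. The helper names my-utf8-3-bytes-to-codepoint and my-try-decodings are made up for illustration and are not part of Emacs; encode-coding-string and decode-coding-string are existing functions. Whether this is the recommended way to go about it is exactly what I'm unsure of.

```elisp
;; Sketch only: the two `my-' helpers are hypothetical, not part of Emacs.

(defun my-utf8-3-bytes-to-codepoint (b1 b2 b3)
  "Combine a three-byte UTF-8 sequence B1 B2 B3 into a codepoint.
Assumes the 1110xxxx 10xxxxxx 10xxxxxx layout and does no validation."
  (logior (ash (logand b1 #x0f) 12)   ; low 4 bits of the lead byte
          (ash (logand b2 #x3f) 6)    ; low 6 bits of the first continuation byte
          (logand b3 #x3f)))          ; low 6 bits of the second continuation byte

;; #o344 #o270 #o255 are the bytes reported by `string-as-unibyte' above.
(my-utf8-3-bytes-to-codepoint #o344 #o270 #o255)     ; => 20013, i.e. #o47055

;; The same round trip through the coding-system functions:
(string-to-list (encode-coding-string "中" 'utf-8))  ; => (228 184 173)
(decode-coding-string "\344\270\255" 'utf-8)         ; => "中"

(defun my-try-decodings (bytes coding-systems)
  "Decode the unibyte string BYTES with each of CODING-SYSTEMS.
Return an alist of (CODING-SYSTEM . DECODED-STRING) for comparison."
  (mapcar (lambda (cs) (cons cs (decode-coding-string bytes cs)))
          coding-systems))

;; Try the same mystery bytes against a few candidate coding systems.
(my-try-decodings "\344\270\255" '(utf-8 latin-1 gbk big5))
```

If that is on the right track, the last call would be the manual version of the second question: one list entry per coding system, each showing whatever text (or gibberish) those bytes decode to.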