From mboxrd@z Thu Jan 1 00:00:00 1970
From: Kenichi Handa
Newsgroups: gmane.emacs.bugs
Subject: bug#12803: 24.3.50; accented Thai Unicode characters are turned into decomposed ones on Mac OS X by replace-regexp
Date: Mon, 05 Nov 2012 23:41:58 +0900
Message-ID: <87liegt8ll.fsf@gnu.org>
In-Reply-To: (message from Peter Dyballa on Sun, 4 Nov 2012 23:35:58 +0100)
To: Peter Dyballa
Cc: 12803@debbugs.gnu.org
Content-Type: text/plain; charset=utf-8
X-GNU-PR-Message: followup 12803
X-GNU-PR-Package: emacs
X-GNU-PR-Keywords: notabug

In article , Peter Dyballa writes:

> I wanted to get the unique Thai characters from such an eMail subject:
> FW:grcthai
> สร้างรายได้แบบไร้ขีดจำกัด กับการทำงานแบบไร้ขอบเขต..

> So I marked the Thai text and invoked replace-regexp with "\(.\)" -> ”\1 "
> to later do replace-string " " -> "C-qC-j" and then [g]sort -u the result.
> I had in buffer *Shell Command Output* decomposed Thai Unicode characters…

> But actually it is already the function replace-regexp which produces the
> decomposed characters (originally 41 characters, after replace-regexp not
> 82 but 89 according to column-number-mode).

There's no such thing as an "accented Thai Unicode character". Your
example is not originally 41 characters; it's just originally 41
columns on display. For Thai, Unicode doesn't assign a character code,
for instance, to "ร้". It's a two-character sequence, and on display
it's composed into one grapheme cluster occupying one column.

The stranger-looking example is "จำ". It's a two-character sequence
too, but the first character is จ and the second is ำ. Unicode doesn't
have a character "จ with small-circle-above".

---
Kenichi Handa
handa@gnu.org
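
The point about "ร้" and "จำ" can also be checked outside Emacs. The following is only a minimal sketch in Python using the standard unicodedata module. The helper name describe is made up for illustration. It lists the code points behind each cluster and shows that NFC normalization, which composes wherever Unicode defines a precomposed character, leaves both sequences at two code points.

```python
import unicodedata


def describe(text):
    """Return (code point, Unicode name, general category) for each character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))
        for ch in text
    ]


# The two clusters discussed above, written as escapes so the
# individual code points are visible: "ร้" and "จำ".
SAMPLES = ("\u0e23\u0e49", "\u0e08\u0e33")

for sample in SAMPLES:
    # NFC would merge the pair if a precomposed character existed.
    composed = unicodedata.normalize("NFC", sample)
    print(sample, "code points:", len(sample), "after NFC:", len(composed))
    for code_point, name, category in describe(sample):
        print("   ", code_point, category, name)
```

Category Mn marks a non-spacing combining mark (the tone mark in "ร้"), while the second character of "จำ" is reported as an ordinary letter (Lo), even though it is drawn partly above the preceding consonant.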