unofficial mirror of bug-gnu-emacs@gnu.org 
* bug#12803: 24.3.50; accented Thai Unicode characters are turned into decomposed ones on Mac OS X by replace-regexp
@ 2012-11-04 22:35 Peter Dyballa
  2012-11-05  3:45 ` Stefan Monnier
  2012-11-05 14:41 ` Kenichi Handa
  0 siblings, 2 replies; 4+ messages in thread
From: Peter Dyballa @ 2012-11-04 22:35 UTC
  To: 12803

Hello!

I wanted to get the unique Thai characters from such an eMail subject:

	FW:grcthai สร้างรายได้แบบไร้ขีดจำกัด กับการทำงานแบบไร้ขอบเขต..

So I marked the Thai text and invoked replace-regexp with "\(.\)" -> "\1 " so that I could then run replace-string " " -> "C-q C-j" and [g]sort -u the result. In the buffer *Shell Command Output* I got decomposed Thai Unicode characters…

But actually it is already the function replace-regexp which produces the decomposed characters (originally 41 characters, after replace-regexp not 82 but 89 according to column-number-mode).

Mac OS X 10.6.8; the fonts used are FreeSerif for the Thai characters and George Williams' Monospace Regular for SPACE. The result is the same when I use GTK2, and it also makes no difference when I use a native 64-bit binary (and libs).


In GNU Emacs 24.3.50.1 (i386-apple-darwin10.8.0, X toolkit, Xaw3d scroll bars)
 of 2012-11-04 on Sumac.local
Bzr revision: 110798 eggert@cs.ucla.edu-20121104172952-vvhdy8gmbtgj0c3w
Windowing system distributor `The X.Org Foundation', version 11.0.11300000
Configured using:
 `configure '--build=x86_64-apple-darwin10.8.0'
 '--host=i386-apple-darwin10.8.0' '--target=i386-apple-darwin10.8.0'
 '--without-pop' '--without-sound' '--without-gpm' '--without-dbus'
 '--without-selinux' '--with-x-toolkit=athena'
 '--disable-ns-self-contained' '--without-xpm' '--without-jpeg'
 '--without-tiff' '--without-gif' '--without-png'
 '--x-libraries=/usr/X11/lib' '--x-includes=/usr/X11/include'
 '--enable-locallisppath=/Library/Application
 Support/Emacs/calendar24:/Library/Application Support/Emacs'
 'CFLAGS=-g3 -H -pipe -fPIC -fno-common -Os -march=core2 -mtune=core2
 -m32 -fomit-frame-pointer -msse4.2' 'LDFLAGS=-m32
 -Wl,-dead_strip_dylibs -Wl,-bind_at_load -Wl,-t'
 'CPPFLAGS=-I/sw/include' 'CC=clang' 'CXX=clang++'
 'PKG_CONFIG_PATH=/sw/lib/xft2/lib/pkgconfig:/sw/share/pkgconfig:/sw/lib/pkgconfig:/usr/X11/lib/pkgconfig:/usr/X11/share/pkgconfig:/usr/lib/pkgconfig'
 'build_alias=x86_64-apple-darwin10.8.0'
 'host_alias=i386-apple-darwin10.8.0'
 'target_alias=i386-apple-darwin10.8.0''

Important settings:
  value of $LC_CTYPE: de_DE.UTF-8
  value of $LANG: de_DE.UTF-8
  locale-coding-system: utf-8-unix
  default enable-multibyte-characters: t

Major mode: Lisp Interaction

Minor modes in effect:
  tooltip-mode: t
  mouse-wheel-mode: t
  tool-bar-mode: t
  menu-bar-mode: t
  file-name-shadow-mode: t
  global-font-lock-mode: t
  font-lock-mode: t
  blink-cursor-mode: t
  auto-composition-mode: t
  auto-encryption-mode: t
  auto-compression-mode: t
  column-number-mode: t
  line-number-mode: t
  transient-mark-mode: t

Recent input:
<down-mouse-1> <mouse-1> <help-echo> <down-mouse-1> 
<mouse-1> <down-mouse-2> <mouse-2> <down-mouse-1> <mouse-1> 
<backspace> C-a <escape> x r e p l <tab> r e g <tab> 
<return> \ ( . \ ) <return> \ 1 SPC <return> M-x c 
o l <tab> <return> C-a C-u C-x = <right> C-u C-x = 
<right> C-u C-x = <right> <right> C-u C-x = <help-echo> 
<help-echo> <help-echo> <help-echo> <help-echo> <help-echo> 
<help-echo> <help-echo> <menu-bar> <help-menu> <send-emacs-bug-report>

Recent messages:
Replaced 48 occurrences
Column-Number mode enabled
Type C-x 1 to delete the help window, C-M-v to scroll help.
Char: ส (3626, #o7052, #xe2a, file ...) point=192 of 287 (67%) column=0

Char: SPC (32, #o40, #x20) point=193 of 287 (67%) column=1

Char: ร (3619, #o7043, #xe23, file ...) point=194 of 287 (67%) column=2

Char: ้ (3657, #o7111, #xe49, file ...) point=196 of 287 (68%) column=4

Load-path shadows:
None found.

Features:
(shadow sort gnus-util mail-extr emacsbug message format-spec rfc822 mml
mml-sec mm-decode mm-bodies mm-encode mail-parse rfc2231 mailabbrev
gmm-utils mailheader sendmail rfc2047 rfc2045 ietf-drums mm-util
mail-prsvr mail-utils pp wid-edit descr-text help-mode easymenu
cus-start cus-load thai-util thai-word mule-util time-date tooltip
ediff-hook vc-hooks lisp-float-type mwheel x-win x-dnd tool-bar dnd
fontset image regexp-opt fringe tabulated-list newcomment lisp-mode
register page menu-bar rfn-eshadow timer select scroll-bar mouse
jit-lock font-lock syntax facemenu font-core frame cham georgian
utf-8-lang misc-lang vietnamese tibetan thai tai-viet lao korean
japanese hebrew greek romanian slovak czech european ethiopic indian
cyrillic chinese case-table epa-hook jka-cmpr-hook help simple abbrev
minibuffer loaddefs button faces cus-face macroexp files text-properties
overlay sha1 md5 base64 format env code-pages mule custom widget
hashtable-print-readable backquote make-network-process dynamic-setting
system-font-setting font-render-setting x-toolkit x multi-tty emacs)


--
Greetings

  Pete

The problem with the French is that they don't have a word for « entrepreneur ».
				– Georges W. Bush







* bug#12803: 24.3.50; accented Thai Unicode characters are turned into decomposed ones on Mac OS X by replace-regexp
  2012-11-04 22:35 bug#12803: 24.3.50; accented Thai Unicode characters are turned into decomposed ones on Mac OS X by replace-regexp Peter Dyballa
@ 2012-11-05  3:45 ` Stefan Monnier
  2012-11-05  9:49   ` Peter Dyballa
  2012-11-05 14:41 ` Kenichi Handa
  1 sibling, 1 reply; 4+ messages in thread
From: Stefan Monnier @ 2012-11-05  3:45 UTC
  To: Peter Dyballa; +Cc: 12803

tags 12803 notabug
thanks

> But actually it is already the function replace-regexp which produces
> the decomposed characters (originally 41 characters, after
> replace-regexp not 82 but 89 according to column-number-mode).

Composition is done on-the-fly in the display code, so what you're
seeing is the expected behavior.
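
A minimal sketch of why the count grows, using Python's `re` module as a
stand-in for Emacs's regexp engine (an assumption for illustration only; the
two engines differ, but in both `.` matches one code point at a time, not one
displayed column). The sample string is ร้ written with explicit escapes.

```python
import re

# "ร้" as stored in the buffer: RO RUA (U+0E23) followed by MAI THO (U+0E49).
text = "\u0e23\u0e49"

# Insert a space after every code point, analogous to \(.\) -> \1 SPC.
spaced = re.sub(r"(.)", r"\1 ", text)

print(len(text))    # number of code points before the replacement
print(len(spaced))  # twice as many: every code point gets its own space
print([hex(ord(c)) for c in spaced])
```

Once the space sits between the consonant and the tone mark, the display code
no longer has an adjacent pair to compose, so the two are drawn separately.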


        Stefan






* bug#12803: 24.3.50; accented Thai Unicode characters are turned into decomposed ones on Mac OS X by replace-regexp
  2012-11-05  3:45 ` Stefan Monnier
@ 2012-11-05  9:49   ` Peter Dyballa
  0 siblings, 0 replies; 4+ messages in thread
From: Peter Dyballa @ 2012-11-05  9:49 UTC
  To: Stefan Monnier; +Cc: 12803



On 05.11.2012 at 04:45, Stefan Monnier wrote:

> tags 12803 notabug
> thanks
> 
>> But actually it is already the function replace-regexp which produces
>> the decomposed characters (originally 41 characters, after
>> replace-regexp not 82 but 89 according to column-number-mode).
> 
> Composition is done on-the-fly in the display code, so what you're
> seeing is the expected behavior.
> 
What I see is no composition:


[Attachment: Thai Unicode de-composed.png (image/png)]



The upper line shows the original text from the subject line; the lower line shows the decimated and decomposed result. I see almost no composition (just อั and ำ), because accent and character were separated by sorting. Do I need to customise anything in order to keep (or to obtain) the composed forms?

(I think I've had the same before with Cyrillic texts.)

--
Greetings

  Pete                           <]
             o        __o         |__    o       HPV, the real
    ___o    /I       -\<,         |o \  -\),-%     high speed!
___/\ /\___./ \___...O/ O____.....`-O-'-()--o_________________



* bug#12803: 24.3.50; accented Thai Unicode characters are turned into decomposed ones on Mac OS X by replace-regexp
  2012-11-04 22:35 bug#12803: 24.3.50; accented Thai Unicode characters are turned into decomposed ones on Mac OS X by replace-regexp Peter Dyballa
  2012-11-05  3:45 ` Stefan Monnier
@ 2012-11-05 14:41 ` Kenichi Handa
  1 sibling, 0 replies; 4+ messages in thread
From: Kenichi Handa @ 2012-11-05 14:41 UTC
  To: Peter Dyballa; +Cc: 12803

In article <DF4C7EEF-CE55-4363-A91A-0577DD28AEED@freenet.de>, Peter Dyballa <peter_dyballa@freenet.de> writes:

> I wanted to get the unique Thai characters from such an eMail subject:

> 	FW:grcthai สร้างรายได้แบบไร้ขีดจำกัด กับการทำงานแบบไร้ขอบเขต..

> So I marked the Thai text and invoked replace-regexp with "\(.\)" -> "\1 " so that I could then run replace-string " " -> "C-q C-j" and [g]sort -u the result. In the buffer *Shell Command Output* I got decomposed Thai Unicode characters…

> But actually it is already the function replace-regexp which produces the decomposed characters (originally 41 characters, after replace-regexp not 82 but 89 according to column-number-mode).

There is no such thing as an "accented Thai Unicode character".

Your example is not originally 41 characters; it is just
originally 41 columns on display.

For Thai, Unicode doesn't assign a character code to, for
instance, "ร้".  It is a two-character sequence which, when
displayed, is composed into one grapheme cluster
occupying one column.
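
A small sketch that lists the code points behind "ร้", using Python's
standard `unicodedata` module. The helper name `describe` is made up for
this illustration; it only wraps standard library calls.

```python
import unicodedata

def describe(text):
    """Return (code point, Unicode name, general category) per character."""
    return [(f"U+{ord(c):04X}", unicodedata.name(c), unicodedata.category(c))
            for c in text]

ro_rua_mai_tho = "\u0e23\u0e49"  # displayed in one column as ร้

for entry in describe(ro_rua_mai_tho):
    print(entry)

# NFC normalization cannot merge the pair, because Unicode defines no
# precomposed code point for it.
print(unicodedata.normalize("NFC", ro_rua_mai_tho) == ro_rua_mai_tho)
```

The base consonant has general category Lo (letter) and the tone mark Mn
(nonspacing mark); only the display layer joins them.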

A stranger-looking example is "จำ".  It is also a two-character
sequence, but the first character is จ and the
second is ำ.  Unicode doesn't have a character "จ with
small-circle-above".
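
If the goal is the unique display units rather than the unique code points,
one option is to group each base character with the marks that follow it
before sorting. The sketch below is a rough approximation in Python, not what
Emacs's composition code does; the names `clusters` and `SARA_AM` are
hypothetical. SARA AM (U+0E33) is special-cased because its general category
is Lo, even though Unicode's grapheme cluster rules (UAX #29) treat it as a
spacing mark that stays with the preceding consonant.

```python
import unicodedata

SARA_AM = "\u0e33"  # category Lo, but it belongs to the preceding consonant

def clusters(text):
    """Split text into approximate display units (base plus following marks)."""
    result = []
    for ch in text:
        attaches = (unicodedata.category(ch) in ("Mn", "Mc", "Me")
                    or ch == SARA_AM)
        if attaches and result:
            result[-1] += ch   # keep the mark with its base character
        else:
            result.append(ch)  # start a new display unit
    return result

sample = "\u0e23\u0e49\u0e08\u0e33\u0e23\u0e49"  # ร้ จำ ร้ with no spaces
units = clusters(sample)
print(units)
print(sorted(set(units)))  # unique units, marks still attached
```

Sorting the units instead of the individual code points keeps each tone mark
or vowel sign next to its consonant, so the result can still be composed on
display.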

---
Kenichi Handa
handa@gnu.org




