* bug#12803: 24.3.50; accented Thai Unicode characters are turned into decomposed ones on Mac OS X by replace-regexp
@ 2012-11-04 22:35 Peter Dyballa
2012-11-05 3:45 ` Stefan Monnier
2012-11-05 14:41 ` Kenichi Handa
0 siblings, 2 replies; 4+ messages in thread
From: Peter Dyballa @ 2012-11-04 22:35 UTC
To: 12803
Hello!
I wanted to get the unique Thai characters from an email subject like this:
FW:grcthai สร้างรายได้แบบไร้ขีดจำกัด กับการทำงานแบบไร้ขอบเขต..
So I marked the Thai text and invoked replace-regexp with "\(.\)" -> "\1 ", to later do replace-string " " -> "C-q C-j" and then [g]sort -u the result. In the buffer *Shell Command Output* I had decomposed Thai Unicode characters…
But actually it is already the function replace-regexp which produces the decomposed characters (originally 41 characters, after replace-regexp not 82 but 89 according to column-number-mode).
Mac OS X 10.6.8; the fonts used are FreeSerif for the Thai characters and George Williams' Monospace Regular for SPACE. The result is the same when I use GTK2, and it also makes no difference when I use a native 64-bit binary (and libs).
In GNU Emacs 24.3.50.1 (i386-apple-darwin10.8.0, X toolkit, Xaw3d scroll bars)
of 2012-11-04 on Sumac.local
Bzr revision: 110798 eggert@cs.ucla.edu-20121104172952-vvhdy8gmbtgj0c3w
Windowing system distributor `The X.Org Foundation', version 11.0.11300000
Configured using:
`configure '--build=x86_64-apple-darwin10.8.0'
'--host=i386-apple-darwin10.8.0' '--target=i386-apple-darwin10.8.0'
'--without-pop' '--without-sound' '--without-gpm' '--without-dbus'
'--without-selinux' '--with-x-toolkit=athena'
'--disable-ns-self-contained' '--without-xpm' '--without-jpeg'
'--without-tiff' '--without-gif' '--without-png'
'--x-libraries=/usr/X11/lib' '--x-includes=/usr/X11/include'
'--enable-locallisppath=/Library/Application
Support/Emacs/calendar24:/Library/Application Support/Emacs'
'CFLAGS=-g3 -H -pipe -fPIC -fno-common -Os -march=core2 -mtune=core2
-m32 -fomit-frame-pointer -msse4.2' 'LDFLAGS=-m32
-Wl,-dead_strip_dylibs -Wl,-bind_at_load -Wl,-t'
'CPPFLAGS=-I/sw/include' 'CC=clang' 'CXX=clang++'
'PKG_CONFIG_PATH=/sw/lib/xft2/lib/pkgconfig:/sw/share/pkgconfig:/sw/lib/pkgconfig:/usr/X11/lib/pkgconfig:/usr/X11/share/pkgconfig:/usr/lib/pkgconfig'
'build_alias=x86_64-apple-darwin10.8.0'
'host_alias=i386-apple-darwin10.8.0'
'target_alias=i386-apple-darwin10.8.0''
Important settings:
value of $LC_CTYPE: de_DE.UTF-8
value of $LANG: de_DE.UTF-8
locale-coding-system: utf-8-unix
default enable-multibyte-characters: t
Major mode: Lisp Interaction
Minor modes in effect:
tooltip-mode: t
mouse-wheel-mode: t
tool-bar-mode: t
menu-bar-mode: t
file-name-shadow-mode: t
global-font-lock-mode: t
font-lock-mode: t
blink-cursor-mode: t
auto-composition-mode: t
auto-encryption-mode: t
auto-compression-mode: t
column-number-mode: t
line-number-mode: t
transient-mark-mode: t
Recent input:
<down-mouse-1> <mouse-1> <help-echo> <down-mouse-1>
<mouse-1> <down-mouse-2> <mouse-2> <down-mouse-1> <mouse-1>
<backspace> C-a <escape> x r e p l <tab> r e g <tab>
<return> \ ( . \ ) <return> \ 1 SPC <return> M-x c
o l <tab> <return> C-a C-u C-x = <right> C-u C-x =
<right> C-u C-x = <right> <right> C-u C-x = <help-echo>
<help-echo> <help-echo> <help-echo> <help-echo> <help-echo>
<help-echo> <help-echo> <menu-bar> <help-menu> <send-emacs-bug-report>
Recent messages:
Replaced 48 occurrences
Column-Number mode enabled
Type C-x 1 to delete the help window, C-M-v to scroll help.
Char: ส (3626, #o7052, #xe2a, file ...) point=192 of 287 (67%) column=0
Char: SPC (32, #o40, #x20) point=193 of 287 (67%) column=1
Char: ร (3619, #o7043, #xe23, file ...) point=194 of 287 (67%) column=2
Char: ้ (3657, #o7111, #xe49, file ...) point=196 of 287 (68%) column=4
Load-path shadows:
None found.
Features:
(shadow sort gnus-util mail-extr emacsbug message format-spec rfc822 mml
mml-sec mm-decode mm-bodies mm-encode mail-parse rfc2231 mailabbrev
gmm-utils mailheader sendmail rfc2047 rfc2045 ietf-drums mm-util
mail-prsvr mail-utils pp wid-edit descr-text help-mode easymenu
cus-start cus-load thai-util thai-word mule-util time-date tooltip
ediff-hook vc-hooks lisp-float-type mwheel x-win x-dnd tool-bar dnd
fontset image regexp-opt fringe tabulated-list newcomment lisp-mode
register page menu-bar rfn-eshadow timer select scroll-bar mouse
jit-lock font-lock syntax facemenu font-core frame cham georgian
utf-8-lang misc-lang vietnamese tibetan thai tai-viet lao korean
japanese hebrew greek romanian slovak czech european ethiopic indian
cyrillic chinese case-table epa-hook jka-cmpr-hook help simple abbrev
minibuffer loaddefs button faces cus-face macroexp files text-properties
overlay sha1 md5 base64 format env code-pages mule custom widget
hashtable-print-readable backquote make-network-process dynamic-setting
system-font-setting font-render-setting x-toolkit x multi-tty emacs)
--
Greetings
Pete
The problem with the French is that they don't have a word for « entrepreneur ».
– Georges W. Bush
* bug#12803: 24.3.50; accented Thai Unicode characters are turned into decomposed ones on Mac OS X by replace-regexp
From: Stefan Monnier @ 2012-11-05 3:45 UTC
To: Peter Dyballa; +Cc: 12803
tags 12803 notabug
thanks
> But actually it is already the function replace-regexp which produces
> the decomposed characters (originally 41 characters, after
> replace-regexp not 82 but 89 according to column-number-mode).
Composition is done on-the-fly in the display code, so what you're
seeing is expected.
Stefan
* bug#12803: 24.3.50; accented Thai Unicode characters are turned into decomposed ones on Mac OS X by replace-regexp
From: Peter Dyballa @ 2012-11-05 9:49 UTC
To: Stefan Monnier; +Cc: 12803
On 05.11.2012 at 04:45, Stefan Monnier wrote:
> tags 12803 notabug
> thanks
>
>> But actually it is already the function replace-regexp which produces
>> the decomposed characters (originally 41 characters, after
>> replace-regexp not 82 but 89 according to column-number-mode).
>
> Composition is done on-the-fly in the display code, so what you're
> seeing is expected.
>
What I see is no composition:
[Attachment: Thai Unicode de-composed.png (image/png)]
The upper line shows the original text from the subject line, the lower line shows the decimated and decomposed result. I see almost no composition (just อั and ำ) – because accent and character were separated by sorting. Do I need to customise anything in order to keep (or to receive) the composed forms?
(I think I've had the same before with Cyrillic texts.)
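One way to keep each mark attached to its base character, without relying on display-time composition, is to split the text into base-plus-marks groups before sorting. The Python sketch below is an illustration only and not something proposed in this thread: the helper name `clusters` is hypothetical, and grouping by Unicode general category `M*` is a simplification of full grapheme cluster segmentation (UAX #29).

```python
import unicodedata


def clusters(text):
    """Split text into groups of one base character plus any following marks.

    Simplification (an assumption of this sketch): a code point whose
    general category starts with "M" (Mn, Mc, Me) is attached to the
    preceding group. This is not full UAX #29 segmentation.
    """
    groups = []
    for ch in text:
        if groups and unicodedata.category(ch).startswith("M"):
            groups[-1] += ch  # combining mark: keep it with its base
        else:
            groups.append(ch)  # base character (or a leading mark): new group
    return groups


# First word of the Thai text in the subject line: สร้าง
word = "\u0e2a\u0e23\u0e49\u0e32\u0e07"
unique = sorted(set(clusters(word)))
print(len(word), len(clusters(word)), unique)
```

Sorting and de-duplicating the groups rather than the individual code points means a tone mark such as U+0E49 never ends up on a line of its own.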
--
Greetings
Pete
* bug#12803: 24.3.50; accented Thai Unicode characters are turned into decomposed ones on Mac OS X by replace-regexp
From: Kenichi Handa @ 2012-11-05 14:41 UTC
To: Peter Dyballa; +Cc: 12803
In article <DF4C7EEF-CE55-4363-A91A-0577DD28AEED@freenet.de>, Peter Dyballa <peter_dyballa@freenet.de> writes:
> I wanted to get the unique Thai characters from an email subject like this:
> FW:grcthai สร้างรายได้แบบไร้ขีดจำกัด กับการทำงานแบบไร้ขอบเขต..
> So I marked the Thai text and invoked replace-regexp with "\(.\)" -> "\1 ", to later do replace-string " " -> "C-q C-j" and then [g]sort -u the result. In the buffer *Shell Command Output* I had decomposed Thai Unicode characters…
> But actually it is already the function replace-regexp which produces the decomposed characters (originally 41 characters, after replace-regexp not 82 but 89 according to column-number-mode).
There is no such thing as an "accented Thai Unicode character".
Your example is not originally 41 characters; it is just
originally 41 columns on display.
For Thai, Unicode doesn't assign a character code to, for
instance, "ร้". It's a two-character sequence, and when
displayed, it's composed into one grapheme cluster
occupying one column.
A stranger-looking example is "จำ". It's also a two-character
sequence: the first character is จ and the
second is ำ. Unicode doesn't have a character "จ with
small-circle-above".
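The point above can be checked outside Emacs. The following Python sketch is an illustration, not part of the original reply. It uses the standard `unicodedata` module to show that each of the two examples is a sequence of two code points, and that NFC normalization leaves them unchanged because Unicode defines no precomposed form for them. The dictionary labels are only descriptive names chosen for this sketch.

```python
import unicodedata

examples = {
    "ro rua + mai tho": "\u0e23\u0e49",    # the first example above
    "cho chan + sara am": "\u0e08\u0e33",  # the second example above
}

for label, text in examples.items():
    # List every code point in the sequence with its Unicode name.
    names = [f"U+{ord(c):04X} {unicodedata.name(c)}" for c in text]
    # NFC would merge base + mark if a precomposed character existed.
    nfc = unicodedata.normalize("NFC", text)
    print(label, len(text), names, nfc == text)
```

Both sequences report a length of 2, and the NFC result compares equal to the input, which is consistent with the text staying "decomposed" in the buffer while the display engine draws it as one cluster.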
---
Kenichi Handa
handa@gnu.org