find-composition still depends on the composition property

unofficial mirror of emacs-devel@gnu.org 

* find-composition still depends on the composition property
@ 2008-08-29 13:46 Juanma Barranquero
  2008-09-05  1:24 ` Kenichi Handa
  0 siblings, 1 reply; 24+ messages in thread
From: Juanma Barranquero @ 2008-08-29 13:46 UTC (permalink / raw)
  To: Emacs Development, Kenichi Handa

So, for example, C-u M-x describe-char does not describe composition
of characters now.


        character: ಸ (3256, #o6270, #xcb8)
preferred charset: mule-unicode-0100-24ff (Unicode characters of the
range U+0100..U+24FF.)
       code point: 0x3F38
           syntax: w 	which means: word
      buffer code: #xE0 #xB2 #xB8
        file code: ESC #x24 #x2C #x31 #x3F #x38 (encoded by coding
system iso-2022-7bit-dos)
          display: by this font (glyph code)
    uniscribe:-outline-Tunga-normal-normal-normal-*-13-*-*-*-p-*-iso10646-1
(#x63)


vs. the output before the change:


        character: ಸ (3256, #o6270, #xcb8)
preferred charset: mule-unicode-0100-24ff (Unicode characters of the
range U+0100..U+24FF.)
       code point: 0x3F38
           syntax: w 	which means: word
      buffer code: #xE0 #xB2 #xB8
        file code: ESC #x24 #x2C #x31 #x3F #x38 (encoded by coding
system iso-2022-7bit-dos)
          display: composed to form "ಸ್ಕಾ" (see below)

Composed with the following character(s) "್ಕಾ" using this font:
  uniscribe:-outline-Tunga-normal-normal-normal-*-13-*-*-*-p-*-iso10646-1
by these glyphs:
  [#<font-object
"-outline-Tunga-normal-normal-normal-*-13-*-*-*-p-*-iso10646-1"> 18 8
13 13 9]
  [0 3 3256 177 7 8 0 13 9 nil]
  [0 3 3256 101 8 8 0 13 9 nil]
  [0 3 3256 180 3 3 -2 13 9 nil]


 Juanma

* Re: find-composition still depends on the composition property
  2008-08-29 13:46 find-composition still depends on the composition property Juanma Barranquero
@ 2008-09-05  1:24 ` Kenichi Handa
  2008-10-19 23:15   ` Juri Linkov
  0 siblings, 1 reply; 24+ messages in thread
From: Kenichi Handa @ 2008-09-05  1:24 UTC (permalink / raw)
  To: Juanma Barranquero; +Cc: emacs-devel

In article <f7ccd24b0808290646r7ce000aet3aa6af5a1315b9d3@mail.gmail.com>, "Juanma Barranquero" <lekktu@gmail.com> writes:

> So, for example, C-u M-x describe-char does not describe composition
> of characters now.

I've just installed a fix.

---
Kenichi Handa
handa@ni.aist.go.jp




* Re: find-composition still depends on the composition property
  2008-09-05  1:24 ` Kenichi Handa
@ 2008-10-19 23:15   ` Juri Linkov
  2008-10-20  6:46     ` Kenichi Handa
  0 siblings, 1 reply; 24+ messages in thread
From: Juri Linkov @ 2008-10-19 23:15 UTC (permalink / raw)
  To: Kenichi Handa; +Cc: Juanma Barranquero, emacs-devel

>> So, for example, C-u M-x describe-char does not describe composition
>> of characters now.
>
> I've just installed a fix.

What do you think about displaying the Unicode information of the
combining character too?

Currently there is no easy way to display the information (name, category)
about the combining character because the user can't put point on the
combining character and type `C-u C-x ='.

-- 
Juri Linkov
http://www.jurta.org/emacs/

* Re: find-composition still depends on the composition property
  2008-10-19 23:15   ` Juri Linkov
@ 2008-10-20  6:46     ` Kenichi Handa
  2008-10-21 23:46       ` Juri Linkov
  0 siblings, 1 reply; 24+ messages in thread
From: Kenichi Handa @ 2008-10-20  6:46 UTC (permalink / raw)
  To: Juri Linkov; +Cc: lekktu, emacs-devel

In article <87tzbh7kd9.fsf@jurta.org>, Juri Linkov <juri@jurta.org> writes:

> What do you think about displaying the Unicode information of the
> combining character too?

> Currently there is no easy way to display the information (name, category)
> about the combining character because the user can't put point on the
> combining character and type `C-u C-x ='.

When you type C-u C-x = on a composed character, Emacs
displays how it is composed.  For instance, when a buffer
contains the sequence "a" "U+300" "U+316" composed into one
glyph, and you type C-u C-x = on it, Emacs shows something
like this:

[...]
Composed with the following character(s) "̖́" using this font:
                                         ^^^^
  xft:-Misc-Fixed-normal-normal-normal-*-20-*-*-*-c-*-iso10646-1
by these glyphs:
[...]

As each character occupies one column in the part "̖́", you can
easily put the cursor on any of the characters and type C-u C-x =.

Isn't it good enough?
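
For the per-character details asked about (name, category), here is a
minimal sketch in Python, outside Emacs, that lists the code points of
the example sequence "a" U+0300 U+0316.  It uses only the standard
unicodedata module; the helper name describe_sequence is made up for
illustration and is not an Emacs function.

```python
import unicodedata

def describe_sequence(text):
    """Return (code point, name, general category) for each character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"), unicodedata.category(ch))
        for ch in text
    ]

# "a" followed by the two combining marks from the example above.
for row in describe_sequence("a\u0300\u0316"):
    print(row)
```

Both combining marks come out with general category Mn (nonspacing
mark), which is the kind of information C-u C-x = reports when point is
on one of them in the *Help* buffer.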

---
Kenichi Handa
handa@ni.aist.go.jp

* Re: find-composition still depends on the composition property
  2008-10-20  6:46     ` Kenichi Handa
@ 2008-10-21 23:46       ` Juri Linkov
  2008-10-22  1:17         ` Kenichi Handa
  0 siblings, 1 reply; 24+ messages in thread
From: Juri Linkov @ 2008-10-21 23:46 UTC (permalink / raw)
  To: Kenichi Handa; +Cc: emacs-devel

> As each character occupies one column in the part "̖́", you can
> easily put the cursor on any of the characters and type C-u C-x =.
>
> Isn't it good enough?

Thanks, I see what you mean.  Since looking at the information about
combining characters is a rare need, I believe typing `C-u C-x =' twice
(on the whole character and later on the combining character) is good enough.

I also have another question about composition that seems like a bug.
Please open the HELLO file and put point at the end of the line with the
Russian sample text.  Now type M-b `backward-word' and see that point
stops inside the whole word on the composed character, i.e. the composed
character is not word-constituent now.

-- 
Juri Linkov
http://www.jurta.org/emacs/

* Re: find-composition still depends on the composition property
  2008-10-21 23:46       ` Juri Linkov
@ 2008-10-22  1:17         ` Kenichi Handa
  2008-10-22  4:25           ` Eli Zaretskii
  2008-10-22  5:29           ` Kenichi Handa
  0 siblings, 2 replies; 24+ messages in thread
From: Kenichi Handa @ 2008-10-22  1:17 UTC (permalink / raw)
  To: Juri Linkov; +Cc: emacs-devel

In article <87tzb5ikrw.fsf@jurta.org>, Juri Linkov <juri@jurta.org> writes:

> I also have another question about composition that seems like a bug.
> Please open the HELLO file and put point at the end of the line with the
> Russian sample text.  Now type M-b `backward-word' and see that point
> stops inside the whole word on the composed character, i.e. the composed
> character is not word-constituent now.

Ah, it's not a bug in composition, but a bug in scan_words
(syntax.c).  Currently U+301 is labeled as `latin' script,
and the surrounding characters there are `cyrillic' script.
Thus, that function thinks that there's a word boundary.
I'll find a way to solve this problem.
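
The effect can be reproduced outside Emacs with a rough Python sketch.
It is not the scan_words code and not necessarily the fix that gets
installed: script_of guesses a script from the first word of the Unicode
character name, and the marks_inherit flag is a hypothetical remedy in
which a combining mark takes the script of the character before it.

```python
import unicodedata

def script_of(ch):
    """Crude script guess from the first word of the Unicode character name."""
    return unicodedata.name(ch, "UNKNOWN").split()[0]

def boundaries(text, marks_inherit=False):
    """Return the indices where the guessed script changes between neighbours."""
    scripts = []
    for ch in text:
        if marks_inherit and scripts and unicodedata.combining(ch):
            scripts.append(scripts[-1])   # a combining mark follows its base
        else:
            scripts.append(script_of(ch))
    return [i for i in range(1, len(text)) if scripts[i] != scripts[i - 1]]

word = "при\u0301вет"  # Cyrillic letters with U+0301 COMBINING ACUTE ACCENT
print(boundaries(word))                      # [3, 4]
print(boundaries(word, marks_inherit=True))  # []
```

Without the flag the accent is classified differently from its
neighbours, so a boundary appears on each side of it, which is what M-b
was tripping over.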

---
Kenichi Handa
handa@ni.aist.go.jp




* Re: find-composition still depends on the composition property
  2008-10-22  1:17         ` Kenichi Handa
@ 2008-10-22  4:25           ` Eli Zaretskii
  2008-10-22  5:43             ` Kenichi Handa
  2008-10-22  5:29           ` Kenichi Handa
  1 sibling, 1 reply; 24+ messages in thread
From: Eli Zaretskii @ 2008-10-22  4:25 UTC (permalink / raw)
  To: Kenichi Handa; +Cc: juri, emacs-devel

> From: Kenichi Handa <handa@m17n.org>
> Date: Wed, 22 Oct 2008 10:17:57 +0900
> Cc: emacs-devel@gnu.org
> 
> Ah, it's not a bug of composition, but a bug of scan_words
> (syntax.c).  Currently U+301 is labeled as `latin' script,
> and the surrounding characters there are `cyrillic' script.
> Thus, that function thinks that there's a word boundary.
> I'll find a way to solve this problem.

I think we will need a user option to control whether scan_words stops
on script boundaries, see the discussion started by Miles yesterday
about a similar problem.




* Re: find-composition still depends on the composition property
  2008-10-22  1:17         ` Kenichi Handa
  2008-10-22  4:25           ` Eli Zaretskii
@ 2008-10-22  5:29           ` Kenichi Handa
  2008-10-22 19:35             ` Eli Zaretskii
  1 sibling, 1 reply; 24+ messages in thread
From: Kenichi Handa @ 2008-10-22  5:29 UTC (permalink / raw)
  To: Kenichi Handa; +Cc: juri, emacs-devel

In article <E1KsSMH-0005T2-Ek@etlken.m17n.org>, Kenichi Handa <handa@m17n.org> writes:

> Ah, it's not a bug of composition, but a bug of scan_words
> (syntax.c).  Currently U+301 is labeled as `latin' script,
> and the surrounding characters there are `cyrillic' script.
> Thus, that function thinks that there's a word boundary.
> I'll find a way to solve this problem.

I've just installed a fix.

---
Kenichi Handa
handa@ni.aist.go.jp




* Re: find-composition still depends on the composition property
  2008-10-22  4:25           ` Eli Zaretskii
@ 2008-10-22  5:43             ` Kenichi Handa
  0 siblings, 0 replies; 24+ messages in thread
From: Kenichi Handa @ 2008-10-22  5:43 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: juri, emacs-devel

In article <ubpxdfe7m.fsf@gnu.org>, Eli Zaretskii <eliz@gnu.org> writes:

> > Ah, it's not a bug of composition, but a bug of scan_words
> > (syntax.c).  Currently U+301 is labeled as `latin' script,
> > and the surrounding characters there are `cyrillic' script.
> > Thus, that function thinks that there's a word boundary.
> > I'll find a way to solve this problem.

> I think we will need a user option to control whether scan_words stops
> on script boundaries, see the discussion started by Miles yesterday
> about a similar problem.

I've just noticed that thread, and am now reading it.

---
Kenichi Handa
handa@ni.aist.go.jp




* Re: find-composition still depends on the composition property
  2008-10-22  5:29           ` Kenichi Handa
@ 2008-10-22 19:35             ` Eli Zaretskii
  2008-10-23  1:18               ` Kenichi Handa
  0 siblings, 1 reply; 24+ messages in thread
From: Eli Zaretskii @ 2008-10-22 19:35 UTC (permalink / raw)
  To: Kenichi Handa; +Cc: juri, emacs-devel

> From: Kenichi Handa <handa@m17n.org>
> Date: Wed, 22 Oct 2008 14:29:47 +0900
> Cc: juri@jurta.org, emacs-devel@gnu.org
> 
> In article <E1KsSMH-0005T2-Ek@etlken.m17n.org>, Kenichi Handa <handa@m17n.org> writes:
> 
> > Ah, it's not a bug of composition, but a bug of scan_words
> > (syntax.c).  Currently U+301 is labeled as `latin' script,
> > and the surrounding characters there are `cyrillic' script.
> > > Thus, that function thinks that there's a word boundary.
> > I'll find a way to solve this problem.
> 
> I've just installed a fix.

Thanks, but Emacs still does not get this quite right.  For example,
in the following line:

  אבגדה12345

Which mixes Hebrew letters with digits, M-f stops at the first digit,
whereas in this line:

  abcde12345

it does not.  The latter behavior is correct, the former is not.  (I'm
ashamed to admit that even MS Word gets it right.)

I understand that the way for fixing this would be to install more
entries in word-combining-categories, but more infrastructure seems to
be missing, since right now no characters have the "Hebrew" category,
for example (at least judging by the output of describe-categories).

By the way, I'd suggest to move the legend generated by
describe-categories to the beginning of the buffer, because the buffer
is huge and it does not say anywhere at the beginning that there's a
legend at the end.  Without the legend, the buffer looks like a large
pile of gibberish.

And another wish: can we have word-combining-categories and
word-separating-categories display their elements with human-readable
letters, not as their ASCII codes?  (Quick: what letter is code 94?)

* Re: find-composition still depends on the composition property
  2008-10-22 19:35             ` Eli Zaretskii
@ 2008-10-23  1:18               ` Kenichi Handa
  2008-10-23 23:44                 ` describe-categories (was: find-composition still depends on the composition property) Juri Linkov
  2008-10-23 23:48                 ` Word boundary " Juri Linkov
  0 siblings, 2 replies; 24+ messages in thread
From: Kenichi Handa @ 2008-10-23  1:18 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: juri, emacs-devel

In article <uwsg0e837.fsf@gnu.org>, Eli Zaretskii <eliz@gnu.org> writes:

> Thanks, but Emacs still does not get this quite right.  For example,
> in the following line:

>   אבגדה12345

> Which mixes Hebrew letters with digits, M-f stops at the first digit,
> whereas in this line:

>   abcde12345

> it does not.  The latter behavior is correct, the former is not.  (I'm
> ashamed to admit that even MS Word gets it right.)

> I understand that the way for fixing this would be to install more
> entries in word-combining-categories, but more infrastructure seems to
> be missing, since right now no characters have the "Hebrew" category,
> for example (at least judging by the output of describe-categories).

Then what we should do is:

(1-1) assign the category "6" (digit) to "0123456789".
(1-2) define a category, say "D", and assign it to all
characters that have no word-boundary between digits.
(1-3) add (?D . ?6) and (?6 . ?D) to word-combining-categories.

Another way is:

(2-1) modify word_boundary_p to handle negative category mnemonic in
word-*-categories to catch a character that doesn't have the
specified category.
(2-2) assign the category "6" (digit) to "0123456789".
(2-3) define a category, say "X", and assign it to all
characters that have word-boundary between digits.
(2-4) add ((- ?X) . ?6) and (?6 . (- ?X)) to
word-combining-categories.

Or,

(3-1) Make a `common' script and classify digits, etc. into it.
(3-2) modify word_boundary_p not to distinguish `common' from
any other script.
(3-3) define a category, say "X", and assign it to all
characters that have word-boundary between digits.
(3-4) add (?X . ?6) and (?6 . ?X) to
word-separating-categories.
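
To make the first alternative concrete, here is a toy model in Python.
It is only a sketch under assumptions: the category sets, the script
guess, and the Python function word_boundary_p are stand-ins written for
this example, not the C function of the same name.  A boundary is
reported when two adjacent characters belong to different scripts,
unless a pair of their categories is listed as combining.

```python
import unicodedata

def script_of(ch):
    """Crude script guess: ASCII counts as LATIN, else first word of the name."""
    if ord(ch) < 128:
        return "LATIN"
    return unicodedata.name(ch, "UNKNOWN").split()[0]

def categories(ch):
    """Toy category sets: "6" for ASCII digits, "D" for Hebrew letters."""
    cats = set()
    if ch in "0123456789":
        cats.add("6")          # step (1-1)
    if script_of(ch) == "HEBREW":
        cats.add("D")          # step (1-2), limited to Hebrew for the example
    return cats

# Step (1-3): pairs of categories between which no boundary is reported.
WORD_COMBINING = {("D", "6"), ("6", "D")}

def word_boundary_p(c1, c2, combining=WORD_COMBINING):
    """True if a word boundary falls between adjacent characters c1 and c2."""
    if script_of(c1) == script_of(c2):
        return False
    for a in categories(c1):
        for b in categories(c2):
            if (a, b) in combining:
                return False
    return True

he = "\u05d4"  # HEBREW LETTER HE
print(word_boundary_p(he, "1", combining=set()))  # True: scripts differ
print(word_boundary_p(he, "1"))                   # False: the D/6 pair matches
print(word_boundary_p("e", "1"))                  # False: same script
```

The second and third alternatives differ mainly in which side carries
the exception list, so the same skeleton applies with the table test
inverted.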

> By the way, I'd suggest to move the legend generated by
> describe-categories to the beginning of the buffer, because the buffer
> is huge and it does not say anywhere at the beginning that there's a
> legend at the end.  Without the legend, the buffer looks like a large
> pile of gibberish.

The legend is longer than 40 lines.  If we put that at the
head, it will occupy the whole first page, which I think is
not that good.  Saying something like "See the end of the
buffer for the legend." with "legend" clickable at the first
line will be good.  What do you think?

> And another wish: can we have word-combining-categories and
> word-separating-categories display their elements with human-readable
> letters, not as their ASCII codes?  (Quick: what letter is code 94?)

How about modifying word_boundary_p to accept a mnemonic
string (instead of a mnemonic character) in those variables?
Then we can specify multiple categories in the string to
catch a character that has any one of them.
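
A small Python sketch of how such string entries could be matched.  The
names matches and combines and the example table are hypothetical; they
only illustrate the "has any one of them" rule described above.

```python
def matches(char_categories, mnemonics):
    """True if the character has at least one category named in `mnemonics`."""
    return any(m in char_categories for m in mnemonics)

def combines(cats1, cats2, table):
    """True if some (left, right) entry in `table` matches both characters."""
    return any(matches(cats1, left) and matches(cats2, right)
               for left, right in table)

# Hypothetical table: one entry covers several categories at once.
table = [("DX", "6"), ("6", "DX")]
print(combines({"D"}, {"6"}, table))  # True
print(combines({"X"}, {"6"}, table))  # True
print(combines({"k"}, {"6"}, table))  # False

# Entries shown as character codes are hard to read; chr() recovers the letter.
print(chr(94))  # ^
```

A one-character string behaves exactly like a single mnemonic, so
existing entries would keep their meaning under this scheme.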

---
Kenichi Handa
handa@ni.aist.go.jp

* describe-categories (was: find-composition still depends on the composition property)
  2008-10-23  1:18               ` Kenichi Handa
@ 2008-10-23 23:44                 ` Juri Linkov
  2008-10-25  1:37                   ` Kenichi Handa
  2008-10-23 23:48                 ` Word boundary " Juri Linkov
  1 sibling, 1 reply; 24+ messages in thread
From: Juri Linkov @ 2008-10-23 23:44 UTC (permalink / raw)
  To: Kenichi Handa; +Cc: Eli Zaretskii, emacs-devel

>> By the way, I'd suggest to move the legend generated by
>> describe-categories to the beginning of the buffer, because the buffer
>> is huge and it does not say anywhere at the beginning that there's a
>> legend at the end.  Without the legend, the buffer looks like a large
>> pile of gibberish.
>
> The legend is longer than 40 lines.  If we put that at the
> head, it will occupy the whole first page, which I think is
> not that good.  Saying something like "See the end of the
> buffer for the legend." with "legend" clickable at the first
> line will be good.  What do you think?

The buffer is so long already (ca 27000 lines) that adding 40 lines
at the beginning doesn't make it worse.  Otherwise, it is not obvious
for the user that the legend is at the end, so it is easy to miss it.
A link to the legend will help, but it seems moving 40 lines to the
beginning is not a bad thing.

-- 
Juri Linkov
http://www.jurta.org/emacs/




* Word boundary (was: find-composition still depends on the composition property)
  2008-10-23  1:18               ` Kenichi Handa
  2008-10-23 23:44                 ` describe-categories (was: find-composition still depends on the composition property) Juri Linkov
@ 2008-10-23 23:48                 ` Juri Linkov
  2008-10-25 18:03                   ` Eli Zaretskii
  2008-10-26  8:15                   ` Word boundary (was: find-composition still depends on the composition property) Kenichi Handa
  1 sibling, 2 replies; 24+ messages in thread
From: Juri Linkov @ 2008-10-23 23:48 UTC (permalink / raw)
  To: Kenichi Handa; +Cc: Eli Zaretskii, emacs-devel

> Then what to do is:
>
> (1-1) assign the category "6" (digit) to "0123456789".
> (1-2) define a category, say "D", and assign it to all
> characters that have no word-boundary between digits.
> (1-3) add (?D . ?6) and (?6 . ?D) to word-combining-categories.
>
> Another way is:
>
> (2-1) modify word_boundary_p to handle negative category mnemonic in
> word-*-categories to catch a character that doesn't have the
> specified category.
> (2-2) assign the category "6" (digit) to "0123456789".
> (2-3) define a category, say "X", and assign it to all
> characters that have word-boundary between digits.
> (2-4) add ((- ?X) . ?6) and (?6 . (- ?X)) to
> word-combining-categories.
>
> Or,
>
> (3-1) Make `common' script and classify digits, etc to it.
> (3-2) modify word_boundary_p not to distinguish `common' from
> any other script.
> (3-3) define a category, say "X", and assign it to all
> characters that have word-boundary between digits.
> (3-4) add (?X . ?6) and (?6 . ?X) to
> word-separating-categories.

Do you know how many scripts require word boundaries between
letters and digits?  Does the Unicode standard specify this?

If the majority of scripts do not require word boundaries,
then we could define a new category only for the few exceptions,
and vice versa.

-- 
Juri Linkov
http://www.jurta.org/emacs/




* Re: describe-categories (was: find-composition still depends on the composition property)
  2008-10-23 23:44                 ` describe-categories (was: find-composition still depends on the composition property) Juri Linkov
@ 2008-10-25  1:37                   ` Kenichi Handa
  2008-10-25  8:33                     ` Eli Zaretskii
  0 siblings, 1 reply; 24+ messages in thread
From: Kenichi Handa @ 2008-10-25  1:37 UTC (permalink / raw)
  To: Juri Linkov; +Cc: eliz, emacs-devel

In article <87ej26rja6.fsf_-_@jurta.org>, Juri Linkov <juri@jurta.org> writes:

> The buffer is so long already (ca 27000 lines) that adding 40 lines
> at the beginning doesn't make it worse.  Otherwise, it is not obvious
> for the user that the legend is at the end, so it is easy to miss it.
> A link to the legend will help, but it seems moving 40 lines to the
> beginning is not a bad thing.

I changed all docstrings of categories to "TERSE TEXT\nFULL
LONG TEXT", and modified describe-categories to display
"TERSE TEXT"s at the head.  How about that?
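
The docstring convention is easy to picture with a short Python sketch.
The category table below is invented for the example (it is not the
real table), and split_docstring is a hypothetical helper.

```python
def split_docstring(doc):
    """Split a two-part docstring at its first newline into (terse, full)."""
    terse, _, full = doc.partition("\n")
    return terse, full

# Invented category table: mnemonic -> "TERSE TEXT\nFULL LONG TEXT".
docs = {
    "6": "digit\nExample long description of the digit category.",
    "l": "Latin\nExample long description of the Latin category.",
}

# Short legend for the head of the buffer: one line per category.
for mnemonic in sorted(docs):
    terse, _full = split_docstring(docs[mnemonic])
    print(f"{mnemonic}  {terse}")
```

A docstring with no newline is returned whole as the terse part, so
categories that have not been converted yet would still get a legend
line.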

---
Kenichi Handa
handa@ni.aist.go.jp




* Re: describe-categories (was: find-composition still depends on the composition property)
  2008-10-25  1:37                   ` Kenichi Handa
@ 2008-10-25  8:33                     ` Eli Zaretskii
  0 siblings, 0 replies; 24+ messages in thread
From: Eli Zaretskii @ 2008-10-25  8:33 UTC (permalink / raw)
  To: Kenichi Handa; +Cc: juri, emacs-devel

> From: Kenichi Handa <handa@m17n.org>
> CC: eliz@gnu.org, emacs-devel@gnu.org
> Date: Sat, 25 Oct 2008 10:37:53 +0900
> 
> In article <87ej26rja6.fsf_-_@jurta.org>, Juri Linkov <juri@jurta.org> writes:
> 
> > The buffer is so long already (ca 27000 lines) that adding 40 lines
> > at the beginning doesn't make it worse.  Otherwise, it is not obvious
> > for the user that the legend is at the end, so it is easy to miss it.
> > A link to the legend will help, but it seems moving 40 lines to the
> > beginning is not a bad thing.
> 
> I changed all docstrings of categories to "TERSE TEXT\nFULL
> LONG TEXT", and modified describe-categories to display
> "TERSE TEXT"s at the head.  How about that?

Much more reader-friendly, thank you.




* Re: Word boundary (was: find-composition still depends on the composition property)
  2008-10-23 23:48                 ` Word boundary " Juri Linkov
@ 2008-10-25 18:03                   ` Eli Zaretskii
  2008-10-26 13:36                     ` Kenichi Handa
  2008-10-26  8:15                   ` Word boundary (was: find-composition still depends on the composition property) Kenichi Handa
  1 sibling, 1 reply; 24+ messages in thread
From: Eli Zaretskii @ 2008-10-25 18:03 UTC (permalink / raw)
  To: Juri Linkov; +Cc: emacs-devel, handa

> From: Juri Linkov <juri@jurta.org>
> Cc: Eli Zaretskii <eliz@gnu.org>,  emacs-devel@gnu.org
> Date: Fri, 24 Oct 2008 02:48:44 +0300
> 
> Do you know how many scripts require word boundaries between
> letters and digits?  Does the Unicode standard specify this?

Unless I'm missing something important, my reading of UAX #29
(http://www.unicode.org/reports/tr29/tr29-13.html) is that almost all
scripts should _not_ have word breaks between letters and digits.  And
neither should we define a word break on script boundaries, in most
cases.

So it sounds like our default behavior is simply wrong.




* Re: Word boundary (was: find-composition still depends on the composition property)
  2008-10-23 23:48                 ` Word boundary " Juri Linkov
  2008-10-25 18:03                   ` Eli Zaretskii
@ 2008-10-26  8:15                   ` Kenichi Handa
  1 sibling, 0 replies; 24+ messages in thread
From: Kenichi Handa @ 2008-10-26  8:15 UTC (permalink / raw)
  To: Juri Linkov; +Cc: eliz, emacs-devel

In article <87mygusydi.fsf_-_@jurta.org>, Juri Linkov <juri@jurta.org> writes:

> Do you know how many scripts require word boundaries between
> letters and digits?

I don't know the number, but I think it is convenient for
Japanese users if there's a word boundary between han,
katakana, hiragana and digits.

> Does the Unicode standard specify this?

As I understand it, it also puts a word boundary between them.

> If the majority of scripts do not require word boundaries,
> then we could define a new category only for the few exceptions,
> and vice versa.

I'll try it.
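
As a rough picture of the "category only for the exceptions" idea, here
is a Python sketch.  It is an illustration under assumptions, not the
change that gets installed: the exception set simply lists the three
scripts mentioned above, and script_of guesses a script from the first
word of the Unicode character name (so Han shows up as "CJK").

```python
import unicodedata

# Assumed exception list: scripts that keep a word boundary next to digits.
BREAK_BEFORE_DIGITS = {"CJK", "HIRAGANA", "KATAKANA"}

def script_of(ch):
    """Crude script guess from the first word of the Unicode character name."""
    return unicodedata.name(ch, "UNKNOWN").split()[0]

def boundary_with_digit(letter):
    """True if a word boundary should separate `letter` from an adjacent digit."""
    return script_of(letter) in BREAK_BEFORE_DIGITS

# Latin a, Hebrew alef, a Han character, Hiragana a, Katakana a.
for ch in ["a", "\u05d0", "\u6f22", "\u3042", "\u30a2"]:
    print(f"U+{ord(ch):04X}", script_of(ch), boundary_with_digit(ch))
```

Every script not in the set falls through to "no boundary", so only the
exceptions need a category of their own.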

---
Kenichi Handa
handa@ni.aist.go.jp




* Re: Word boundary (was: find-composition still depends on the composition property)
  2008-10-25 18:03                   ` Eli Zaretskii
@ 2008-10-26 13:36                     ` Kenichi Handa
  2008-10-26 19:32                       ` Eli Zaretskii
  0 siblings, 1 reply; 24+ messages in thread
From: Kenichi Handa @ 2008-10-26 13:36 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: juri, emacs-devel

In article <uskqka6xq.fsf@gnu.org>, Eli Zaretskii <eliz@gnu.org> writes:

> Unless I'm missing something important, my reading of UAX #29
> (http://www.unicode.org/reports/tr29/tr29-13.html) is that almost all
> scripts should _not_ have word breaks between letters and digits.  And
> neither should we define a word break on script boundaries, in most
> cases.

Although it says "Do not break between most letters. ALetter
x ALetter", ALetter doesn't include Han, Katakana, and
Hiragana.

And, it also has this note:

Normally word breaking does not require breaking between
different scripts. However, adding that capability may be
useful in combination with other extensions of word
segmentation. For example, ...

---
Kenichi Handa
handa@ni.aist.go.jp

* Re: Word boundary (was: find-composition still depends on the composition property)
  2008-10-26 13:36                     ` Kenichi Handa
@ 2008-10-26 19:32                       ` Eli Zaretskii
  2008-10-27  0:17                         ` Word boundary Miles Bader
  0 siblings, 1 reply; 24+ messages in thread
From: Eli Zaretskii @ 2008-10-26 19:32 UTC (permalink / raw)
  To: Kenichi Handa; +Cc: juri, emacs-devel

> From: Kenichi Handa <handa@m17n.org>
> CC: juri@jurta.org, emacs-devel@gnu.org
> Date: Sun, 26 Oct 2008 22:36:05 +0900
> 
> In article <uskqka6xq.fsf@gnu.org>, Eli Zaretskii <eliz@gnu.org> writes:
> 
> > Unless I'm missing something important, my reading of UAX #29
> > (http://www.unicode.org/reports/tr29/tr29-13.html) is that almost all
> > scripts should _not_ have word breaks between letters and digits.  And
> > neither should we define a word break on script boundaries, in most
> > cases.
> 
> Although it says "Do not break between most letters. ALetter
> x ALetter", ALetter doesn't include Han, Katakana, and
> Hiragana.

Yes, that's why I said "in most cases".

> And, it also has this note:
> 
> Normally word breaking does not require breaking between
> different scripts. However, adding that capability may be
> useful in combination with other extensions of word
> segmentation. For example, ...

So maybe we should have a user option to enable that, but I think it
should be off by default.
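
What such an option would change for word motion can be sketched in
Python.  This is a toy forward-word, not Emacs code: break_on_script is
a hypothetical flag standing in for the proposed user option (off by
default here), and script_of treats all ASCII as one script so that
ASCII digits behave like Latin letters.

```python
import unicodedata

def script_of(ch):
    """Crude script guess: ASCII counts as LATIN, else first word of the name."""
    if ord(ch) < 128:
        return "LATIN"
    return unicodedata.name(ch, "UNKNOWN").split()[0]

def forward_word(text, pos, break_on_script=False):
    """Return the position just past the next word at or after `pos`."""
    n = len(text)
    while pos < n and not text[pos].isalnum():   # skip non-word characters
        pos += 1
    while pos < n and text[pos].isalnum():       # consume the word
        pos += 1
        if (break_on_script and pos < n and text[pos].isalnum()
                and script_of(text[pos]) != script_of(text[pos - 1])):
            break                                # stop at a script change
    return pos

hebrew = "\u05d0\u05d1\u05d2\u05d3\u05d412345"   # five Hebrew letters, then digits
print(forward_word(hebrew, 0))                              # 10
print(forward_word(hebrew, 0, break_on_script=True))        # 5
print(forward_word("abcde12345", 0, break_on_script=True))  # 10
```

With the flag on, the Hebrew line stops at the first digit, as in the
behavior reported earlier in the thread; with it off, both lines are
treated as a single word.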




* Re: Word boundary
  2008-10-26 19:32                       ` Eli Zaretskii
@ 2008-10-27  0:17                         ` Miles Bader
  2008-10-27  0:27                           ` Kenichi Handa
  0 siblings, 1 reply; 24+ messages in thread
From: Miles Bader @ 2008-10-27  0:17 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: juri, emacs-devel, Kenichi Handa

Eli Zaretskii <eliz@gnu.org> writes:
>> Normally word breaking does not require breaking between
>> different scripts. However, adding that capability may be
>> useful in combination with other extensions of word
>> segmentation. For example, ...
>
> So maybe we should have a user option to enable that, but I think it
> should be off by default.

What would the practical effect of that be?  Would it break filling in Japanese?

-Miles

-- 
What the fuck do white people have to be blue about!?  Banana Republic ran
out of Khakis?  The Espresso Machine is jammed?  Hootie and The Blowfish
are breaking up??!  Shit, white people oughtta understand, their job is to
GIVE people the blues, not to get them!  -- George Carlin




* Re: Word boundary
  2008-10-27  0:17                         ` Word boundary Miles Bader
@ 2008-10-27  0:27                           ` Kenichi Handa
  2008-10-27  4:12                             ` Eli Zaretskii
  2008-10-27  5:16                             ` Miles Bader
  0 siblings, 2 replies; 24+ messages in thread
From: Kenichi Handa @ 2008-10-27  0:27 UTC (permalink / raw)
  To: Miles Bader; +Cc: juri, eliz, emacs-devel

In article <87bpx66gdx.fsf@catnip.gol.com>, Miles Bader <miles@gnu.org> writes:

> Eli Zaretskii <eliz@gnu.org> writes:
>>> Normally word breaking does not require breaking between
>>> different scripts. However, adding that capability may be
>>> useful in combination with other extensions of word
>>> segmentation. For example, ...
> >
> > So maybe we should have a user option to enable that, but I think it
> > should be off by default.

> What would the practical effect of that be?  Would it break filling in Japanese?

I think no.  Filling is related to line-breaking, and it
should be treated differently from word-breaking.

---
Kenichi Handa
handa@ni.aist.go.jp




* Re: Word boundary
  2008-10-27  0:27                           ` Kenichi Handa
@ 2008-10-27  4:12                             ` Eli Zaretskii
  2008-10-27  5:16                             ` Miles Bader
  1 sibling, 0 replies; 24+ messages in thread
From: Eli Zaretskii @ 2008-10-27  4:12 UTC (permalink / raw)
  To: Kenichi Handa; +Cc: juri, emacs-devel, miles

> From: Kenichi Handa <handa@m17n.org>
> CC: eliz@gnu.org, juri@jurta.org, emacs-devel@gnu.org
> Date: Mon, 27 Oct 2008 09:27:53 +0900
> 
> In article <87bpx66gdx.fsf@catnip.gol.com>, Miles Bader <miles@gnu.org> writes:
> 
> > Eli Zaretskii <eliz@gnu.org> writes:
> >>> Normally word breaking does not require breaking between
> >>> different scripts. However, adding that capability may be
> >>> useful in combination with other extensions of word
> >>> segmentation. For example, ...
> > >
> > > So maybe we should have a user option to enable that, but I think it
> > > should be off by default.
> 
> > What would the practical effect of that be?  Would it break filling in Japanese?
> 
> I think no.  Filling is related to line-breaking, and it
> should be treated differently from word-breaking.

Right.  In addition, I think Japanese is already pretty well covered
by the advice in UAX#29 anyway, as it has specific rules for handling
Katakana and Hiragana.




* Re: Word boundary
  2008-10-27  0:27                           ` Kenichi Handa
  2008-10-27  4:12                             ` Eli Zaretskii
@ 2008-10-27  5:16                             ` Miles Bader
  2008-10-31  5:50                               ` Kenichi Handa
  1 sibling, 1 reply; 24+ messages in thread
From: Miles Bader @ 2008-10-27  5:16 UTC (permalink / raw)
  To: Kenichi Handa; +Cc: juri, eliz, emacs-devel

Kenichi Handa <handa@m17n.org> writes:
>> What would the practical effect of that be?  Would it break filling in Japanese?
>
> I think no.  Filling is related to line-breaking, and it
> should be treated differently from word-breaking.

Er, for "filling", read "word movement".

-Miles

-- 
Success, n. The one unpardonable sin against one's fellows.




* Re: Word boundary
  2008-10-27  5:16                             ` Miles Bader
@ 2008-10-31  5:50                               ` Kenichi Handa
  0 siblings, 0 replies; 24+ messages in thread
From: Kenichi Handa @ 2008-10-31  5:50 UTC (permalink / raw)
  To: Miles Bader; +Cc: juri, eliz, emacs-devel

In article <buowsfu7h3i.fsf@dhapc248.dev.necel.com>, Miles Bader <miles@gnu.org> writes:

> Kenichi Handa <handa@m17n.org> writes:
>>> What would the practical effect of that be?  Would it break filling in Japanese?
> >
> > I think no.  Filling is related to line-breaking, and it
> > should be treated differently from word-breaking.

> Er, for "filling", read "word movement".

If "that" in "the practical effect of that" means adding a
user option to turn word-breaking at script boundaries on
and off, then of course M-f (forward-word) and M-b
(backward-word) will be affected by it.

---
Kenichi Handa
handa@ni.aist.go.jp



