Word syntax question

* Word syntax question
@ 2008-10-21 15:20 Miles Bader
  2008-10-21 15:52 ` Andreas Schwab
  0 siblings, 1 reply; 15+ messages in thread
From: Miles Bader @ 2008-10-21 15:20 UTC (permalink / raw)
  To: emacs-devel

In the following word, all characters have syntax "w" (and have unicode
category "Ll"), but forward-word doesn't jump over the whole thing, it
stops twice.  Anybody know why this happens?

   ʇsǝʇ

Starting with point on the first "ʇ", the first forward-word stops with
point on the "s", then another forward-word stops with point on the 2nd
"ʇ", then finally another will jump to the end of the word.

I know that the non-"s" characters are kind of weird, but based on the
syntax, it seems like emacs should be treating them like any other
letter.

[If I do `C-u C-x =' on the various characters, I notice there's a
"category:" field.  All of the characters have category "l" ("Latin"),
and "s" additionally has category "a" (ASCII), and "r".]

Thanks,

-Miles

-- 
Christian, n. One who follows the teachings of Christ so long as they are not
inconsistent with a life of sin.

* Re: Word syntax question
  2008-10-21 15:20 Word syntax question Miles Bader
@ 2008-10-21 15:52 ` Andreas Schwab
  2008-10-21 16:35   ` Miles Bader
  0 siblings, 1 reply; 15+ messages in thread
From: Andreas Schwab @ 2008-10-21 15:52 UTC (permalink / raw)
  To: Miles Bader; +Cc: emacs-devel

Miles Bader <miles@gnu.org> writes:

> In the following word, all characters have syntax "w" (and have unicode
> category "Ll"), but forward-word doesn't jump over the whole thing, it
> stops twice.  Anybody know why this happens?
>
>    ʇsǝʇ

See char-script-table, forward-word also stops at a script boundary.

<http://repo.or.cz/w/emacs.git?a=commit;h=e2e4957eb54a92d06b57828ec1e0df1c056321a8>

Andreas.

-- 
Andreas Schwab, SuSE Labs, schwab@suse.de
SuSE Linux Products GmbH, Maxfeldstraße 5, 90409 Nürnberg, Germany
PGP key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."
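
The behavior described above can be sketched in a few lines. This is a
minimal illustration in Python, not the Emacs implementation: the script
table, its ranges, and the function names are assumptions made for the
example, and the code points chosen for the sample word are a guess at
the characters in the original report.

```python
# Hypothetical script table: inclusive code point ranges mapped to a script
# name. Ranges and names are assumptions for this sketch, not data copied
# from Emacs's char-script-table.
SCRIPT_RANGES = [
    (0x0041, 0x005A, "latin"),     # A-Z
    (0x0061, 0x007A, "latin"),     # a-z
    (0x0180, 0x024F, "latin"),     # Latin Extended-B
    (0x0250, 0x02AF, "phonetic"),  # IPA Extensions
]


def script_of(ch):
    """Return the script name for ch, or None if it is not in the table."""
    cp = ord(ch)
    for low, high, name in SCRIPT_RANGES:
        if low <= cp <= high:
            return name
    return None


def forward_word(text, pos):
    """Return the position where a forward word motion from pos would stop.

    Characters without a script are treated as non-word characters, and a
    change of script ends the word even when both sides are letters.
    """
    n = len(text)
    # Skip anything that is not a word character.
    while pos < n and script_of(text[pos]) is None:
        pos += 1
    if pos == n:
        return pos
    # Consume characters for as long as the script stays the same.
    current = script_of(text[pos])
    while pos < n and script_of(text[pos]) == current:
        pos += 1
    return pos


# Assumed code points for the reported word: U+0287, "s", U+01DD, U+0287.
word = "\u0287s\u01dd\u0287"
stops = []
pos = 0
while pos < len(word):
    pos = forward_word(word, pos)
    stops.append(pos)
print(stops)  # [1, 3, 4]
```

Under these assumptions the motion stops on the "s", then on the second
U+0287, then at the end of the word, which matches the report.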




* Re: Word syntax question
  2008-10-21 15:52 ` Andreas Schwab
@ 2008-10-21 16:35   ` Miles Bader
  2008-10-21 17:21     ` Eli Zaretskii
  0 siblings, 1 reply; 15+ messages in thread
From: Miles Bader @ 2008-10-21 16:35 UTC (permalink / raw)
  To: Andreas Schwab; +Cc: emacs-devel

Andreas Schwab <schwab@suse.de> writes:
>> In the following word, all characters have syntax "w" (and have unicode
>> category "Ll"), but forward-word doesn't jump over the whole thing, it
>> stops twice.  Anybody know why this happens?
>>
>>    ʇsǝʇ
>
> See char-script-table, forward-word also stops at a script boundary.

That seems kind of broken in this case -- it's quite common for
"phonetic" characters to be intermixed in a word with latin characters,
and certainly nobody thinks of those boundaries as being word
boundaries.

What else is the "script" info used for?

[Of course it would also be good if `C-u C-x =' mentioned the script of
a character.]

-Miles

-- 
Do not taunt Happy Fun Ball.




* Re: Word syntax question
  2008-10-21 16:35   ` Miles Bader
@ 2008-10-21 17:21     ` Eli Zaretskii
  2008-10-22  0:58       ` Miles Bader
  2008-10-22  6:20       ` Richard M. Stallman
  0 siblings, 2 replies; 15+ messages in thread
From: Eli Zaretskii @ 2008-10-21 17:21 UTC (permalink / raw)
  To: Miles Bader; +Cc: schwab, emacs-devel

> From: Miles Bader <miles@gnu.org>
> Date: Wed, 22 Oct 2008 01:35:02 +0900
> Cc: emacs-devel@gnu.org
> 
> Andreas Schwab <schwab@suse.de> writes:
> > See char-script-table, forward-word also stops at a script boundary.
> 
> That seems kind of broken in this case -- it's quite common for
> "phonetic" characters to be intermixed in a word with latin characters,
> and certainly nobody thinks of those boundaries as being word
> boundaries.

I agree.  I think we should introduce a user option to control whether
it stops on script boundaries or not, because sometimes it makes
sense, sometimes it doesn't.




* Re: Word syntax question
  2008-10-21 17:21     ` Eli Zaretskii
@ 2008-10-22  0:58       ` Miles Bader
  2008-10-22  3:11         ` Stephen J. Turnbull
  2008-10-22  4:29         ` Eli Zaretskii
  2008-10-22  6:20       ` Richard M. Stallman
  1 sibling, 2 replies; 15+ messages in thread
From: Miles Bader @ 2008-10-22  0:58 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: schwab, emacs-devel

Eli Zaretskii <eliz@gnu.org> writes:
>> > See char-script-table, forward-word also stops at a script boundary.
>> 
>> That seems kind of broken in this case -- it's quite common for
>> "phonetic" characters to be intermixed in a word with latin characters,
>> and certainly nobody thinks of those boundaries as being word
>> boundaries.
>
> I agree.  I think we should introduce a user option to control whether
> it stops on script boundaries or not, because sometimes it makes
> sense, sometimes it doesn't.

But a global setting seems far too coarse, and in general, whether it's
"right" or not seems like it depends more on the precise mixture of
scripts rather than a user's personal preferences.

-Miles

-- 
Do not taunt Happy Fun Ball.




* Re: Word syntax question
  2008-10-22  0:58       ` Miles Bader
@ 2008-10-22  3:11         ` Stephen J. Turnbull
  2008-10-22 11:52           ` Kenichi Handa
  2008-10-22 12:23           ` Kenichi Handa
  2008-10-22  4:29         ` Eli Zaretskii
  1 sibling, 2 replies; 15+ messages in thread
From: Stephen J. Turnbull @ 2008-10-22  3:11 UTC (permalink / raw)
  To: Miles Bader; +Cc: schwab, Eli Zaretskii, emacs-devel

Miles Bader writes:
 > Eli Zaretskii <eliz@gnu.org> writes:
 > >> > See char-script-table, forward-word also stops at a script boundary.
 > >> 
 > >> That seems kind of broken in this case -- it's quite common for
 > >> "phonetic" characters to be intermixed in a word with latin characters,
 > >> and certainly nobody thinks of those boundaries as being word
 > >> boundaries.
 > >
 > > I agree.  I think we should introduce a user option to control whether
 > > it stops on script boundaries or not, because sometimes it makes
 > > sense, sometimes it doesn't.
 > 
 > But a global setting seems far too coarse, and in general, whether it's
 > "right" or not seems like it depends more on the precise mixture of
 > scripts rather than a user's personal preferences.

AFAIK Unicode has solved this problem, but I forget where I saw it.
If my memory is correct, that supports Miles's opinion.

In general, I think that if the scripts are for different human
languages, it's almost always the case that a script boundary is a
word boundary.  (But I'm biased, because I deal with that daily in
ordinary Japanese text, where that is the case.)  If one script is not
language-specific (IPA is really the only one I can think of), it's
not.  Note that for something like Japanese, which has three separate
scripts (hiragana, katakana, and kanji) that are separately
standardized (JIS X 0201 for katakana, and JIS X 0213 for the others),
this allowance for different scripts in one language already has to be made.

So it seems to me that an exceptional case for IPA (make it a member
of all language groups, or perhaps of those that use the Latin
alphabet?) should be sufficient.
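
One way to picture the exception proposed above is a small set of script
pairs that are allowed to share a word, with every other script change
still counting as a boundary. The sketch below is Python rather than
Emacs Lisp, and the table, the set JOINED_SCRIPTS, and the helper names
are all hypothetical; it only illustrates the idea.

```python
# Hypothetical script table (assumed ranges and names, for illustration).
SCRIPT_RANGES = [
    (0x0061, 0x007A, "latin"),     # a-z
    (0x0250, 0x02AF, "phonetic"),  # IPA Extensions
    (0x0370, 0x03FF, "greek"),     # Greek and Coptic
]

# Unordered script pairs that may share a word. Only the IPA exception is
# listed; every other change of script stays a word boundary.
JOINED_SCRIPTS = {frozenset({"latin", "phonetic"})}


def script_of(ch):
    """Return the script name for ch, or None if it is not in the table."""
    cp = ord(ch)
    for low, high, name in SCRIPT_RANGES:
        if low <= cp <= high:
            return name
    return None


def joins(a, b):
    """Return True if adjacent characters a and b belong to one word."""
    sa, sb = script_of(a), script_of(b)
    if sa is None or sb is None:
        return False
    return sa == sb or frozenset({sa, sb}) in JOINED_SCRIPTS


def word_end(text, pos):
    """Return the index just past the word starting at pos.

    Assumes text[pos] is a word character.
    """
    end = pos + 1
    while end < len(text) and joins(text[end - 1], text[end]):
        end += 1
    return end


print(word_end("t\u025bst", 0))        # Latin "t", IPA U+025B, Latin "st": 4
print(word_end("abc\u03b1\u03b2", 0))  # Latin followed by Greek: 3
```

Making the exception a set of pairs rather than a property of one script
leaves room for the narrower variant (IPA joins only Latin-alphabet
text) without touching the motion code.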

* Re: Word syntax question
  2008-10-22  0:58       ` Miles Bader
  2008-10-22  3:11         ` Stephen J. Turnbull
@ 2008-10-22  4:29         ` Eli Zaretskii
  2008-10-22  5:16           ` Miles Bader
  2008-10-22 21:02           ` Richard M. Stallman
  1 sibling, 2 replies; 15+ messages in thread
From: Eli Zaretskii @ 2008-10-22  4:29 UTC (permalink / raw)
  To: Miles Bader; +Cc: schwab, emacs-devel

> From: Miles Bader <miles@gnu.org>
> Cc: schwab@suse.de,  emacs-devel@gnu.org
> Date: Wed, 22 Oct 2008 09:58:54 +0900
> 
> > I agree.  I think we should introduce a user option to control whether
> > it stops on script boundaries or not, because sometimes it makes
> > sense, sometimes it doesn't.
> 
> But a global setting seems far too coarse, and in general, whether it's
> "right" or not seems like it depends more on the precise mixture of
> scripts rather than a user's personal preferences.

Not global, buffer-specific.  Whether stopping or not on script
boundaries depends on the specific mix of scripts in the buffer.

In addition, perhaps each word-move command should have a way of
overriding that.




* Re: Word syntax question
  2008-10-22  4:29         ` Eli Zaretskii
@ 2008-10-22  5:16           ` Miles Bader
  2008-10-22 19:36             ` Eli Zaretskii
  2008-10-22 21:02           ` Richard M. Stallman
  1 sibling, 1 reply; 15+ messages in thread
From: Miles Bader @ 2008-10-22  5:16 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: schwab, emacs-devel

Eli Zaretskii <eliz@gnu.org> writes:
>> But a global setting seems far too coarse, and in general, whether it's
>> "right" or not seems like it depends more on the precise mixture of
>> scripts rather than a user's personal preferences.
>
> Not global, buffer-specific.  Whether stopping or not on script
> boundaries depends on the specific mix of scripts in the buffer.
>
> In addition, perhaps each word-move command should have a way of
> overriding that.

No, I mean a setting which is only "yea or nay" is too coarse (as
opposed to one that targets specific cases). 

-Miles

-- 
Pray, v. To ask that the laws of the universe be annulled in behalf of a
single petitioner confessedly unworthy.




* Re: Word syntax question
  2008-10-21 17:21     ` Eli Zaretskii
  2008-10-22  0:58       ` Miles Bader
@ 2008-10-22  6:20       ` Richard M. Stallman
  2008-10-22 19:21         ` Eli Zaretskii
  1 sibling, 1 reply; 15+ messages in thread
From: Richard M. Stallman @ 2008-10-22  6:20 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: schwab, emacs-devel, miles

    I agree.  I think we should introduce a user option to control whether
    it stops on script boundaries or not, because sometimes it makes
    sense, sometimes it doesn't.

That is not a real solution.  The right thing to do is a function
of the case, not the user.  Making each user specify an option
according to which cases she typically encounters is not clean.

It seems that we need a way to specify which kinds of script
boundaries should be word boundaries, one designed to produce the
results that users generally want, and which could be set up inside
Emacs so that users don't have to change it.

* Re: Word syntax question
  2008-10-22  3:11         ` Stephen J. Turnbull
@ 2008-10-22 11:52           ` Kenichi Handa
  2008-10-22 12:23           ` Kenichi Handa
  1 sibling, 0 replies; 15+ messages in thread
From: Kenichi Handa @ 2008-10-22 11:52 UTC (permalink / raw)
  To: Stephen J. Turnbull; +Cc: schwab, eliz, emacs-devel, miles

In article <87bpxdtjc7.fsf@xemacs.org>, "Stephen J. Turnbull" <stephen@xemacs.org> writes:

> So it seems to me that an exceptional case for IPA (make it a member
> of all language groups, or perhaps of those that use the Latin
> alphabet?) should be sufficient.

I think classifying those phonetic characters in `phonetic'
script is wrong.  At least, Unicode says that most of them
are Latin script.   Jason, why did you install this change
for phonetic characters?  Was it to select a proper font for
those characters?

2008-04-01  Jason Rumney  <jasonr@gnu.org>

	* international/characters.el (script-list): Add phonetic script,
	covering IPA (previously Latin), Phonetic Extensions and
	Phonetic Extensions Supplement (both previously unassigned).

	* international/fontset.el (setup-default-fontset): Use unicode fonts
	that cover bopomofo script for bopomofo.
	Likewise for braille and mathematical.
	Use unicode scripts that cover the phonetic script for IPA.

---
Kenichi Handa
handa@ni.aist.go.jp




* Re: Word syntax question
  2008-10-22  3:11         ` Stephen J. Turnbull
  2008-10-22 11:52           ` Kenichi Handa
@ 2008-10-22 12:23           ` Kenichi Handa
  1 sibling, 0 replies; 15+ messages in thread
From: Kenichi Handa @ 2008-10-22 12:23 UTC (permalink / raw)
  To: Stephen J. Turnbull; +Cc: schwab, eliz, emacs-devel, miles

In article <87bpxdtjc7.fsf@xemacs.org>, "Stephen J. Turnbull" <stephen@xemacs.org> writes:

> AFAIK Unicode has solved this problem, but I forget where I saw it.
> If my memory is correct, that supports Miles's opinion.

It's "Unicode Standard Annex #29" (http://www.unicode.org/reports/tr29/).
It shows an algorithm to determine whether there's a word
boundary between characters C1 and C2 by categorizing characters
by the "Word_Break" property and giving a set of rules that check
that property.

Emacs already has a similar mechanism by using two variables
word-separating-categories and word-combining-categories.
Please read the docstring of the latter variable.

---
Kenichi Handa
handa@ni.aist.go.jp
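
A pairwise rule check of the kind described above might look like the
following. This is a Python sketch loosely modeled on the idea of a
separating list and a combining list of category pairs; the category
names, the sample characters, and the exact order of the checks are
assumptions, and the docstring of word-combining-categories remains the
authoritative description of what Emacs does.

```python
# Each character is described as (script, set of category names). The
# scripts and category names are invented for this sketch.
LATIN_S = ("latin", {"latin-letter", "ascii"})
IPA_T = ("phonetic", {"ipa"})
HIRAGANA_A = ("kana", {"hiragana"})
KATAKANA_A = ("kana", {"katakana"})
GREEK_ALPHA = ("greek", {"greek-letter"})

# Ordered (category of C1, category of C2) pairs.
SEPARATING = [("hiragana", "katakana")]  # split even within one script
COMBINING = [("latin-letter", "ipa"),    # join across a script change
             ("ipa", "latin-letter")]


def pair_matches(pairs, cats1, cats2):
    """Return True if some (a, b) in pairs has a in cats1 and b in cats2."""
    return any(a in cats1 and b in cats2 for a, b in pairs)


def is_word_boundary(c1, c2, separating=SEPARATING, combining=COMBINING):
    """Decide whether a word boundary falls between adjacent C1 and C2."""
    script1, cats1 = c1
    script2, cats2 = c2
    if script1 == script2:
        # Same script: one word unless a separating pair applies.
        return pair_matches(separating, cats1, cats2)
    # Different scripts: a boundary unless a combining pair applies.
    return not pair_matches(combining, cats1, cats2)


print(is_word_boundary(IPA_T, LATIN_S))            # False
print(is_word_boundary(LATIN_S, GREEK_ALPHA))      # True
print(is_word_boundary(HIRAGANA_A, KATAKANA_A))    # True
print(is_word_boundary(KATAKANA_A, HIRAGANA_A))    # False
```

Because the pairs are ordered, a rule can apply in one direction only,
as the last two calls show.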

* Re: Word syntax question
  2008-10-22  6:20       ` Richard M. Stallman
@ 2008-10-22 19:21         ` Eli Zaretskii
  2008-10-23  3:09           ` Richard M. Stallman
  0 siblings, 1 reply; 15+ messages in thread
From: Eli Zaretskii @ 2008-10-22 19:21 UTC (permalink / raw)
  To: rms; +Cc: schwab, emacs-devel, miles

> From: "Richard M. Stallman" <rms@gnu.org>
> CC: miles@gnu.org, schwab@suse.de, emacs-devel@gnu.org
> Date: Wed, 22 Oct 2008 02:20:44 -0400
> 
>     I agree.  I think we should introduce a user option to control whether
>     it stops on script boundaries or not, because sometimes it makes
>     sense, sometimes it doesn't.
> 
> That is not a real solution.  The right thing to do is a function
> of the case, not the user.  Making each user specify an option
> according to which cases she typically encounters is not clean.

I agree that it would be better to solve this automatically, but I
sincerely doubt that we will get that right in time for the release
(unless we delay the release for many months).

> It seems that we need a way to specify which kinds of script
> boundaries should be word boundaries, one designed to produce the
> results that users generally want, and which could be set up inside
> Emacs so that users don't have to change it.

I think we lack the knowledge for doing this right.  We don't even
have enough experts on board to cover all the Unicode scripts, or even
their majority.  How in the world will we decide which scripts can or
cannot be mixed in the same word, let alone how this might change in
some specialized Emacs mode?  Unicode annexes know nothing about many
Emacs features, so their advice will not help us except maybe in Text
mode and its closest derivatives.  We will need to develop our own
solutions as we go, like we did with syntax tables in previous
versions, for example.

However, developing those solutions might take a lot of time and user
experience which we do not yet have.  It will take a lot of effort
just to solve the bugs and sluggish performance to bring Emacs 23 to a
releasable state, so if on top of that we delay the release until this
and similar Unicode-related issues are satisfactorily resolved, we
will not release Emacs 23 before another 2 or 3 years pass by.

With that in mind, my suggestion to provide a user option was meant to
give users a fire escape in case Emacs 23.1 does not get their
use-case right.

* Re: Word syntax question
  2008-10-22  5:16           ` Miles Bader
@ 2008-10-22 19:36             ` Eli Zaretskii
  0 siblings, 0 replies; 15+ messages in thread
From: Eli Zaretskii @ 2008-10-22 19:36 UTC (permalink / raw)
  To: Miles Bader; +Cc: schwab, emacs-devel

> From: Miles Bader <miles@gnu.org>
> Cc: schwab@suse.de, emacs-devel@gnu.org
> Date: Wed, 22 Oct 2008 14:16:34 +0900
> 
> Eli Zaretskii <eliz@gnu.org> writes:
> >> But a global setting seems far too coarse, and in general, whether it's
> >> "right" or not seems like it depends more on the precise mixture of
> >> scripts rather than a user's personal preferences.
> >
> > Not global, buffer-specific.  Whether stopping or not on script
> > boundaries depends on the specific mix of scripts in the buffer.
> >
> > In addition, perhaps each word-move command should have a way of
> > overriding that.
> 
> No, I mean a setting which is only "yea or nay" is too coarse (as
> opposed to one that targets specific cases). 

Are you suggesting to use the categories, or something else?  If the
latter, what is it?




* Re: Word syntax question
  2008-10-22  4:29         ` Eli Zaretskii
  2008-10-22  5:16           ` Miles Bader
@ 2008-10-22 21:02           ` Richard M. Stallman
  1 sibling, 0 replies; 15+ messages in thread
From: Richard M. Stallman @ 2008-10-22 21:02 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: schwab, emacs-devel, miles

    > But a global setting seems far too coarse, and in general, whether it's
    > "right" or not seems like it depends more on the precise mixture of
    > scripts rather than a user's personal preferences.

    Not global, buffer-specific.  Whether stopping or not on script
    boundaries depends on the specific mix of scripts in the buffer.

If that is so, we need to identify the different kinds of situations,
in terms of which scripts they combine and what behavior users want.
Then we should try to work out a way for Emacs to recognize these situations
and DTRT for each one.




* Re: Word syntax question
  2008-10-22 19:21         ` Eli Zaretskii
@ 2008-10-23  3:09           ` Richard M. Stallman
  0 siblings, 0 replies; 15+ messages in thread
From: Richard M. Stallman @ 2008-10-23  3:09 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: schwab, miles, emacs-devel

    I agree that it would be better to solve this automatically, but I
    sincerely doubt that we will get that right in time for the release
    (unless we delay the release for many months).

It is misguided to give up without a try just because a problem looks
hard.  If we try it, you may find it is not so hard.  So let's try it!




