From: Eli Zaretskii <eliz@gnu.org>
To: help-gnu-emacs@gnu.org
Subject: Re: word boundaries in Asian languages
Date: Mon, 19 Aug 2013 19:23:04 +0300
Message-ID: <83bo4tlkev.fsf@gnu.org>
In-Reply-To: <87vc329dtf.fsf@ericabrahamsen.net>
> From: Eric Abrahamsen <eric@ericabrahamsen.net>
> Date: Mon, 19 Aug 2013 18:26:20 +0800
>
> I use emacs for prose more than for programming, and I've been idly
> fiddling with making it a better environment for editing other
> languages, particularly Asian languages, particularly Chinese prose.
>
> One of the really awkward things about editing Chinese prose in Emacs is
> that word boundaries are bound to spaces -- in a language that doesn't
> use spaces to delineate words, movement and editing commands are thus
> restricted either to per-character, or per-punctuated-phrase. It's
> unwieldy.
>
> Accurately identifying word boundaries in Chinese is a subject of
> academic research, but a couple of C libraries have emerged (I've pasted
> a couple of likely links at the bottom).
>
> Given that this level of programming is _way_ above my pay grade, I
> raise the following totally hypothetical scenario. How likely is this:

The right place to discuss this is emacs-devel, not here.

> 1. I call "forward-word" (or some equivalent word-based command)
> 2. Emacs checks a variable like use-multilingual-words, or something
> like it, that makes all the following optional.
> 3. It's true, so we check the script of the following character, and try
> a lookup in a variable that pairs scripts with C libraries that
> provide word-level commands for those scripts.
> 4. A library is present! Instead of the usual "forward-word", we now
> call a function from that library to identify the next word boundary.
> Point goes either to that spot, or to the end of a contiguous run of
> characters of the same script that we started in.
>
> So external C libraries would have to be augmented with functions that
> did word boundary location in a way that made sense to emacs, but
> presumably the hard work would have already been done. Given my general
> ignorance, how unlikely is all of this?
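
The four quoted steps can be sketched roughly as follows. This is a minimal illustration in Python, not Emacs code: the registry `SEGMENTERS`, the helper `script_of`, and the function names are all hypothetical, and the script detection is deliberately crude (it only distinguishes Han ideographs from other letters).

```python
import unicodedata
from typing import Callable, Dict

# Hypothetical registry pairing a script name with a segmenter function.
# A segmenter takes (text, pos) and returns the index of the next word end.
SEGMENTERS: Dict[str, Callable[[str, int], int]] = {}


def script_of(ch: str) -> str:
    """Crude script guess; an assumption made for this sketch only."""
    if unicodedata.name(ch, "").startswith("CJK UNIFIED IDEOGRAPH"):
        return "han"
    if ch.isalpha():
        return "other-letter"
    return "none"


def end_of_script_run(text: str, pos: int) -> int:
    """Index just past the contiguous run of characters sharing text[pos]'s script."""
    script = script_of(text[pos])
    end = pos
    while end < len(text) and script_of(text[end]) == script:
        end += 1
    return end


def forward_word(text: str, pos: int, use_multilingual_words: bool = True) -> int:
    """Return the position after the next word, following the quoted steps."""
    # Skip spaces and punctuation before the word.
    while pos < len(text) and script_of(text[pos]) == "none":
        pos += 1
    if pos >= len(text):
        return len(text)

    run_end = end_of_script_run(text, pos)

    # Step 2: the whole mechanism is optional.
    if not use_multilingual_words:
        return run_end

    # Step 3: look up a segmenter for the script of the following character.
    segmenter = SEGMENTERS.get(script_of(text[pos]))
    if segmenter is None:
        return run_end

    # Step 4: use the segmenter's boundary, but always advance by at least
    # one character and never leave the run of the starting script.
    boundary = segmenter(text, pos)
    return min(max(boundary, pos + 1), run_end)
```

With no segmenter registered for a script, the function falls back to the end of the same-script run, which is the behaviour the quoted proposal describes as the alternative.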

A couple of comments:

 . we already have large dictionaries in Emacs, the ones the LEIM
   input methods use; perhaps they could do double duty for matching
   text to words

 . there's also UAX#29, "Unicode Text Segmentation"
   (http://www.unicode.org/reports/tr29/), which specifies default
   word-boundary rules
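
To illustrate what "matching text to words" with a dictionary could look like, here is a minimal Python sketch of greedy forward maximum matching, a simple and well-known segmentation technique. It is not what Emacs does, and it assumes nothing about the format of the LEIM dictionaries: the word set `toy_dictionary`, the `max_len` limit, and both function names are hypothetical.

```python
from typing import List, Set


def next_word_end(text: str, pos: int, dictionary: Set[str], max_len: int = 4) -> int:
    """Return the end index of the longest dictionary word starting at pos.

    Falls back to a single character when nothing in the dictionary matches.
    """
    longest = min(max_len, len(text) - pos)
    for length in range(longest, 1, -1):
        if text[pos:pos + length] in dictionary:
            return pos + length
    return min(pos + 1, len(text))


def segment(text: str, dictionary: Set[str], max_len: int = 4) -> List[str]:
    """Split text into words by repeatedly taking the longest match."""
    words = []
    pos = 0
    while pos < len(text):
        end = next_word_end(text, pos, dictionary, max_len)
        words.append(text[pos:end])
        pos = end
    return words


# Toy word list, purely for illustration.
toy_dictionary = {"我们", "喜欢", "编辑器"}
print(segment("我们喜欢编辑器", toy_dictionary))
```

Greedy matching picks the wrong boundary whenever a longer dictionary entry overlaps the intended word, which is one reason accurate segmentation remains a research topic, as the original message notes.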