all messages for Emacs-related lists mirrored at yhetil.org
* word boundaries in Asian languages
From: Eric Abrahamsen @ 2013-08-19 10:26 UTC
  To: help-gnu-emacs

I use Emacs for prose more than for programming, and I've been idly
fiddling with making it a better environment for editing other
languages, especially Asian languages, and Chinese prose in particular.

One of the really awkward things about editing Chinese prose in Emacs is
that word boundaries are bound to spaces -- in a language that doesn't
use spaces to delineate words, movement and editing commands are thus
restricted to working either per character or per punctuated phrase.
It's unwieldy.

Accurately identifying word boundaries in Chinese is a subject of
academic research, but a couple of C libraries have emerged (I've pasted
a couple of likely links at the bottom).
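
For a sense of what such a library does at its simplest, here is a
minimal sketch of forward maximum matching, the basic idea that
dictionary-based segmenters build on. It is not the algorithm of either
linked project; the function name `segment` and the three-entry lexicon
`TOY_DICT` are made up purely for illustration.

```python
# Toy forward maximum matching: at each position, take the longest
# lexicon entry that starts there; fall back to a single character.
# TOY_DICT is a made-up lexicon, not data from any real library.
TOY_DICT = {"我们", "喜欢", "编辑器"}


def segment(text, lexicon=TOY_DICT):
    """Split text into words by greedy longest-match against lexicon."""
    max_len = max((len(word) for word in lexicon), default=1)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until one matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


print(segment("我们喜欢编辑器"))
```

Real segmenters add much larger lexicons and rules for resolving
ambiguous matches, which is where the research effort goes.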

Given that this level of programming is _way_ above my pay grade, I
raise the following totally hypothetical scenario. How likely is this:

1. I call "forward-word" (or some equivalent word-based command)
2. Emacs checks a variable like use-multilingual-words, or something
   like that, which makes all the following optional.
3. It's true, so we check the script of the following character, and try
   a lookup in a variable that pairs scripts with C libraries that
   provide word-level commands for those scripts.
4. A library is present! Instead of the usual "forward-word", we now
   call a function from that library to identify the next word boundary.
   Point goes either to that spot, or to the end of a contiguous run of
   characters of the same script that we started in.

So external C libraries would have to be augmented with functions that
did word boundary location in a way that made sense to Emacs, but
presumably the hard work would have already been done. Given my general
ignorance, how unlikely is all of this?

Thanks!
Eric

http://technology.chtsai.org/mmseg/
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.108.8593





