From: Eric Abrahamsen
Newsgroups: gmane.emacs.help
Subject: word boundaries in Asian languages
Date: Mon, 19 Aug 2013 18:26:20 +0800
Message-ID: <87vc329dtf.fsf@ericabrahamsen.net>
To: help-gnu-emacs@gnu.org
List-Id: Users list for the GNU Emacs text editor

I use Emacs for prose more than for programming, and I've been idly
fiddling with making it a better environment for editing other
languages, particularly Asian languages, and Chinese prose most of all.

One of the really awkward things about editing Chinese prose in Emacs is
that word boundaries are bound to spaces -- in a language that doesn't
use spaces to delineate words, movement and editing commands are thus
restricted to working either per character or per punctuated phrase.
It's unwieldy.

Accurately identifying word boundaries in Chinese is a subject of
academic research, but a couple of C libraries have emerged (I've pasted
a couple of likely links at the bottom). Given that this level of
programming is _way_ above my pay grade, I raise the following totally
hypothetical scenario. How likely is this:

1. I call "forward-word" (or some equivalent word-based command).

2. Emacs checks a variable like use-multilingual-words, or something to
   that effect, that makes all of the following optional.

3. It's true, so we check the script of the following character, and try
   a lookup in a variable that pairs scripts with C libraries that
   provide word-level commands for those scripts.

4. A library is present! Instead of the usual "forward-word", we now
   call a function from that library to identify the next word boundary.
   Point goes either to that spot, or to the end of the contiguous run
   of characters in the script we started in, whichever comes first.

So external C libraries would have to be augmented with functions that
did word boundary location in a way that made sense to Emacs, but
presumably the hard work would have already been done.

Given my general ignorance, how unlikely is all of this?

Thanks!
Eric

http://technology.chtsai.org/mmseg/
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.108.8593
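
To make the four-step scenario above concrete, here is a rough sketch of the dispatch logic in Python rather than in Emacs Lisp or C. Everything in it is an assumption made for illustration: the names script_of, SEGMENTERS, and han_next_boundary are hypothetical, the script check only distinguishes one block of Han characters from other letters, and the "library" is a toy greedy longest-match lookup against a four-word dictionary. It stands in for a real segmenter such as the ones linked above and is not how those libraries work internally.

```python
# Toy stand-in for a real segmentation library: a tiny word list.
TOY_DICT = {"北京", "大学", "北京大学", "学生"}
MAX_WORD_LEN = 4


def script_of(ch):
    """Very rough script classifier (assumption: Han vs. other letters only)."""
    if 0x4E00 <= ord(ch) <= 0x9FFF:  # CJK Unified Ideographs block
        return "han"
    if ch.isalpha():
        return "letters"
    return "other"  # spaces, punctuation, digits


def han_next_boundary(text, pos):
    """Greedy longest match: end of the longest dictionary word at pos."""
    for length in range(min(MAX_WORD_LEN, len(text) - pos), 1, -1):
        if text[pos:pos + length] in TOY_DICT:
            return pos + length
    return pos + 1  # unknown character: treat it as a one-character word


# Step 3: the variable that pairs scripts with word-boundary functions.
SEGMENTERS = {"han": han_next_boundary}


def end_of_script_run(text, pos):
    """End of the contiguous run of characters sharing the script at pos."""
    script = script_of(text[pos])
    end = pos
    while end < len(text) and script_of(text[end]) == script:
        end += 1
    return end


def forward_word(text, pos, multilingual=True):
    """Return the position after the next word, starting from pos."""
    # Skip spaces and punctuation before the word.
    while pos < len(text) and script_of(text[pos]) == "other":
        pos += 1
    if pos >= len(text):
        return len(text)
    run_end = end_of_script_run(text, pos)
    # Step 2: the option that makes everything below optional.
    segmenter = SEGMENTERS.get(script_of(text[pos])) if multilingual else None
    if segmenter is None:
        return run_end  # usual behaviour: move over the whole run
    # Step 4: ask the library, but never move past the script run.
    return min(segmenter(text, pos), run_end)
```

With the text "北京大学的学生 read Emacs", repeated calls starting from position 0 stop at 4, 5, and 7 inside the Chinese run and then at 12 after "read"; with multilingual=False the first call goes straight to 7, the end of the whole run of Han characters.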