From: Eli Zaretskii
Newsgroups: gmane.emacs.bugs
Subject: bug#50247: 27.2; wrong `word-wrap' for Chinese characters
Date: Sun, 29 Aug 2021 10:26:56 +0300
Message-ID: <838s0kn22n.fsf@gnu.org>
References: <6C0CE5B4-06AC-401F-822D-5F414D88E5B3@icloud.com>
To: ClaudeMonet
Cc: 50247@debbugs.gnu.org

> Date: Sun, 29 Aug 2021 11:14:40 +0800
> From: ClaudeMonet via "Bug reports for GNU Emacs,
>  the Swiss army knife of text editors"
>
> When `toggle-word-wrap' is enabled, lines that ends with Chinese
> characters and Chinese punctuations won't be separated in the right
> way, "normally", all Chinese words in a sentence will be crowded and
> recognized by Emacs as one single WORD.
>
> e.g. "世界" is a word in
> Chinese, and "世界人民大团结万岁。" is a full sentence ending with a
> full width period, and Emacs would recognize the sentence as a word, thus
> wrap lines in a wrong way.

Emacs 28 introduces the variable word-wrap-by-category; if you set
that non-nil, the above should work as you expect, assuming the
Kinsoku rules are good enough for that. (Since you didn't describe in
detail what you expect the "right way" to be in this case, I couldn't
actually test that the results are as you expect.)

> By the way, I think this one have long been a problem for Chinese users,
> since we use full-width punctuation system instead in English half-width
> is more generally adopted.

Please elaborate on how this presents a problem in Emacs, preferably
with examples.

> Another thing is, in Emacs when you use
> `forward-word' key binding, I know English words are all separated
> either by punctuations or blank characters(, , etc.), but in
> Chinese, words in a single sentence are usually separated by nothing, I
> don't know what the normal practice for "word recognizing" tasks is on
> modern OS like Mac and Windows. I guess there is a dictionary mechanism.

Emacs has find-word-boundary-function-table, which can be used to
define our rules. In general, we try to follow Unicode, but AFAIU
Unicode TR29 doesn't specify any word-breaking rules for Chinese
characters.

> A footnote here, for tokenizing Chinese words, there is a Python
> tokenizer called "jieba" in NLP field, would be a great reference if you
> guys are going to address this issue. The github link of "jieba" is:
>
> https://github.com/fxsjy/jieba

Patches are welcome to add Chinese text segmentation capabilities to
Emacs.
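
For the word-wrap-by-category suggestion above, a minimal init-file
sketch might look like the following. It assumes Emacs 28 or later.
Loading kinsoku explicitly is my assumption about what is needed when
the variable is set with plain `setq' rather than through Customize,
and I have not verified the result against the reporter's expectations.

```elisp
;; Sketch only: assumes Emacs 28+, where `word-wrap-by-category' exists.

;; Load the Kinsoku rules so that line breaks respect characters that
;; must not begin or end a line (e.g. full-width punctuation).
;; Assumption: this explicit load is needed when not using Customize.
(require 'kinsoku)

;; Allow line breaks after characters of the relevant category
;; (this covers CJK characters), not only after whitespace.
(setq word-wrap-by-category t)

;; `word-wrap' is buffer-local, so set its default value to enable
;; word wrapping in all buffers; use `toggle-word-wrap' for one buffer.
(setq-default word-wrap t)
```

With these settings, a long line of Chinese text in a buffer with
truncate-lines set to nil should be allowed to wrap between characters
instead of being treated as a single unbreakable word.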