* bug#50247: 27.2; wrong `word-wrap' for Chinese characters
@ 2021-08-29  3:14 ClaudeMonet via Bug reports for GNU Emacs, the Swiss army knife of text editors
  2021-08-29  7:26 ` Eli Zaretskii
  0 siblings, 1 reply; 3+ messages in thread
From: ClaudeMonet via Bug reports for GNU Emacs, the Swiss army knife of text editors @ 2021-08-29  3:14 UTC (permalink / raw)
  To: 50247



When `toggle-word-wrap' is enabled, lines that end with Chinese
characters and Chinese punctuation are not broken in the right
place: all the Chinese words in a sentence are crowded together and
treated by Emacs as one single WORD.

For example, "世界" is a word in
Chinese, and "世界人民大团结万岁。" is a full sentence ending with a
full-width period, but Emacs treats the whole sentence as one word and
therefore wraps lines in the wrong place.

By the way, I think this has long been a problem for Chinese users,
since we use a full-width punctuation system, whereas in English
half-width punctuation is more generally adopted. Another thing concerns
the `forward-word' key binding in Emacs: English words are all separated
either by punctuation or by blank characters (<space>, <tab>, etc.), but in
Chinese, words in a single sentence are usually not separated by anything. I
don't know what the normal practice for "word recognition" is on
modern OSes like Mac and Windows; I guess there is a dictionary mechanism.

A footnote: for tokenizing Chinese words, there is a Python
tokenizer called "jieba" in the NLP field, which would be a great
reference if you are going to address this issue. The GitHub link for
"jieba" is:

	https://github.com/fxsjy/jieba

Thanks!


In GNU Emacs 27.2 (build 1, x86_64-apple-darwin18.7.0, NS appkit-1671.60 Version 10.14.6 (Build 18G95))
of 2021-03-28 built on builder10-14.porkrind.org
Windowing system distributor 'Apple', version 10.3.2022
System Description:  macOS 11.5.2

Recent messages:
Wrote /Users/claude/.emacs.d/lisp/init-preload-local.el
Quit
Type "q" in help window to delete it.
C-c C-o is undefined
uncompressing simple.el.gz...done
Mark set
find-function-C-source: The C source file buffer.c is not available
Quit [2 times]

Mark set

Configured using:
'configure --with-ns '--enable-locallisppath=/Library/Application
Support/Emacs/${version}/site-lisp:/Library/Application
Support/Emacs/site-lisp' --with-modules'

Configured features:
NOTIFY KQUEUE ACL GNUTLS LIBXML2 ZLIB TOOLKIT_SCROLL_BARS NS MODULES
THREADS JSON PDUMPER GMP

Important settings:
  value of $LANG: en_CN.UTF-8
  locale-coding-system: utf-8

Major mode: Org

Minor modes in effect:
  default-text-scale-mode: t
  recentf-mode: t
  vertico-mode: t
  marginalia-mode: t
  company-quickhelp-mode: t
  company-quickhelp-local-mode: t
  winner-mode: t
  flycheck-color-mode-line-mode: t
  global-flycheck-mode: t
  flycheck-mode: t
  dimmer-mode: t
  global-anzu-mode: t
  anzu-mode: t
  global-company-mode: t
  company-mode: t
  diredfl-global-mode: t
  shell-dirtrack-mode: t
  savehist-mode: t
  electric-pair-mode: t
  delete-selection-mode: t
  global-auto-revert-mode: t
  global-so-long-mode: t
  mode-line-bell-mode: t
  beacon-mode: t
  show-paren-mode: t
  global-page-break-lines-mode: t
  page-break-lines-mode: t
  whole-line-or-region-global-mode: t
  whole-line-or-region-local-mode: t
  hes-mode: t
  which-key-mode: t
  global-whitespace-cleanup-mode: t
  whitespace-cleanup-mode: t
  global-diff-hl-mode: t
  diff-hl-mode: t
  projectile-rails-global-mode: t
  projectile-mode: t
  ipretty-mode: t
  auto-compile-on-load-mode: t
  auto-compile-on-save-mode: t
  immortal-scratch-mode: t
  desktop-save-mode: t
  ns-auto-titlebar-mode: t
  tooltip-mode: t
  global-eldoc-mode: t
  electric-indent-mode: t
  mouse-wheel-mode: t
  file-name-shadow-mode: t
  global-font-lock-mode: t
  font-lock-mode: t
  blink-cursor-mode: t
  auto-composition-mode: t
  auto-encryption-mode: t
  auto-compression-mode: t
  column-number-mode: t
  line-number-mode: t
  auto-fill-function: org-auto-fill-function
  visual-line-mode: t
  transient-mark-mode: t

Load-path shadows:
/Users/claude/.emacs.d/elpa-27.2/magit-20210822.529/magit-section-pkg hides /Users/claude/.emacs.d/elpa-27.2/magit-section-20210819.1119/magit-section-pkg
/Users/claude/.emacs.d/elpa-27.2/seq-2.22/seq hides /Applications/Emacs.app/Contents/Resources/lisp/emacs-lisp/seq

Features:
(shadow sort mail-extr emacsbug sendmail consult-vertico consult
bookmark ielm tabify view cl-print eieio-opt speedbar sb-image ezimage
dframe rainbow-mode help-fns radix-tree switch-window
switch-window-mvborder switch-window-asciiart quail executable cus-edit
cus-start cus-load sanityinc-tomorrow-bright-theme
color-theme-sanityinc-tomorrow default-text-scale recentf tree-widget
orderless vertico marginalia company-quickhelp pos-tip winner windswap
windmove vc-bzr vc-src vc-sccs vc-svn vc-cvs vc-rcs diff-hl-dired
elisp-slime-nav paredit aggressive-indent highlight-quoted
display-line-numbers display-fill-column-indicator rainbow-delimiters
symbol-overlay bug-reference goto-addr flycheck-color-mode-line
flycheck-package package-lint let-alist imenu finder flycheck dimmer
face-remap color anzu company-oddmuse company-keywords company-etags
etags fileloop company-gtags company-dabbrev-code company-dabbrev
company-files company-clang company-capf company-cmake company-semantic
company-bbdb company-php company-template ac-php-core popup xcscope
company-anaconda anaconda-mode xref project pythonic
company-nixos-options nixos-options company pcase disp-table vc-git
vc-darcs org-element avl-tree generator ol-eww eww mm-url url-queue
ol-rmail ol-mhe ol-irc ol-info ol-gnus nnir gnus-sum url url-proxy
url-privacy url-expand url-methods url-history mailcap shr url-cookie
url-domsuf url-util svg xml dom gnus-group gnus-undo gnus-start
gnus-cloud nnimap nnmail mail-source utf7 netrc nnoo gnus-spec gnus-int
gnus-range message rmc puny rfc822 mml mml-sec epa epg epg-config
mm-decode mm-bodies mm-encode mail-parse rfc2231 mailabbrev gmm-utils
mailheader gnus-win gnus nnheader gnus-util rmail rmail-loaddefs rfc2047
rfc2045 ietf-drums text-property-search mail-utils mm-util mail-prsvr
wid-edit ol-docview doc-view image-mode exif dired-x diredfl dired
dired-loaddefs ol-bibtex bibtex ol-bbdb ol-w3m ob-sqlite ob-sql ob-shell
ob-ruby ob-python python tramp-sh docker-tramp tramp-cache tramp
tramp-loaddefs trampver tramp-integration files-x tramp-compat shell
parse-time iso8601 ls-lisp ob-plantuml ob-octave ob-ledger ob-latex
ob-gnuplot ob-dot ob-ditaa ob-R org-clock org ob ob-tangle ob-ref ob-lob
ob-table ob-exp org-macro org-footnote org-src ob-comint org-pcomplete
pcomplete org-list org-faces org-entities time-date noutline outline
org-version ob-emacs-lisp ob-core ob-eval org-table ol org-keys
org-compat org-macs org-loaddefs format-spec find-func cal-menu calendar
cal-loaddefs savehist session elec-pair delsel autorevert filenotify
so-long mode-line-bell beacon paren page-break-lines
whole-line-or-region highlight-escape-sequences which-key diminish
whitespace-cleanup-mode whitespace diff-hl log-view pcvs-util vc-dir
ewoc vc vc-dispatcher diff-mode cl-extra help-mode projectile-rails rake
f dash s inflections inf-ruby ruby-mode smie autoinsert projectile
lisp-mnt grep compile comint ring ibuf-ext ibuffer ibuffer-loaddefs
thingatpt jka-compr ipretty advice auto-compile packed immortal-scratch
uptimes pp server init init-locales init-direnv init-ledger init-dash
init-folding init-misc init-common-lisp init-clojure-cider init-clojure
init-slime init-lisp init-paredit init-nix init-terraform init-docker
init-yaml init-toml init-rust init-nim init-j init-ocaml init-sql
init-rails init-ruby init-purescript init-elm init-haskell init-python
reformatter ansi-color init-http init-haml init-css init-html init-nxml
init-org init-php init-javascript easy-mmode init-erlang erlang-start
init-csv init-markdown init-textile init-crontab init-compile
init-projectile init-github init-git init-darcs init-vc init-whitespace
init-editing-utils init-mmm mmm-auto mmm-vars mmm-utils mmm-compat
init-sessions desktop frameset init-windows init-company
init-hippie-expand init-minibuffer init-recentf init-flycheck
init-ibuffer ibuf-macs init-uniquify init-grep init-isearch init-dired
init-gui-frames ns-auto-titlebar init-osx-keys init-themes init-xterm
init-frame-hooks init-preload-local init-exec-path exec-path-from-shell
init-elpa fullframe finder-inf rx edmacro kmacro slime-autoloads info
package easymenu browse-url url-handlers url-parse auth-source eieio
eieio-core cl-macs eieio-loaddefs password-cache json subr-x map
url-vars seq byte-opt gv bytecomp byte-compile cconv init-site-lisp
cl-seq cl-loaddefs cl-lib init-utils init-benchmarking derived
early-init tooltip eldoc electric uniquify ediff-hook vc-hooks
lisp-float-type mwheel term/ns-win ns-win ucs-normalize mule-util
term/common-win tool-bar dnd fontset image regexp-opt fringe
tabulated-list replace newcomment text-mode elisp-mode lisp-mode
prog-mode register page tab-bar menu-bar rfn-eshadow isearch timer
select scroll-bar mouse jit-lock font-lock syntax facemenu font-core
term/tty-colors frame minibuffer cl-generic cham georgian utf-8-lang
misc-lang vietnamese tibetan thai tai-viet lao korean japanese eucjp-ms
cp51932 hebrew greek romanian slovak czech european ethiopic indian
cyrillic chinese composite charscript charprop case-table epa-hook
jka-cmpr-hook help simple abbrev obarray cl-preloaded nadvice loaddefs
button faces cus-face macroexp files text-properties overlay sha1 md5
base64 format env code-pages mule custom widget hashtable-print-readable
backquote threads kqueue cocoa ns multi-tty make-network-process emacs)

Memory information:
((conses 16 632053 354268)
(symbols 48 59409 246)
(strings 32 197826 53863)
(string-bytes 1 5927719)
(vectors 16 69807)
(vector-slots 8 1717944 390092)
(floats 8 911 2031)
(intervals 56 3152 3510)
(buffers 1000 32))


* bug#50247: 27.2; wrong `word-wrap' for Chinese characters
  2021-08-29  3:14 bug#50247: 27.2; wrong `word-wrap' for Chinese characters ClaudeMonet via Bug reports for GNU Emacs, the Swiss army knife of text editors
@ 2021-08-29  7:26 ` Eli Zaretskii
  2021-09-27 10:50   ` Lars Ingebrigtsen
  0 siblings, 1 reply; 3+ messages in thread
From: Eli Zaretskii @ 2021-08-29  7:26 UTC (permalink / raw)
  To: ClaudeMonet; +Cc: 50247

> Date: Sun, 29 Aug 2021 11:14:40 +0800
> From:  ClaudeMonet via "Bug reports for GNU Emacs,
>  the Swiss army knife of text editors" <bug-gnu-emacs@gnu.org>
> 
> When `toggle-word-wrap' is enabled, lines that end with Chinese
> characters and Chinese punctuation are not broken in the right
> place: all the Chinese words in a sentence are crowded together and
> treated by Emacs as one single WORD.
> 
> For example, "世界" is a word in
> Chinese, and "世界人民大团结万岁。" is a full sentence ending with a
> full-width period, but Emacs treats the whole sentence as one word and
> therefore wraps lines in the wrong place.

Emacs 28 introduces the variable word-wrap-by-category; if you set
that non-nil, the above should work as you expect, assuming the
Kinsoku rules are good enough for that.  (Since you didn't describe in
detail what you expected the "right way" to be in this case, I
couldn't actually test that the results are as you expect.)
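
A minimal sketch of enabling this in an init file, assuming Emacs 28 or
later.  Loading kinsoku explicitly is an assumption about what is needed
when the variable is set with setq rather than through Customize, and
the text-mode hook is only an illustration of where word wrapping might
be turned on.

```elisp
;; Sketch only; requires Emacs 28 or later.
(when (boundp 'word-wrap-by-category)
  ;; Kinsoku supplies the line-break categories that the display
  ;; engine consults (assumed necessary when not using Customize).
  (require 'kinsoku)
  (setq word-wrap-by-category t))

;; `visual-line-mode' enables `word-wrap' in the buffer.
(add-hook 'text-mode-hook #'visual-line-mode)
```

With this in place, a long run of Chinese text such as
"世界人民大团结万岁。" should be allowed to break between characters
instead of being wrapped as a single unit.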

> By the way, I think this has long been a problem for Chinese users,
> since we use a full-width punctuation system, whereas in English
> half-width punctuation is more generally adopted.

Please elaborate in what way this presents a problem in Emacs,
preferably with examples.

> Another thing concerns the `forward-word' key binding in Emacs:
> English words are all separated either by punctuation or by blank
> characters (<space>, <tab>, etc.), but in Chinese, words in a single
> sentence are usually not separated by anything. I don't know what the
> normal practice for "word recognition" is on modern OSes like Mac and
> Windows; I guess there is a dictionary mechanism.

Emacs has find-word-boundary-function-table, which can be used to
define our rules.  In general, we try to follow Unicode, but AFAIU
Unicode TR29 doesn't specify any word-breaking rules for Chinese
characters.
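
As a rough illustration of how that table can be used, the sketch below
makes every character in the CJK Unified Ideographs block a word of its
own.  The function name my/han-char-word-boundary is hypothetical, and
one character per word is not real Chinese segmentation; the sketch only
shows the calling convention (POS and LIMIT) described in the docstring
of find-word-boundary-function-table.

```elisp
;; Hypothetical helper, for illustration only.
(defun my/han-char-word-boundary (pos limit)
  "Treat the Han character at POS as a complete word.
When POS is less than LIMIT the search is forward, and the value is
the position just after the character.  Otherwise the search is
backward, and the value is POS itself."
  (if (< pos limit)
      (min (1+ pos) limit)
    pos))

;; Register the helper for U+4E00..U+9FFF (CJK Unified Ideographs).
(set-char-table-range find-word-boundary-function-table
                      '(#x4E00 . #x9FFF)
                      #'my/han-char-word-boundary)
```

A dictionary-based segmenter would replace the body of the helper with
a lookup that returns the end (or start) of the longest known word.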

> A footnote: for tokenizing Chinese words, there is a Python
> tokenizer called "jieba" in the NLP field, which would be a great
> reference if you are going to address this issue. The GitHub link for
> "jieba" is:
> 
> 	https://github.com/fxsjy/jieba

Patches are welcome to add Chinese text segmentation capabilities to
Emacs.


* bug#50247: 27.2; wrong `word-wrap' for Chinese characters
  2021-08-29  7:26 ` Eli Zaretskii
@ 2021-09-27 10:50   ` Lars Ingebrigtsen
  0 siblings, 0 replies; 3+ messages in thread
From: Lars Ingebrigtsen @ 2021-09-27 10:50 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: ClaudeMonet, 50247

Eli Zaretskii <eliz@gnu.org> writes:

> Emacs 28 introduces the variable word-wrap-by-category; if you set
> that non-nil, the above should work as you expect, assuming the
> Kinsoku rules are good enough for that.  (Since you didn't describe in
> detail what you expected the "right way" to be in this case, I
> couldn't actually test that the results are as you expect.)
>
>> By the way, I think this has long been a problem for Chinese users,
>> since we use a full-width punctuation system, whereas in English
>> half-width punctuation is more generally adopted.
>
> Please elaborate in what way this presents a problem in Emacs,
> preferably with examples.

More information was requested, but no response was given within a
month, so I'm closing this bug report.  If the problem still exists,
please respond to this email and we'll reopen the bug report.

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no
