* Bug: ODT export of Chinese text inserts spaces for line breaks
@ 2021-06-29 3:47 James Harkins
2021-06-29 4:43 ` tumashu
0 siblings, 1 reply; 7+ messages in thread
From: James Harkins @ 2021-06-29 3:47 UTC (permalink / raw)
To: emacs-orgmode
Consider the following org document.
* Test
1本人不想亲自拿到学历学位证书、急于离校者,可书面委托他人代领学历学位证
书,29日起即可离校;2本人想亲自领取学历学位证书者,按学校规定的程序及有关
要求办理离校手续,领取相关证书后离校;
This was produced by pasting in a single, long line, and then using alt-Q (a normal thing to do, and good for readability, because org-mode doesn't wrap lines by default).
Exporting to ODT produces the following (body text, omitting titles, headers and such).
1本人不想亲自拿到学历学位证书、急于离校者,可书面委托他人代领学历学位证 书,29日起即可离校;2本人想亲自领取学历学位证书者,按学校规定的程序及有关 要求办理离校手续,领取相关证书后离校;
Between 证 and 书, and between 关 and 要, there is a space. Chinese typography does not allow for spaces mid-sentence.
So, it would make sense to add a rule to the exporter: if one of the characters before or after a source-text line break is a Chinese, Japanese or Korean character, do not add a space. (The space is valid, of course, if the characters on either side of the line breaks are Roman or [I would guess] Cyrillic as well.)
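A minimal sketch of the rule proposed above, written as standalone Emacs Lisp rather than as exporter code. The names `my/cjk-char-p' and `my/join-lines-cjk-aware' are hypothetical, and the list of scripts treated as CJK is an assumption for illustration only.

```elisp
;; Hypothetical helper: treat a character as CJK when its entry in
;; `char-script-table' is one of these scripts (an assumed list).
(defun my/cjk-char-p (char)
  "Return non-nil if CHAR is non-nil and belongs to a CJK script."
  (and char
       (memq (aref char-script-table char)
             '(han kana hangul cjk-misc))))

;; Hypothetical helper: join the lines of TEXT, dropping the space
;; when the character on either side of the line break is CJK.
(defun my/join-lines-cjk-aware (text)
  "Replace each line break in TEXT with a space, or with nothing near CJK."
  (with-temp-buffer
    (insert text)
    (goto-char (point-min))
    (while (re-search-forward "[ \t]*\n[ \t]*" nil t)
      (let ((before (char-before (match-beginning 0)))
            (after (char-after (match-end 0))))
        (replace-match
         (if (or (my/cjk-char-p before) (my/cjk-char-p after)) "" " ")
         t t)))
    (buffer-string)))
```

With these definitions, (my/join-lines-cjk-aware "学位证\n书") gives "学位证书", while (my/join-lines-cjk-aware "foo\nbar") gives "foo bar". A trailing newline is turned into a space, which a real exporter would probably want to avoid.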
(Side note: Exporting to a LaTeX buffer shows that the line breaks have been copied into the .tex document as is -- but, provided that you have a `\usepackage{xeCJK}` in the preamble, LaTeX produces correct, space-free output. So -- Org "gets away with it" because of LaTeX's handling of CJK text. It seems for ODT, Org needs to handle the spacing within its own logic.)
This is org 9.1.9... bit old, I know, but I'm gonna take a wild guess that this has not been a high-visibility issue.
hjh
* Re: Bug: ODT export of Chinese text inserts spaces for line breaks
2021-06-29 3:47 Bug: ODT export of Chinese text inserts spaces for line breaks James Harkins
@ 2021-06-29 4:43 ` tumashu
2021-06-29 17:01 ` Bug: " Maxim Nikulin
0 siblings, 1 reply; 7+ messages in thread
From: tumashu @ 2021-06-29 4:43 UTC (permalink / raw)
To: James Harkins; +Cc: emacs-orgmode
You can try the below config :-)
(defun eh-org-wash-text (text backend _info)
  "When exporting an Org file, remove unnecessary spaces between Chinese characters."
  (when (or (org-export-derived-backend-p backend 'html)
            (org-export-derived-backend-p backend 'odt))
    (let ((regexp "[[:multibyte:]]")
          (string text))
      ;; By default, Org mode turns a single newline into a space.
      ;; Chinese text does not need that space, so delete it.
      (setq string
            (replace-regexp-in-string
             (format "\\(%s\\) *\n *\\(%s\\)" regexp regexp)
             "\\1\\2" string))
      ;; Delete the space after bold text (and other closing tags).
      (dolist (str '("</b>" "</code>" "</del>" "</i>"))
        (setq string
              (replace-regexp-in-string
               (format "\\(%s\\)\\(%s\\)[ ]+\\(%s\\)" regexp str regexp)
               "\\1\\2\\3" string)))
      ;; Delete the space before bold text (and other opening tags).
      (dolist (str '("<b>" "<code>" "<del>" "<i>" "<span class=\"underline\">"))
        (setq string
              (replace-regexp-in-string
               (format "\\(%s\\)[ ]+\\(%s\\)\\(%s\\)" regexp str regexp)
               "\\1\\2\\3" string)))
      string)))
(add-hook 'org-export-filter-headline-functions #'eh-org-wash-text)
(add-hook 'org-export-filter-paragraph-functions #'eh-org-wash-text)
On 2021-06-29 11:47:06, "James Harkins" <jamshark70@zoho.com> wrote:
>Consider the following org document.
>
>* Test
>1本人不想亲自拿到学历学位证书、急于离校者,可书面委托他人代领学历学位证
>书,29日起即可离校;2本人想亲自领取学历学位证书者,按学校规定的程序及有关
>要求办理离校手续,领取相关证书后离校;
>
>This was produced by pasting in a single, long line, and then using alt-Q (a normal thing to do, and good for readability, because org-mode doesn't wrap lines by default).
>
>Exporting to ODT produces the following (body text, omitting titles, headers and such).
>
>1本人不想亲自拿到学历学位证书、急于离校者,可书面委托他人代领学历学位证 书,29日起即可离校;2本人想亲自领取学历学位证书者,按学校规定的程序及有关 要求办理离校手续,领取相关证书后离校;
>
>Between 证 and 书, and between 关 and 要, there is a space. Chinese typography does not allow for spaces mid-sentence.
>
>So, it would make sense to add a rule to the exporter: if one of the characters before or after a source-text line break is a Chinese, Japanese or Korean character, do not add a space. (The space is valid, of course, if the characters on either side of the line breaks are Roman or [I would guess] Cyrillic as well.)
>
>(Side note: Exporting to a LaTeX buffer shows that the line breaks have been copied into the .tex document as is -- but, provided that you have a `usepackage{xeCJK}` in the preamble, LaTeX produces correct, space-free output. So -- Org "gets away with it" because of LaTeX's handling of CJK text. It seems for ODT, Org needs to handle the spacing within its own logic.)
>
>This is org 9.1.9... bit old, I know, but I'm gonna take a wild guess that this has not been a high-visibility issue.
>
>hjh
* Re: Bug: ODT export of Chinese text inserts spaces for line breaks
2021-06-29 4:43 ` tumashu
@ 2021-06-29 17:01 ` Maxim Nikulin
2021-06-29 18:19 ` Eric Abrahamsen
0 siblings, 1 reply; 7+ messages in thread
From: Maxim Nikulin @ 2021-06-29 17:01 UTC (permalink / raw)
To: emacs-orgmode
On 29/06/2021 10:47, James Harkins wrote:
> So, it would make sense to add a rule to the exporter: if one of the
> characters before or after a source-text line break is a Chinese,
> Japanese or Korean character, do not add a space.
On 29/06/2021 11:43, tumashu wrote:
> You can try the below config :-)
> (let ((regexp "[[:multibyte:]]")
> (string text))
> (setq string
> (replace-regexp-in-string
> (format "\\(%s\\) *\n *\\(%s\\)" regexp regexp)
> "\\1\\2" string))
Notice that [[:multibyte:]] means almost any non-ASCII script, e.g.
Cyrillic:
(let ((sample "abc абв def"))
  (and (string-match "[[:multibyte:]]+" sample)
       (match-string 0 sample)))
=> "абв"
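To make the consequence concrete, here is a small sketch that applies the same line-joining regexp from the quoted config to two lines of Cyrillic text. The sample string is made up for illustration; only the regexp comes from the config above.

```elisp
;; Apply the line-joining step of the quoted filter to Cyrillic text.
;; Cyrillic words are separated by spaces, so the newline here should
;; have become a space, but the [[:multibyte:]] class removes it.
(let ((regexp "[[:multibyte:]]"))
  (replace-regexp-in-string
   (format "\\(%s\\) *\n *\\(%s\\)" regexp regexp)
   "\\1\\2"
   "абв\nгде"))
;; => "абвгде"
```

The two words are glued together, which is wrong for any script that separates words with spaces.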
It seems that `org-fill-paragraph' (M-q) is smart enough to avoid a space
before or after a CJK character, so it is possible to determine the correct
way to splice lines, even though e.g. the "Script" Unicode property is not
exposed to elisp:
https://www.gnu.org/software/emacs/manual/html_node/elisp/Character-Properties.html
(Anyway, maintaining an explicit list of scripts is not a straightforward
approach.)
P.S.
JavaScript in browsers allows filtering characters that belong to a
particular script:
"abc абв def".match(/\p{Script=Cyrillic}+/u)
Array [ "абв" ]
I have not found such a feature in the regular expressions available in Emacs.
* Re: Bug: ODT export of Chinese text inserts spaces for line breaks
2021-06-29 17:01 ` Bug: " Maxim Nikulin
@ 2021-06-29 18:19 ` Eric Abrahamsen
2021-06-30 12:22 ` Maxim Nikulin
0 siblings, 1 reply; 7+ messages in thread
From: Eric Abrahamsen @ 2021-06-29 18:19 UTC (permalink / raw)
To: Maxim Nikulin; +Cc: emacs-orgmode
Maxim Nikulin <manikulin@gmail.com> writes:
> On 29/06/2021 10:47, James Harkins wrote:
>> So, it would make sense to add a rule to the exporter: if one of the
>> characters before or after a source-text line break is a Chinese,
>> Japanese or Korean character, do not add a space.
>
> On 29/06/2021 11:43, tumashu wrote:
>> You can try the below config :-)
>> (let ((regexp "[[:multibyte:]]")
>> (string text))
>> (setq string
>> (replace-regexp-in-string
>> (format "\\(%s\\) *\n *\\(%s\\)" regexp regexp)
>> "\\1\\2" string))
>
> Notice that [[:multibyte:]] means almost any non-ASCII script, e.g.
> Cyrillic:
>
> (let ((sample "abc абв def"))
> (and (string-match "[[:multibyte:]]\+" sample)
> (match-string 0 sample)))
> "абв"
>
> It seems, `org-fill-paragraph' M-q is smart enough to avoid a space
> before or after a CJK character, so it is possible to determine
> correct way to splice lines, despite e.g. "Script" Unicode property is
> not exposed to elisp:
> https://www.gnu.org/software/emacs/manual/html_node/elisp/Character-Properties.html
> (Anyway maintaining explicit list of scripts is not a straightforward
> approach.)
There are a few ways to approach this:
(aref char-script-table ?中) -> 'han
(string-match-p "\\cc" "中") -> 0
(aref (char-category-set ?中) ?|) -> t
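As a rough illustration of how the category syntax could replace the overly broad [[:multibyte:]] class in the earlier filter, the sketch below joins lines only when the characters on both sides of the break are in category `|'. The function name `my/join-breakable-lines' is hypothetical, and requiring the category on both sides is an assumption, not something settled in this thread.

```elisp
;; Hypothetical helper: remove a line break (and surrounding spaces)
;; only when both neighbouring characters are in category `|', i.e.
;; characters at which a line may be broken, such as Han.
(defun my/join-breakable-lines (text)
  "Splice lines of TEXT without a space where both neighbours match \\c|."
  ;; FIXEDCASE is t so the replacement is inserted exactly as matched.
  ;; Lines consisting of a single character are an edge case that this
  ;; simple regexp does not handle.
  (replace-regexp-in-string
   "\\(\\c|\\) *\n *\\(\\c|\\)" "\\1\\2" text t))
```

With this definition, "学位证\n书" becomes "学位证书", while "абв\nгде" and "foo\nbar" are returned unchanged, so their newlines are still rendered as spaces in ODT.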
* Re: Bug: ODT export of Chinese text inserts spaces for line breaks
2021-06-29 18:19 ` Eric Abrahamsen
@ 2021-06-30 12:22 ` Maxim Nikulin
2022-10-08 13:14 ` Ihor Radchenko
0 siblings, 1 reply; 7+ messages in thread
From: Maxim Nikulin @ 2021-06-30 12:22 UTC (permalink / raw)
To: emacs-orgmode
On 29/06/2021 10:47, James Harkins wrote:
> * Test
> 1本人不想亲自拿到学历学位证书、急于离校者,可书面委托他人代领学历学位证
> 书,29日起即可离校;2本人想亲自领取学历学位证书者,按学校规定的程序及有关
> 要求办理离校手续,领取相关证书后离校;
> Exporting to ODT produces the following (body text, omitting titles,
> headers and such).
>
> 1本人不想亲自拿到学历学位证书、急于离校者,可书面委托他人代领学历学位证 书,29日起即可离校;2本人想亲自领取学历学位证书者,按学校规定的程序及有关 要求办理离校手续,领取相关证书后离校;
Confirmed: newlines are copied to the ODT document as is, and they appear as
spaces in LibreOffice. I did not try HTML, since I am unsure whether
browsers should glue paragraph lines separated by newlines into a continuous
string without spaces. Maybe it is necessary to add some attributes for proper
representation (e.g. "lang"); however, "#+LANGUAGE: cn" does not help,
even though LibreOffice considers the paragraph to be Chinese.
On 30/06/2021 01:19, Eric Abrahamsen wrote:
> There are a few ways to approach this:
>
> (aref char-script-table ?中) -> 'han
> (string-match-p "\\cc" "中") -> 0
> (aref (char-category-set ?中) ?|) -> t
Thank you. I had not noticed all the features hidden behind \c. I believe
(rx (category can-break))
is more readable, and I am a bit surprised that there are no descriptive
aliases for char categories such as ?|. Just to add another example:
(category-set-mnemonics (char-category-set ?ф)) -> ".LYchjy"
and `describe-categories' can be used to decipher it.
As to splicing lines, I found `fill-delete-newlines', which uses
`fill-nospace-between-words-table' in addition to the ?| category to determine
whether a space should be suppressed while splicing lines. In addition,
there are some variables to tune the behavior.
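A sketch of the kind of check described above. It paraphrases my reading of fill.el and is not a copy of the actual source, so it should be verified against the Emacs version in use. The name `my/nospace-join-p' is hypothetical.

```elisp
;; Hypothetical predicate: decide whether the newline between PREV and
;; NEXT should be deleted outright rather than turned into a space.
;; Assumption: a line break may be dropped when at least one neighbour
;; is in category `|' and at least one neighbour has a non-nil entry
;; in `fill-nospace-between-words-table'.
(defun my/nospace-join-p (prev next)
  "Return non-nil if no space is needed between characters PREV and NEXT."
  (and (or (aref (char-category-set next) ?|)
           (aref (char-category-set prev) ?|))
       (or (aref fill-nospace-between-words-table next)
           (aref fill-nospace-between-words-table prev))))
```

For example, (my/nospace-join-p ?证 ?书) should be non-nil, whereas (my/nospace-join-p ?a ?b) should be nil because neither ASCII letter is in category `|'.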
* Re: Bug: ODT export of Chinese text inserts spaces for line breaks
2021-06-30 12:22 ` Maxim Nikulin
@ 2022-10-08 13:14 ` Ihor Radchenko
2022-10-21 5:38 ` Ihor Radchenko
0 siblings, 1 reply; 7+ messages in thread
From: Ihor Radchenko @ 2022-10-08 13:14 UTC (permalink / raw)
To: Maxim Nikulin; +Cc: emacs-orgmode
Maxim Nikulin <manikulin@gmail.com> writes:
> On 29/06/2021 10:47, James Harkins wrote:
>> * Test
>> 1本人不想亲自拿到学历学位证书、急于离校者,可书面委托他人代领学历学位证
>> 书,29日起即可离校;2本人想亲自领取学历学位证书者,按学校规定的程序及有关
>> 要求办理离校手续,领取相关证书后离校;
>
>> Exporting to ODT produces the following (body text, omitting titles,
>> headers and such).
>>
>> 1本人不想亲自拿到学历学位证书、急于离校者,可书面委托他人代领学历学位证 书,29日起即可离校;2本人想亲自领取学历学位证书者,按学校规定的程序及有关 要求办理离校手续,领取相关证书后离校;
>
> Confirmed: newlines are copied to ODT document as is and they appear as
> spaces in libreoffice. I did not tried HTML since I am unsure if
> browsers should glue paragraphs with newlines into continuous string
> without spaces. Maybe it is necessary to add some attributes for proper
> representation (e.g. "lang"), however "#+LANGUAGE: cn" does not help
> even though libreoffice considers paragraph as Chinese.
That newlines appear as spaces is part of the ODT schema.
> As to splicing lines, I found `fill-delete-newlines' that uses
> `fill-nospace-between-words-table' besides ?| category to determine
> whether space should be suppressed while splicing lines. In addition
> there are some variables to tune behavior.
I am attaching the fix that leverages `fill-region' to handle all the
complexities for us. It is the easiest way and I see no reason to look
deeper.
[-- Attachment #2: 0001-ox-odt-Fix-newlines-replaced-by-spaces-in-Han-script.patch --]
[-- Type: text/x-patch, Size: 1833 bytes --]
From 614944ba1ac5502c7648747363674b8d45bfaaf7 Mon Sep 17 00:00:00 2001
Message-Id: <614944ba1ac5502c7648747363674b8d45bfaaf7.1665234699.git.yantar92@gmail.com>
From: Ihor Radchenko <yantar92@gmail.com>
Date: Sat, 8 Oct 2022 21:08:47 +0800
Subject: [PATCH] ox-odt: Fix newlines replaced by spaces in Han script
* lisp/ox-odt.el (org-odt-plain-text): Use `fill-region' to unfill the
paragraphs with newlines accounting for scripts without spaces between
words.
Reported-by: James Harkins <jamshark70@zoho.com>
Link: https://orgmode.org/list/sbhnlv$4t1$1@ciao.gmane.io
---
lisp/ox-odt.el | 17 ++++++++++++++---
1 file changed, 14 insertions(+), 3 deletions(-)
diff --git a/lisp/ox-odt.el b/lisp/ox-odt.el
index 208a39d9d..c989d2014 100644
--- a/lisp/ox-odt.el
+++ b/lisp/ox-odt.el
@@ -2903,9 +2903,20 @@ (defun org-odt-plain-text (text info)
         (setq output
               (replace-regexp-in-string (car pair) (cdr pair) output t nil))))
     ;; Handle break preservation if required.
-    (when (plist-get info :preserve-breaks)
-      (setq output (replace-regexp-in-string
-                    "\\(\\\\\\\\\\)?[ \t]*\n" "<text:line-break/>" output t)))
+    (if (plist-get info :preserve-breaks)
+        (setq output (replace-regexp-in-string
+                      "\\(\\\\\\\\\\)?[ \t]*\n" "<text:line-break/>" output t))
+      ;; OpenDocument schema recognizes newlines as spaces, which may
+      ;; not be desired in scripts that do not separate words with
+      ;; spaces (for example, Han script). `fill-region' is able to
+      ;; handle such situations.
+      (setq output
+            (with-temp-buffer
+              (insert output)
+              ;; Unfill.
+              (let ((fill-column (point-max)))
+                (fill-region (point-min) (point-max)))
+              (buffer-string))))
     ;; Return value.
     output))
--
2.35.1
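The unfill technique used in the patch can be tried outside the exporter with a small standalone helper. This is a sketch, not the code from the patch: the name `my/unfill-string' is hypothetical, and it binds `fill-column' to `most-positive-fixnum' instead of the buffer size. That choice is an assumption on my part, made because `fill-column' counts display columns and Han characters typically occupy two columns each, so a value derived from the character count could be too small to keep the text on one line.

```elisp
;; Hypothetical helper: unfill TEXT by letting `fill-region' splice the
;; lines, so that scripts written without spaces between words (such
;; as Han) are joined without an inserted space.
(defun my/unfill-string (text)
  "Return TEXT with its paragraph lines joined by `fill-region'."
  (with-temp-buffer
    (insert text)
    ;; A very large fill column prevents any re-wrapping of the result.
    (let ((fill-column most-positive-fixnum))
      (fill-region (point-min) (point-max)))
    (buffer-string)))
```

With this definition, "学位证\n书" becomes "学位证书" and "foo\nbar" becomes "foo bar". Note that filling may also normalize leading and trailing whitespace, which matters when the text is only a fragment of a paragraph.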
--
Ihor Radchenko // yantar92,
Org mode contributor,
Learn more about Org mode at <https://orgmode.org/>.
Support Org development at <https://liberapay.com/org-mode>,
or support my work at <https://liberapay.com/yantar92>
* Re: Bug: ODT export of Chinese text inserts spaces for line breaks
2022-10-08 13:14 ` Ihor Radchenko
@ 2022-10-21 5:38 ` Ihor Radchenko
0 siblings, 0 replies; 7+ messages in thread
From: Ihor Radchenko @ 2022-10-21 5:38 UTC (permalink / raw)
To: Ihor Radchenko; +Cc: Maxim Nikulin, emacs-orgmode
Ihor Radchenko <yantar92@gmail.com> writes:
> I am attaching the fix that leverages `fill-region' to handle all the
> complexities for us. It is the easiest way and I see no reason to look
> deeper.
Applied onto main.
https://git.savannah.gnu.org/cgit/emacs/org-mode.git/commit/?id=3502ce2dbb29b70cdbb978d144322d48cb00f26d
Thread overview: 7+ messages
2021-06-29 3:47 Bug: ODT export of Chinese text inserts spaces for line breaks James Harkins
2021-06-29 4:43 ` tumashu
2021-06-29 17:01 ` Bug: " Maxim Nikulin
2021-06-29 18:19 ` Eric Abrahamsen
2021-06-30 12:22 ` Maxim Nikulin
2022-10-08 13:14 ` Ihor Radchenko
2022-10-21 5:38 ` Ihor Radchenko