From: Joseph Turner <joseph@breatheoutbreathe.in>
To: Ihor Radchenko <yantar92@posteo.net>
Cc: Christian Moe <mail@christianmoe.com>,
Org Mode Mailing List <emacs-orgmode@gnu.org>,
Bohong Huang <bohonghuang@qq.com>
Subject: Re: Form feed characters break odt export
Date: Wed, 25 Dec 2024 02:10:14 -0800
Message-ID: <878qs4oyyh.fsf@breatheoutbreathe.in>
In-Reply-To: <87jzbpgocm.fsf@localhost> (Ihor Radchenko's message of "Tue, 24 Dec 2024 14:14:17 +0000")
Ihor Radchenko <yantar92@posteo.net> writes:
> Christian Moe <mail@christianmoe.com> writes:
>
>> I don't think it's specific to ODT or LibreOffice, it's the underlying
>> XML 1.0 spec that "discourages" control characters and does not include
>> #xC in the range of characters that XML processors must accept.
>>
>> Spec: https://www.w3.org/TR/REC-xml/#charsets
>>
>> Some discussion:
>> https://stackoverflow.com/questions/404107/why-are-control-characters-illegal-in-xml-1-0
>
> Thanks!
> Then, we can simply remove the disallowed characters.
> See the attached tentative patch.
>
> From 5c9fdd9df32b87be9f81e037336332984bc3b16c Mon Sep 17 00:00:00 2001
> Message-ID: <5c9fdd9df32b87be9f81e037336332984bc3b16c.1735049605.git.yantar92@posteo.net>
> From: Ihor Radchenko <yantar92@posteo.net>
> Date: Tue, 24 Dec 2024 15:11:22 +0100
> Subject: [PATCH] ox-odt: Avoid putting forbidden characters into ODT xml
>
> * lisp/ox-odt.el (org-odt-forbidden-char-re):
> (org-odt-discouraged-char-re): New constants codifying characters that
> are prohibited in XML spec.
> (org-odt--remove-forbidden): New function removing the prohibited
> characters.
> (org-odt--encode-plain-text): Remove the prohibited characters.
> (org-odt-plain-text): Update comment.
>
> Reported-by: Joseph Turner <joseph@breatheoutbreathe.in>
> Link: https://orgmode.org/list/87o711l4u4.fsf@christianmoe.com
> ---
> lisp/ox-odt.el | 38 +++++++++++++++++++++++++++++++++++---
> 1 file changed, 35 insertions(+), 3 deletions(-)
>
> diff --git a/lisp/ox-odt.el b/lisp/ox-odt.el
> index ec81637ef0..61c8d4ec75 100644
> --- a/lisp/ox-odt.el
> +++ b/lisp/ox-odt.el
> @@ -170,6 +170,28 @@ (defconst org-odt-special-string-regexps
> ("\\.\\.\\." . "&#x2026;")) ; hellip
> "Regular expressions for special string conversion.")
>
> +(defconst org-odt-forbidden-char-re
> + (rx (not (in ?\N{U+9} ?\N{U+A} ?\N{U+D}
> + (?\N{U+20} . ?\N{U+D7FF})
> + (?\N{U+E000} . ?\N{U+FFFD})
> + (?\N{U+10000} . ?\N{U+10FFFF}))))
> + "Regexp matching forbidden XML1.0 characters.
> +https://www.w3.org/TR/REC-xml/#charsets")
> +
> +(defconst org-odt-discouraged-char-re
> + (rx (in (?\N{U+7F} . ?\N{U+84}) (?\N{U+86} . ?\N{U+9F})
> + (?\N{U+FDD0} . ?\N{U+FDEF}) (?\N{U+1FFFE} . ?\N{U+1FFFF})
> + (?\N{U+2FFFE} . ?\N{U+2FFFF}) (?\N{U+3FFFE} . ?\N{U+3FFFF})
> + (?\N{U+4FFFE} . ?\N{U+4FFFF}) (?\N{U+5FFFE} . ?\N{U+5FFFF})
> + (?\N{U+6FFFE} . ?\N{U+6FFFF}) (?\N{U+7FFFE} . ?\N{U+7FFFF})
> + (?\N{U+8FFFE} . ?\N{U+8FFFF}) (?\N{U+9FFFE} . ?\N{U+9FFFF})
> + (?\N{U+AFFFE} . ?\N{U+AFFFF}) (?\N{U+BFFFE} . ?\N{U+BFFFF})
> + (?\N{U+CFFFE} . ?\N{U+CFFFF}) (?\N{U+DFFFE} . ?\N{U+DFFFF})
> + (?\N{U+EFFFE} . ?\N{U+EFFFF}) (?\N{U+FFFFE} . ?\N{U+FFFFF})
> + (?\N{U+10FFFE} . ?\N{U+10FFFF})))
> + "Regexp matching discouraged XML1.0 characters.
> +https://www.w3.org/TR/REC-xml/#charsets")
> +
> (defconst org-odt-schema-dir-list
> (list (expand-file-name "./schema/" org-odt-data-dir))
> "List of directories to search for OpenDocument schema files.
> @@ -2892,18 +2914,28 @@ (defun org-odt--encode-tabs-and-spaces (line)
> (format " <text:s text:c=\"%d\"/>" (1- (length s)))))
> line))
>
> +(defun org-odt--remove-forbidden (text)
> + "Remove forbidden and discouraged characters from TEXT.
> +https://www.w3.org/TR/REC-xml/#charsets"
> + (replace-regexp-in-string
> + org-odt-forbidden-char-re ""
> + (replace-regexp-in-string
> + org-odt-discouraged-char-re ""
> + text)))
> +
> (defun org-odt--encode-plain-text (text &optional no-whitespace-filling)
> (dolist (pair '(("&" . "&amp;") ("<" . "&lt;") (">" . "&gt;")))
> (setq text (replace-regexp-in-string (car pair) (cdr pair) text t t)))
> - (if no-whitespace-filling text
> - (org-odt--encode-tabs-and-spaces text)))
> + (org-odt--remove-forbidden
> + (if no-whitespace-filling text
> + (org-odt--encode-tabs-and-spaces text))))
>
> (defun org-odt-plain-text (text info)
> "Transcode a TEXT string from Org to ODT.
> TEXT is the string to transcode. INFO is a plist holding
> contextual information."
> (let ((output text))
> - ;; Protect &, < and >.
> + ;; Protect &, < and >, and remove forbidden characters.
> (setq output (org-odt--encode-plain-text output t))
> ;; Handle smart quotes. Be sure to provide original string since
> ;; OUTPUT may have been modified.
> --
> 2.47.1

Thanks, Ihor! I tested the patch, and it works on my machine.
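
For anyone who wants to reproduce the check without applying the patch,
something like the following should work in a scratch buffer. It is only
a sketch: the regexp is copied from the patch, but the names
my/xml-forbidden-char-re and my/xml-strip-forbidden are made up here so
that nothing clashes with ox-odt.el.

```elisp
;; Local copy of the forbidden-character regexp from the patch, under a
;; hypothetical name (not part of ox-odt.el).
(defconst my/xml-forbidden-char-re
  (rx (not (in ?\N{U+9} ?\N{U+A} ?\N{U+D}
               (?\N{U+20} . ?\N{U+D7FF})
               (?\N{U+E000} . ?\N{U+FFFD})
               (?\N{U+10000} . ?\N{U+10FFFF}))))
  "Regexp matching characters outside the XML 1.0 Char production.")

;; Hypothetical helper: delete every match of the regexp above.
(defun my/xml-strip-forbidden (text)
  "Return TEXT with characters forbidden in XML 1.0 removed."
  (replace-regexp-in-string my/xml-forbidden-char-re "" text t t))

;; A form feed (U+000C) sits between the two "pages" of this string.
;; Tab, newline and carriage return are allowed and stay untouched.
(my/xml-strip-forbidden "page one\n\f\npage two")
;; => "page one\n\npage two"
```
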
Here's another potential solution to consider, which adds a defcustom to
let the user decide how to handle forbidden characters:
https://github.com/kjambunathan/org-mode-ox-odt/commit/07fde1e9b7cdda3e3ef8136f5b1d478499dfd780
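
To give a rough idea of what a user-facing option could look like, here
is a minimal sketch. It is not the code from the commit above: the
option name, its three choices, and the helper function are all
hypothetical, and only the regexp is taken from Ihor's patch.

```elisp
;; Hypothetical option; the name and choices are illustrative and are
;; NOT copied from the linked commit or from ox-odt.el.
(defcustom my/odt-forbidden-char-handling 'remove
  "How to handle characters that XML 1.0 forbids during ODT export.
`remove' deletes them, `replace' substitutes U+FFFD, and `error'
aborts with a message pointing at the first offending character."
  :type '(choice (const :tag "Remove silently" remove)
                 (const :tag "Replace with U+FFFD" replace)
                 (const :tag "Signal an error" error))
  :group 'org-export-odt)

;; Hypothetical helper that applies the option above to a string.
(defun my/odt-handle-forbidden (text)
  "Apply `my/odt-forbidden-char-handling' to TEXT and return the result."
  (let ((re (rx (not (in ?\N{U+9} ?\N{U+A} ?\N{U+D}
                         (?\N{U+20} . ?\N{U+D7FF})
                         (?\N{U+E000} . ?\N{U+FFFD})
                         (?\N{U+10000} . ?\N{U+10FFFF}))))))
    (pcase my/odt-forbidden-char-handling
      ('remove (replace-regexp-in-string re "" text t t))
      ;; U+FFFD is itself inside the allowed range, so the result is valid.
      ('replace (replace-regexp-in-string re "\N{U+FFFD}" text t t))
      ('error (if (string-match re text)
                  (user-error "Forbidden XML character U+%04X at position %d"
                              (aref text (match-beginning 0))
                              (match-beginning 0))
                text))
      ;; Unknown setting: leave TEXT alone.
      (_ text))))
```

With the option left at `remove', this behaves like the patch for
forbidden characters; the other two settings would let a user notice
stray control characters instead of losing them silently.
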
Joseph