From: Ihor Radchenko <yantar92@posteo.net>
To: Christian Moe <mail@christianmoe.com>
Cc: Joseph Turner <joseph@breatheoutbreathe.in>,
	Org Mode Mailing List <emacs-orgmode@gnu.org>,
	Bohong Huang <bohonghuang@qq.com>
Subject: Re: Form feed characters break odt export
Date: Tue, 24 Dec 2024 14:14:17 +0000
Message-ID: <87jzbpgocm.fsf@localhost>
In-Reply-To: <87o711l4u4.fsf@christianmoe.com>

[-- Attachment #1: Type: text/plain, Size: 520 bytes --]

Christian Moe <mail@christianmoe.com> writes:

> I don't think it's specific to ODT or LibreOffice, it's the underlying
> XML 1.0 spec that "discourages" control characters and does not include
> #xC in the range of characters that XML processors must accept.
>
> Spec: https://www.w3.org/TR/REC-xml/#charsets
>
> Some discussion:
> https://stackoverflow.com/questions/404107/why-are-control-characters-illegal-in-xml-1-0

Thanks!
Then, we can simply remove the disallowed characters.
See the attached tentative patch.
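
As a rough illustration of the idea (a sketch only, separate from the
attached patch; the `my/` names are hypothetical and not part of Org),
removing everything outside the XML 1.0 "Char" production could look
like this:

```elisp
;;; Sketch only: hypothetical helpers, not part of ox-odt.el.
(require 'rx)

(defconst my/xml10-forbidden-char-re
  ;; Complement of the XML 1.0 "Char" production:
  ;; #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
  (rx (not (in ?\t ?\n ?\r
               (#x20 . #xD7FF)
               (#xE000 . #xFFFD)
               (#x10000 . #x10FFFF))))
  "Regexp matching one character that XML 1.0 does not allow.")

(defun my/xml10-strip-forbidden (text)
  "Return TEXT with characters forbidden by XML 1.0 removed."
  (replace-regexp-in-string my/xml10-forbidden-char-re "" text t t))

;; A form feed (U+000C) is dropped; tab and newline are kept.
(my/xml10-strip-forbidden "one\ftwo\tthree\n") ; => "onetwo\tthree\n"
```

The regexp matches exactly one offending character at a time, so the
replacement never touches neighbouring text.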


[-- Attachment #2: 0001-ox-odt-Avoid-putting-forbidden-characters-into-ODT-x.patch --]
[-- Type: text/x-patch, Size: 3707 bytes --]

From 5c9fdd9df32b87be9f81e037336332984bc3b16c Mon Sep 17 00:00:00 2001
Message-ID: <5c9fdd9df32b87be9f81e037336332984bc3b16c.1735049605.git.yantar92@posteo.net>
From: Ihor Radchenko <yantar92@posteo.net>
Date: Tue, 24 Dec 2024 15:11:22 +0100
Subject: [PATCH] ox-odt: Avoid putting forbidden characters into ODT XML

* lisp/ox-odt.el (org-odt-forbidden-char-re):
(org-odt-discouraged-char-re): New constants codifying characters that
are forbidden or discouraged by the XML 1.0 spec.
(org-odt--remove-forbidden): New function removing the forbidden and
discouraged characters.
(org-odt--encode-plain-text): Remove the forbidden and discouraged
characters.
(org-odt-plain-text): Update comment.

Reported-by: Joseph Turner <joseph@breatheoutbreathe.in>
Link: https://orgmode.org/list/87o711l4u4.fsf@christianmoe.com
---
 lisp/ox-odt.el | 38 +++++++++++++++++++++++++++++++++++---
 1 file changed, 35 insertions(+), 3 deletions(-)

diff --git a/lisp/ox-odt.el b/lisp/ox-odt.el
index ec81637ef0..61c8d4ec75 100644
--- a/lisp/ox-odt.el
+++ b/lisp/ox-odt.el
@@ -170,6 +170,28 @@ (defconst org-odt-special-string-regexps
     ("\\.\\.\\." . "&#x2026;"))		; hellip
   "Regular expressions for special string conversion.")
 
+(defconst org-odt-forbidden-char-re
+  (rx (not (in ?\N{U+9} ?\N{U+A} ?\N{U+D}
+             (?\N{U+20} . ?\N{U+D7FF})
+             (?\N{U+E000} . ?\N{U+FFFD})
+             (?\N{U+10000} . ?\N{U+10FFFF}))))
+  "Regexp matching forbidden XML 1.0 characters.
+https://www.w3.org/TR/REC-xml/#charsets")
+
+(defconst org-odt-discouraged-char-re
+  (rx (in (?\N{U+7F} . ?\N{U+84}) (?\N{U+86} . ?\N{U+9F})
+	  (?\N{U+FDD0} . ?\N{U+FDEF}) (?\N{U+1FFFE} . ?\N{U+1FFFF})
+          (?\N{U+2FFFE} . ?\N{U+2FFFF}) (?\N{U+3FFFE} . ?\N{U+3FFFF})
+	  (?\N{U+4FFFE} . ?\N{U+4FFFF}) (?\N{U+5FFFE} . ?\N{U+5FFFF})
+          (?\N{U+6FFFE} . ?\N{U+6FFFF}) (?\N{U+7FFFE} . ?\N{U+7FFFF})
+          (?\N{U+8FFFE} . ?\N{U+8FFFF}) (?\N{U+9FFFE} . ?\N{U+9FFFF})
+	  (?\N{U+AFFFE} . ?\N{U+AFFFF}) (?\N{U+BFFFE} . ?\N{U+BFFFF})
+          (?\N{U+CFFFE} . ?\N{U+CFFFF}) (?\N{U+DFFFE} . ?\N{U+DFFFF})
+          (?\N{U+EFFFE} . ?\N{U+EFFFF}) (?\N{U+FFFFE} . ?\N{U+FFFFF})
+	  (?\N{U+10FFFE} . ?\N{U+10FFFF})))
+  "Regexp matching discouraged XML 1.0 characters.
+https://www.w3.org/TR/REC-xml/#charsets")
+
 (defconst org-odt-schema-dir-list
   (list (expand-file-name "./schema/" org-odt-data-dir))
   "List of directories to search for OpenDocument schema files.
@@ -2892,18 +2914,28 @@ (defun org-odt--encode-tabs-and-spaces (line)
        (format " <text:s text:c=\"%d\"/>" (1- (length s)))))
    line))
 
+(defun org-odt--remove-forbidden (text)
+  "Remove forbidden and discouraged characters from TEXT.
+https://www.w3.org/TR/REC-xml/#charsets"
+  (replace-regexp-in-string
+   org-odt-forbidden-char-re ""
+   (replace-regexp-in-string
+    org-odt-discouraged-char-re ""
+    text)))
+
 (defun org-odt--encode-plain-text (text &optional no-whitespace-filling)
   (dolist (pair '(("&" . "&amp;") ("<" . "&lt;") (">" . "&gt;")))
     (setq text (replace-regexp-in-string (car pair) (cdr pair) text t t)))
-  (if no-whitespace-filling text
-    (org-odt--encode-tabs-and-spaces text)))
+  (org-odt--remove-forbidden
+   (if no-whitespace-filling text
+     (org-odt--encode-tabs-and-spaces text))))
 
 (defun org-odt-plain-text (text info)
   "Transcode a TEXT string from Org to ODT.
 TEXT is the string to transcode.  INFO is a plist holding
 contextual information."
   (let ((output text))
-    ;; Protect &, < and >.
+    ;; Protect &, < and >, and remove forbidden characters.
     (setq output (org-odt--encode-plain-text output t))
     ;; Handle smart quotes.  Be sure to provide original string since
     ;; OUTPUT may have been modified.
-- 
2.47.1
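
The sixteen pairs U+nFFFE..U+nFFFF listed in
`org-odt-discouraged-char-re' follow a regular pattern (the last two
code points of planes 1 through 16), so the same set could also be
generated rather than spelled out.  The following is only a sketch
with hypothetical `my/` names, not a proposed change to the patch:

```elisp
;;; Sketch only: hypothetical names, not part of ox-odt.el.
(require 'rx)

(defconst my/xml10-discouraged-ranges
  (append
   ;; Control characters and the U+FDD0..U+FDEF noncharacters.
   '((#x7F . #x84) (#x86 . #x9F) (#xFDD0 . #xFDEF))
   ;; Last two code points of each supplementary plane, 1 through 16.
   (mapcar (lambda (plane)
             (cons (+ (* plane #x10000) #xFFFE)
                   (+ (* plane #x10000) #xFFFF)))
           (number-sequence 1 16)))
  "Character ranges that the XML 1.0 spec discourages.")

(defconst my/xml10-discouraged-char-re
  ;; `rx-to-string' is used because the ranges are computed at load time.
  (rx-to-string `(in ,@my/xml10-discouraged-ranges) t)
  "Regexp matching one character discouraged by XML 1.0.")

(string-match-p my/xml10-discouraged-char-re (string #x1FFFE)) ; => 0
(string-match-p my/xml10-discouraged-char-re (string #x85))    ; => nil
```

U+0085 (NEL) falls in the gap between the first two ranges and is
therefore not matched, just as in the hand-written list.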


[-- Attachment #3: Type: text/plain, Size: 223 bytes --]


-- 
Ihor Radchenko // yantar92,
Org mode maintainer,
Learn more about Org mode at <https://orgmode.org/>.
Support Org development at <https://liberapay.com/org-mode>,
or support my work at <https://liberapay.com/yantar92>

Thread overview: 13+ messages
2024-12-21  1:48 Form feed characters break odt export Joseph Turner via General discussions about Org-mode.
2024-12-21  3:56 ` Max Nikulin
2024-12-21  6:52   ` Joseph Turner
2024-12-21  7:23     ` Max Nikulin
2024-12-21 19:06       ` Joseph Turner
2024-12-24 16:23   ` Max Nikulin
2024-12-25 10:16     ` Joseph Turner
2024-12-23 17:32 ` Ihor Radchenko
2024-12-24 11:04   ` Christian Moe
2024-12-24 14:14     ` Ihor Radchenko [this message]
2024-12-25 10:10       ` Joseph Turner
2024-12-24 14:25     ` Max Nikulin
2024-12-24 14:30       ` Ihor Radchenko