* Form feed characters break odt export
@ 2024-12-21  1:48 Joseph Turner via General discussions about Org-mode.
  2024-12-21  3:56 ` Max Nikulin
  2024-12-23 17:32 ` Ihor Radchenko
  0 siblings, 2 replies; 18+ messages in thread
From: Joseph Turner via General discussions about Org-mode. @ 2024-12-21  1:48 UTC
To: Org Mode Mailing List; +Cc: Bohong Huang

Tested on

GNU Emacs 29.4 (build 1, x86_64-pc-linux-gnu, GTK+ Version 3.24.41,
cairo version 1.18.0)
Org mode version 9.7.6 (9.7.6-7a4527 @ /home/joseph/.emacs.d/elpa/org-9.7.6/)

I can export the following Org content to a .odt file, but the exported
file cannot be opened ("Read Error. Format error discovered in the file
in sub-document content.xml at 368,2(row,col).")

--8<---------------cut here---------------start------------->8---
#+TITLE: Foo
* Bar
Baz
\f
--8<---------------cut here---------------end--------------->8---

First reported by bohonghuang:
https://github.com/bohonghuang/org-srs/pull/10#issuecomment-2557417871

Thanks!

Joseph

* Re: Form feed characters break odt export
From: Max Nikulin @ 2024-12-21  3:56 UTC
To: Joseph Turner, emacs-orgmode; +Cc: Bohong Huang

On 21/12/2024 08:48, Joseph Turner wrote:
>
> I can export the following Org content to a .odt file, but the exported
> file cannot be opened ("Read Error. Format error discovered in the file
> in sub-document content.xml at 368,2(row,col).")
[...]
> First reported by bohonghuang:
> https://github.com/bohonghuang/org-srs/pull/10#issuecomment-2557417871

In this specific context a workaround should be

#+begin_comment
^L
#+end_comment

Or a commented-out empty local variables block above.

I have already written that I do not like non-printable characters in
Org files. I admit that special characters should either cause
`org-lint' warnings or be filtered out by exporters.

Specifically for ^L, there was a request to treat it as a page break in
all exporters (I would prefer some entity or macro instead, so as not to
deviate from plain text markup).

Marvin Gülker. Feature request: export form feed as page break. Sat,
21 Oct 2023 09:42:33 +0200.
<https://list.orgmode.org/87zg0ce6yi.fsf@guelker.eu>

I have not had a close look at another proposed feature, but I suspect
that it might make filtering special characters trickier. (I would be
happy to hear that I am wrong.)

Nathaniel Nicandro. [PATCH] ANSI color on example blocks and fixed
width elements. Wed, 05 Apr 2023 07:03:43 -0500.
<https://list.orgmode.org/874jpuijpc.fsf@gmail.com>

* Re: Form feed characters break odt export
From: Joseph Turner @ 2024-12-21  6:52 UTC
To: emacs-orgmode; +Cc: Bohong Huang, Max Nikulin

Max Nikulin <manikulin@gmail.com> writes:

> On 21/12/2024 08:48, Joseph Turner wrote:
>> I can export the following Org content to a .odt file, but the
>> exported file cannot be opened ("Read Error. Format error discovered
>> in the file in sub-document content.xml at 368,2(row,col).")
> [...]
>> First reported by bohonghuang:
>> https://github.com/bohonghuang/org-srs/pull/10#issuecomment-2557417871
>
> In this specific context a workaround should be
>
> #+begin_comment
> ^L
> #+end_comment

Thank you!  Or even simpler:

# ^L

> Or a commented-out empty local variables block above.
>
> I have already written that I do not like non-printable characters in
> Org files.

I agree that they make Org files less portable outside Emacs, and they
complicate org-export.

> I admit that special characters should either cause `org-lint'
> warnings or be filtered out by exporters.
>
> Specifically for ^L, there was a request to treat it as a page break in
> all exporters (I would prefer some entity or macro instead, so as not
> to deviate from plain text markup).
>
> Marvin Gülker. Feature request: export form feed as page break. Sat,
> 21 Oct 2023 09:42:33 +0200.
> <https://list.orgmode.org/87zg0ce6yi.fsf@guelker.eu>
>
> I have not had a close look at another proposed feature, but I suspect
> that it might make filtering special characters trickier. (I would
> be happy to hear that I am wrong.)

Yes.  Without digging into it, my gut feeling is also that handling one
non-printable character specially would open Pandora's box.

> Nathaniel Nicandro. [PATCH] ANSI color on example blocks and fixed
> width elements. Wed, 05 Apr 2023 07:03:43 -0500.
> <https://list.orgmode.org/874jpuijpc.fsf@gmail.com>

Gratefully,

Joseph

* Re: Form feed characters break odt export
From: Max Nikulin @ 2024-12-21  7:23 UTC
To: Joseph Turner, emacs-orgmode; +Cc: Bohong Huang

On 21/12/2024 13:52, Joseph Turner wrote:
> Max Nikulin writes:
>>
>> #+begin_comment
>> ^L
>> #+end_comment

> Thank you!  Or even simpler:
>
> # ^L

It was the first thing I tried, but Emacs-28.2 demands a decision on
whether Local Variables should be applied.

You may ask Emacs developers for a *plain text* spell to stop
processing of local variables (or to take the *last* found block).

Notice that the commit diff looks confusing.

* Re: Form feed characters break odt export
From: Joseph Turner @ 2024-12-21 19:06 UTC
To: emacs-orgmode; +Cc: Bohong Huang, Max Nikulin

Max Nikulin <manikulin@gmail.com> writes:

> On 21/12/2024 13:52, Joseph Turner wrote:
>> Max Nikulin writes:
>>>
>>> #+begin_comment
>>> ^L
>>> #+end_comment
>
>> Thank you!  Or even simpler:
>> # ^L
>
> It was the first thing I tried, but Emacs-28.2 demands a decision on
> whether Local Variables should be applied.

Oops!  You're right.  The form feed needs to be at the beginning of the
line.

> You may ask Emacs developers for a *plain text* spell to stop
> processing of local variables (or to take the *last* found block).

Good idea:

https://yhetil.org/emacs-devel/87ttawewpx.fsf@breatheoutbreathe.in/T/#u

> Notice that the commit diff looks confusing.

Joseph

* Re: Form feed characters break odt export
From: Max Nikulin @ 2024-12-24 16:23 UTC
To: Joseph Turner, emacs-orgmode; +Cc: Bohong Huang

On 21/12/2024 10:56, Max Nikulin wrote:
> On 21/12/2024 08:48, Joseph Turner wrote:
>> https://github.com/bohonghuang/org-srs/pull/10#issuecomment-2557417871
>
> In this specific context a workaround should be
>
> #+begin_comment
> ^L
> #+end_comment

To avoid confusing other contributors, it should be more verbose:

#+begin_comment
Keep this block at the bottom of the file.
It instructs Emacs to ignore examples
of local variables sections above, see
<info:emacs#Specifying File Variables>
The following line contains the form feed 0x0c character.
^L
#+end_comment

* Re: Form feed characters break odt export
From: Joseph Turner @ 2024-12-25 10:16 UTC
To: emacs-orgmode; +Cc: Bohong Huang

Max Nikulin <manikulin@gmail.com> writes:

> On 21/12/2024 10:56, Max Nikulin wrote:
>> On 21/12/2024 08:48, Joseph Turner wrote:
>>> https://github.com/bohonghuang/org-srs/pull/10#issuecomment-2557417871
>> In this specific context a workaround should be
>> #+begin_comment
>> ^L
>> #+end_comment
>
> To avoid confusion of other contributors it should be more verbose:
>
> #+begin_comment
> Keep this block at the bottom of the file.
> It instructs Emacs to ignore examples
> of local variables sections above, see
> <info:emacs#Specifying File Variables>
> The following line contains the form feed 0x0c character.
> ^L
> #+end_comment

Thank you, Max.  Submitted PR:

https://github.com/bohonghuang/org-srs/pull/12

Joseph

* Re: Form feed characters break odt export
From: Ihor Radchenko @ 2024-12-23 17:32 UTC
To: Joseph Turner; +Cc: Org Mode Mailing List, Bohong Huang

Joseph Turner via "General discussions about Org-mode."
<emacs-orgmode@gnu.org> writes:

> I can export the following Org content to a .odt file, but the exported
> file cannot be opened ("Read Error. Format error discovered in the file
> in sub-document content.xml at 368,2(row,col).")
>
> --8<---------------cut here---------------start------------->8---
> #+TITLE: Foo
> * Bar
> Baz
> \f
> --8<---------------cut here---------------end--------------->8---

Looks like ^L is not allowed in ODT files.
However, I see no such information on
http://docs.oasis-open.org/office/v1.2/os/OpenDocument-v1.2-os-part1.html

Could somebody check if there is an official list of unsupported
characters in ODT? Or maybe it is simply a bug in LibreOffice?

-- 
Ihor Radchenko // yantar92,
Org mode maintainer,
Learn more about Org mode at <https://orgmode.org/>.
Support Org development at <https://liberapay.com/org-mode>,
or support my work at <https://liberapay.com/yantar92>

* Re: Form feed characters break odt export
From: Christian Moe @ 2024-12-24 11:04 UTC
To: Ihor Radchenko; +Cc: Joseph Turner, Org Mode Mailing List, Bohong Huang

(Re-sending to include the list. Apologies, recent mu4e UI changes keep
tripping me up.)

Ihor Radchenko <yantar92@posteo.net> writes:

> Joseph Turner via "General discussions about Org-mode."
> <emacs-orgmode@gnu.org> writes:
>
>> I can export the following Org content to a .odt file, but the exported
>> file cannot be opened ("Read Error. Format error discovered in the file
>> in sub-document content.xml at 368,2(row,col).")
>>
>> --8<---------------cut here---------------start------------->8---
>> #+TITLE: Foo
>> * Bar
>> Baz
>> \f
>> --8<---------------cut here---------------end--------------->8---
>
> Looks like ^L is not allowed in ODT files.
> However, I see no such information on
> http://docs.oasis-open.org/office/v1.2/os/OpenDocument-v1.2-os-part1.html
>
> Could somebody check if there is an official list of unsupported
> characters in ODT? Or maybe it is simply a bug in LibreOffice?

I don't think it's specific to ODT or LibreOffice; it's the underlying
XML 1.0 spec that "discourages" control characters and does not include
#xC in the range of characters that XML processors must accept.

Spec: https://www.w3.org/TR/REC-xml/#charsets

Some discussion:
https://stackoverflow.com/questions/404107/why-are-control-characters-illegal-in-xml-1-0

Yours,
Christian
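
For readers who want to check a string against the character range Christian refers to, here is a minimal sketch in Emacs Lisp. It encodes the `Char' production from the XML 1.0 spec linked above. The names `my/xml10-char-p' and `my/xml10-forbidden-positions' are hypothetical helpers made up for this illustration; they are not part of Org or Emacs.

```elisp
;; Sketch only: both function names below are hypothetical helpers.

(defun my/xml10-char-p (char)
  "Return non-nil if CHAR is allowed by the XML 1.0 `Char' production.
Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD]
       | [#x10000-#x10FFFF]"
  (or (memq char '(#x9 #xA #xD))
      (<= #x20 char #xD7FF)
      (<= #xE000 char #xFFFD)
      (<= #x10000 char #x10FFFF)))

(defun my/xml10-forbidden-positions (string)
  "Return a list of (INDEX . CHAR) for characters of STRING outside XML 1.0."
  (let ((index 0)
        result)
    (dolist (char (string-to-list string))
      (unless (my/xml10-char-p char)
        (push (cons index char) result))
      (setq index (1+ index)))
    (nreverse result)))

;; The form feed (U+000C) is flagged; newline and tab are allowed.
(my/xml10-forbidden-positions "Baz\n\f\tqux")
```

With the sample string, only the form feed at index 4 (character code 12) should be reported, which matches the observation that #xC falls outside the accepted range.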

* Re: Form feed characters break odt export
From: Ihor Radchenko @ 2024-12-24 14:14 UTC
To: Christian Moe; +Cc: Joseph Turner, Org Mode Mailing List, Bohong Huang

Christian Moe <mail@christianmoe.com> writes:

> I don't think it's specific to ODT or LibreOffice, it's the underlying
> XML 1.0 spec that "discourages" control characters and does not include
> #xC in the range of characters that XML processors must accept.
>
> Spec: https://www.w3.org/TR/REC-xml/#charsets
>
> Some discussion:
> https://stackoverflow.com/questions/404107/why-are-control-characters-illegal-in-xml-1-0

Thanks!
Then, we can simply remove the disallowed characters.
See the attached tentative patch.

[-- Attachment #2: 0001-ox-odt-Avoid-putting-forbidden-characters-into-ODT-x.patch --]

From 5c9fdd9df32b87be9f81e037336332984bc3b16c Mon Sep 17 00:00:00 2001
Message-ID: <5c9fdd9df32b87be9f81e037336332984bc3b16c.1735049605.git.yantar92@posteo.net>
From: Ihor Radchenko <yantar92@posteo.net>
Date: Tue, 24 Dec 2024 15:11:22 +0100
Subject: [PATCH] ox-odt: Avoid putting forbidden characters into ODT xml

* lisp/ox-odt.el (org-odt-forbidden-char-re):
(org-odt-discouraged-char-re): New constants codifying characters that
are prohibited in XML spec.
(org-odt--remove-forbidden): New function removing the prohibited
characters.
(org-odt--encode-plain-text): Remove the prohibited characters.
(org-odt-plain-text): Update comment.

Reported-by: Joseph Turner <joseph@breatheoutbreathe.in>
Link: https://orgmode.org/list/87o711l4u4.fsf@christianmoe.com
---
 lisp/ox-odt.el | 38 +++++++++++++++++++++++++++++++++++---
 1 file changed, 35 insertions(+), 3 deletions(-)

diff --git a/lisp/ox-odt.el b/lisp/ox-odt.el
index ec81637ef0..61c8d4ec75 100644
--- a/lisp/ox-odt.el
+++ b/lisp/ox-odt.el
@@ -170,6 +170,28 @@ (defconst org-odt-special-string-regexps
     ("\\.\\.\\." . "&#x2026;"))		; hellip
   "Regular expressions for special string conversion.")
 
+(defconst org-odt-forbidden-char-re
+  (rx (not (in ?\N{U+9} ?\N{U+A} ?\N{U+D}
+               (?\N{U+20} . ?\N{U+D7FF})
+               (?\N{U+E000} . ?\N{U+FFFD})
+               (?\N{U+10000} . ?\N{U+10FFFF}))))
+  "Regexp matching forbidden XML1.0 characters.
+https://www.w3.org/TR/REC-xml/#charsets")
+
+(defconst org-odt-discouraged-char-re
+  (rx (in (?\N{U+7F} . ?\N{U+84}) (?\N{U+86} . ?\N{U+9F})
+          (?\N{U+FDD0} . ?\N{U+FDEF}) (?\N{U+1FFFE} . ?\N{U+1FFFF})
+          (?\N{U+2FFFE} . ?\N{U+2FFFF}) (?\N{U+3FFFE} . ?\N{U+3FFFF})
+          (?\N{U+4FFFE} . ?\N{U+4FFFF}) (?\N{U+5FFFE} . ?\N{U+5FFFF})
+          (?\N{U+6FFFE} . ?\N{U+6FFFF}) (?\N{U+7FFFE} . ?\N{U+7FFFF})
+          (?\N{U+8FFFE} . ?\N{U+8FFFF}) (?\N{U+9FFFE} . ?\N{U+9FFFF})
+          (?\N{U+AFFFE} . ?\N{U+AFFFF}) (?\N{U+BFFFE} . ?\N{U+BFFFF})
+          (?\N{U+CFFFE} . ?\N{U+CFFFF}) (?\N{U+DFFFE} . ?\N{U+DFFFF})
+          (?\N{U+EFFFE} . ?\N{U+EFFFF}) (?\N{U+FFFFE} . ?\N{U+FFFFF})
+          (?\N{U+10FFFE} . ?\N{U+10FFFF})))
+  "Regexp matching discouraged XML1.0 characters.
+https://www.w3.org/TR/REC-xml/#charsets")
+
 (defconst org-odt-schema-dir-list
   (list (expand-file-name "./schema/" org-odt-data-dir))
   "List of directories to search for OpenDocument schema files.
@@ -2892,18 +2914,28 @@ (defun org-odt--encode-tabs-and-spaces (line)
 	     (format " <text:s text:c=\"%d\"/>" (1- (length s)))))
    line))
 
+(defun org-odt--remove-forbidden (text)
+  "Remove forbidden and discouraged characters from TEXT.
+https://www.w3.org/TR/REC-xml/#charsets"
+  (replace-regexp-in-string
+   org-odt-forbidden-char-re ""
+   (replace-regexp-in-string
+    org-odt-discouraged-char-re ""
+    text)))
+
 (defun org-odt--encode-plain-text (text &optional no-whitespace-filling)
   (dolist (pair '(("&" . "&amp;") ("<" . "&lt;") (">" . "&gt;")))
     (setq text (replace-regexp-in-string (car pair) (cdr pair) text t t)))
-  (if no-whitespace-filling text
-    (org-odt--encode-tabs-and-spaces text)))
+  (org-odt--remove-forbidden
+   (if no-whitespace-filling text
+     (org-odt--encode-tabs-and-spaces text))))
 
 (defun org-odt-plain-text (text info)
   "Transcode a TEXT string from Org to ODT.
 TEXT is the string to transcode.  INFO is a plist holding
 contextual information."
   (let ((output text))
-    ;; Protect &, < and >.
+    ;; Protect &, < and >, and remove forbidden characters.
     (setq output (org-odt--encode-plain-text output t))
     ;; Handle smart quotes.  Be sure to provide original string since
     ;; OUTPUT may have been modified.
-- 
2.47.1

-- 
Ihor Radchenko // yantar92,
Org mode maintainer,
Learn more about Org mode at <https://orgmode.org/>.
Support Org development at <https://liberapay.com/org-mode>,
or support my work at <https://liberapay.com/yantar92>
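
For setups where patching Org is not an option, the same removal can be approximated from user configuration with an export filter. The following is only a sketch under stated assumptions: `my/odt-forbidden-char-re' and `my/odt-strip-forbidden-chars' are hypothetical names, the regexp is copied from the tentative patch above, and the filter only sees plain-text objects, so a form feed inside other element types (for example source blocks) would presumably not be covered. `org-export-filter-plain-text-functions' and `org-export-derived-backend-p' are existing Org export facilities.

```elisp
(require 'ox-odt)

;; Hypothetical name; the character ranges mirror the XML 1.0 `Char'
;; production, as in the tentative patch.
(defconst my/odt-forbidden-char-re
  (rx (not (in ?\N{U+9} ?\N{U+A} ?\N{U+D}
               (?\N{U+20} . ?\N{U+D7FF})
               (?\N{U+E000} . ?\N{U+FFFD})
               (?\N{U+10000} . ?\N{U+10FFFF}))))
  "Regexp matching characters outside the XML 1.0 `Char' production.")

(defun my/odt-strip-forbidden-chars (text backend _info)
  "Remove XML-1.0-forbidden characters from TEXT when exporting to ODT.
Return nil for any other BACKEND, which leaves TEXT unchanged."
  (when (org-export-derived-backend-p backend 'odt)
    (replace-regexp-in-string my/odt-forbidden-char-re "" text t t)))

;; Register the filter for plain-text objects of every export.
(add-to-list 'org-export-filter-plain-text-functions
             #'my/odt-strip-forbidden-chars)
```

Returning nil for non-ODT back-ends relies on the usual export filter convention that a nil result is ignored, so other exporters keep the original text.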

* Re: Form feed characters break odt export
From: Joseph Turner @ 2024-12-25 10:10 UTC
To: Ihor Radchenko; +Cc: Christian Moe, Org Mode Mailing List, Bohong Huang

Ihor Radchenko <yantar92@posteo.net> writes:

> Christian Moe <mail@christianmoe.com> writes:
>
>> I don't think it's specific to ODT or LibreOffice, it's the underlying
>> XML 1.0 spec that "discourages" control characters and does not include
>> #xC in the range of characters that XML processors must accept.
>>
>> Spec: https://www.w3.org/TR/REC-xml/#charsets
>>
>> Some discussion:
>> https://stackoverflow.com/questions/404107/why-are-control-characters-illegal-in-xml-1-0
>
> Thanks!
> Then, we can simply remove the disallowed characters.
> See the attached tentative patch.
>
> From 5c9fdd9df32b87be9f81e037336332984bc3b16c Mon Sep 17 00:00:00 2001
> Message-ID: <5c9fdd9df32b87be9f81e037336332984bc3b16c.1735049605.git.yantar92@posteo.net>
> From: Ihor Radchenko <yantar92@posteo.net>
> Date: Tue, 24 Dec 2024 15:11:22 +0100
> Subject: [PATCH] ox-odt: Avoid putting forbidden characters into ODT xml
>
> * lisp/ox-odt.el (org-odt-forbidden-char-re):
> (org-odt-discouraged-char-re): New constants codifying characters that
> are prohibited in XML spec.
> (org-odt--remove-forbidden): New function removing the prohibited
> characters.
> (org-odt--encode-plain-text): Remove the prohibited characters.
> (org-odt-plain-text): Update comment.
>
> Reported-by: Joseph Turner <joseph@breatheoutbreathe.in>
> Link: https://orgmode.org/list/87o711l4u4.fsf@christianmoe.com
> ---
>  lisp/ox-odt.el | 38 +++++++++++++++++++++++++++++++++++---
>  1 file changed, 35 insertions(+), 3 deletions(-)
>
> diff --git a/lisp/ox-odt.el b/lisp/ox-odt.el
> index ec81637ef0..61c8d4ec75 100644
> --- a/lisp/ox-odt.el
> +++ b/lisp/ox-odt.el
> @@ -170,6 +170,28 @@ (defconst org-odt-special-string-regexps
>      ("\\.\\.\\." . "&#x2026;"))		; hellip
>    "Regular expressions for special string conversion.")
>
> +(defconst org-odt-forbidden-char-re
> +  (rx (not (in ?\N{U+9} ?\N{U+A} ?\N{U+D}
> +               (?\N{U+20} . ?\N{U+D7FF})
> +               (?\N{U+E000} . ?\N{U+FFFD})
> +               (?\N{U+10000} . ?\N{U+10FFFF}))))
> +  "Regexp matching forbidden XML1.0 characters.
> +https://www.w3.org/TR/REC-xml/#charsets")
> +
> +(defconst org-odt-discouraged-char-re
> +  (rx (in (?\N{U+7F} . ?\N{U+84}) (?\N{U+86} . ?\N{U+9F})
> +          (?\N{U+FDD0} . ?\N{U+FDEF}) (?\N{U+1FFFE} . ?\N{U+1FFFF})
> +          (?\N{U+2FFFE} . ?\N{U+2FFFF}) (?\N{U+3FFFE} . ?\N{U+3FFFF})
> +          (?\N{U+4FFFE} . ?\N{U+4FFFF}) (?\N{U+5FFFE} . ?\N{U+5FFFF})
> +          (?\N{U+6FFFE} . ?\N{U+6FFFF}) (?\N{U+7FFFE} . ?\N{U+7FFFF})
> +          (?\N{U+8FFFE} . ?\N{U+8FFFF}) (?\N{U+9FFFE} . ?\N{U+9FFFF})
> +          (?\N{U+AFFFE} . ?\N{U+AFFFF}) (?\N{U+BFFFE} . ?\N{U+BFFFF})
> +          (?\N{U+CFFFE} . ?\N{U+CFFFF}) (?\N{U+DFFFE} . ?\N{U+DFFFF})
> +          (?\N{U+EFFFE} . ?\N{U+EFFFF}) (?\N{U+FFFFE} . ?\N{U+FFFFF})
> +          (?\N{U+10FFFE} . ?\N{U+10FFFF})))
> +  "Regexp matching discouraged XML1.0 characters.
> +https://www.w3.org/TR/REC-xml/#charsets")
> +
>  (defconst org-odt-schema-dir-list
>    (list (expand-file-name "./schema/" org-odt-data-dir))
>    "List of directories to search for OpenDocument schema files.
> @@ -2892,18 +2914,28 @@ (defun org-odt--encode-tabs-and-spaces (line)
>  	     (format " <text:s text:c=\"%d\"/>" (1- (length s)))))
>     line))
>
> +(defun org-odt--remove-forbidden (text)
> +  "Remove forbidden and discouraged characters from TEXT.
> +https://www.w3.org/TR/REC-xml/#charsets"
> +  (replace-regexp-in-string
> +   org-odt-forbidden-char-re ""
> +   (replace-regexp-in-string
> +    org-odt-discouraged-char-re ""
> +    text)))
> +
>  (defun org-odt--encode-plain-text (text &optional no-whitespace-filling)
>    (dolist (pair '(("&" . "&amp;") ("<" . "&lt;") (">" . "&gt;")))
>      (setq text (replace-regexp-in-string (car pair) (cdr pair) text t t)))
> -  (if no-whitespace-filling text
> -    (org-odt--encode-tabs-and-spaces text)))
> +  (org-odt--remove-forbidden
> +   (if no-whitespace-filling text
> +     (org-odt--encode-tabs-and-spaces text))))
>
>  (defun org-odt-plain-text (text info)
>    "Transcode a TEXT string from Org to ODT.
>  TEXT is the string to transcode.  INFO is a plist holding
>  contextual information."
>    (let ((output text))
> -    ;; Protect &, < and >.
> +    ;; Protect &, < and >, and remove forbidden characters.
>      (setq output (org-odt--encode-plain-text output t))
>      ;; Handle smart quotes.  Be sure to provide original string since
>      ;; OUTPUT may have been modified.
> --
> 2.47.1

Thanks, Ihor!  Tested working on my machine.

Here's another potential solution to consider, which adds a defcustom to
let the user decide how to handle forbidden characters:

https://github.com/kjambunathan/org-mode-ox-odt/commit/07fde1e9b7cdda3e3ef8136f5b1d478499dfd780

Joseph

* Re: Form feed characters break odt export
From: Ihor Radchenko @ 2024-12-27 10:21 UTC
To: Joseph Turner; +Cc: Christian Moe, Org Mode Mailing List, Bohong Huang

Joseph Turner <joseph@breatheoutbreathe.in> writes:

> Thanks, Ihor!  Tested working on my machine.
>
> Here's another potential solution to consider, which adds a defcustom to
> let the user decide how to handle forbidden characters:
>
> https://github.com/kjambunathan/org-mode-ox-odt/commit/07fde1e9b7cdda3e3ef8136f5b1d478499dfd780

Good idea!
I went even further and used a proper export setting.
See the attached 2nd version of the fix.

[-- Attachment #2: v2-0001-ox-odt-Avoid-putting-forbidden-characters-into-OD.patch --]

From de015e4a3b98bc975c2dcd1dfce7adcf77eb537c Mon Sep 17 00:00:00 2001
Message-ID: <de015e4a3b98bc975c2dcd1dfce7adcf77eb537c.1735294805.git.yantar92@posteo.net>
From: Ihor Radchenko <yantar92@posteo.net>
Date: Tue, 24 Dec 2024 15:11:22 +0100
Subject: [PATCH v2] ox-odt: Avoid putting forbidden characters into ODT xml

* lisp/ox-odt.el (org-odt-with-forbidden-chars): New export option to
control how to handle forbidden XML characters.
(org-odt--remove-forbidden): New filter removing/replacing forbidden
characters.

Reported-by: Joseph Turner <joseph@breatheoutbreathe.in>
Link: https://orgmode.org/list/87o711l4u4.fsf@christianmoe.com
---
 lisp/ox-odt.el | 43 ++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 42 insertions(+), 1 deletion(-)

diff --git a/lisp/ox-odt.el b/lisp/ox-odt.el
index ec81637ef0..635bf38971 100644
--- a/lisp/ox-odt.el
+++ b/lisp/ox-odt.el
@@ -94,7 +94,8 @@ (org-export-define-backend 'odt
 		   . (org-odt--translate-latex-fragments
 		      org-odt--translate-description-lists
 		      org-odt--translate-list-tables
-		      org-odt--translate-image-links)))
+		      org-odt--translate-image-links))
+                   (:filter-final-output . org-odt--remove-forbidden))
   :menu-entry
   '(?o "Export to ODT"
        ((?o "As ODT file" org-odt-export-to-odt)
@@ -108,6 +109,7 @@ (org-export-define-backend 'odt
     (:keywords "KEYWORDS" nil nil space)
     (:subtitle "SUBTITLE" nil nil parse)
     ;; Other variables.
+    (:odt-with-forbidden-chars nil nil org-odt-with-forbidden-chars)
     (:odt-content-template-file nil nil org-odt-content-template-file)
     (:odt-display-outline-level nil nil org-odt-display-outline-level)
     (:odt-fontify-srcblocks nil nil org-odt-fontify-srcblocks)
@@ -170,6 +172,14 @@ (defconst org-odt-special-string-regexps
     ("\\.\\.\\." . "&#x2026;"))		; hellip
   "Regular expressions for special string conversion.")
 
+(defconst org-odt-forbidden-char-re
+  (rx (not (in ?\N{U+9} ?\N{U+A} ?\N{U+D}
+               (?\N{U+20} . ?\N{U+D7FF})
+               (?\N{U+E000} . ?\N{U+FFFD})
+               (?\N{U+10000} . ?\N{U+10FFFF}))))
+  "Regexp matching forbidden XML1.0 characters.
+https://www.w3.org/TR/REC-xml/#charsets")
+
 (defconst org-odt-schema-dir-list
   (list (expand-file-name "./schema/" org-odt-data-dir))
   "List of directories to search for OpenDocument schema files.
@@ -364,6 +374,19 @@ (defgroup org-export-odt nil
   :tag "Org Export ODT"
   :group 'org-export)
 
+(defcustom org-odt-with-forbidden-chars ""
+  "String to replace forbidden XML characters.
+When set to t, forbidden characters are retained.
+When set to nil, an error is thrown.
+See `org-odt-forbidden-char-re' for the list of forbidden characters
+that cannot occur inside ODT documents.
+
+You may also consider export filters to perform more fine-grained
+replacements.  See info node `(org)Advanced Export Configuration'."
+  :package-version '(Org . "9.8")
+  :type '(choice (const :tag "Strip forbidden characters" t)
+                 (const :tag "Err when forbidden characters encountered" nil)
+                 (string :tag "Replacement string")))
 
 ;;;; Debugging
 
@@ -2892,6 +2915,24 @@ (defun org-odt--encode-tabs-and-spaces (line)
 	     (format " <text:s text:c=\"%d\"/>" (1- (length s)))))
    line))
 
+(defun org-odt--remove-forbidden (text _backend info)
+  "Remove forbidden and discouraged characters from TEXT.
+INFO is the communication plist"
+  (pcase (plist-get info :odt-with-forbidden-chars)
+    ((and (pred stringp) rep)
+     (prog1 (replace-regexp-in-string org-odt-forbidden-char-re rep text)
+       (when (match-string 0 text)
+         (display-warning
+          '(ox-odt ox-odt-with-forbidden-chars)
+          (format "Replacing forbidden character '%s' with '%s'"
+                  (match-string 0 text) rep)))))
+    (`nil
+     (if (string-match org-odt-forbidden-char-re text)
+         (error "Forbidden character '%s' found.  See `org-odt-with-forbidden-chars'"
+                (match-string 0 text))
+       text))
+    (_ text)))
+
 (defun org-odt--encode-plain-text (text &optional no-whitespace-filling)
   (dolist (pair '(("&" . "&amp;") ("<" . "&lt;") (">" . "&gt;")))
     (setq text (replace-regexp-in-string (car pair) (cdr pair) text t t)))
-- 
2.47.1

-- 
Ihor Radchenko // yantar92,
Org mode maintainer,
Learn more about Org mode at <https://orgmode.org/>.
Support Org development at <https://liberapay.com/org-mode>,
or support my work at <https://liberapay.com/yantar92>
* Re: Form feed characters break odt export 2024-12-27 10:21 ` Ihor Radchenko @ 2024-12-27 20:42 ` Joseph Turner 2024-12-28 8:32 ` Ihor Radchenko 0 siblings, 1 reply; 18+ messages in thread From: Joseph Turner @ 2024-12-27 20:42 UTC (permalink / raw) To: Ihor Radchenko; +Cc: Christian Moe, Org Mode Mailing List, Bohong Huang [-- Attachment #1: Type: text/plain, Size: 5065 bytes --] Ihor Radchenko <yantar92@posteo.net> writes: [...] > +(defconst org-odt-forbidden-char-re > + (rx (not (in ?\N{U+9} ?\N{U+A} ?\N{U+D} > + (?\N{U+20} . ?\N{U+D7FF}) > + (?\N{U+E000} . ?\N{U+FFFD}) > + (?\N{U+10000} . ?\N{U+10FFFF})))) Indentation mismatch ^ > + "Regexp matching forbidden XML1.0 characters. > +https://www.w3.org/TR/REC-xml/#charsets") > + > (defconst org-odt-schema-dir-list > (list (expand-file-name "./schema/" org-odt-data-dir)) > "List of directories to search for OpenDocument schema files. > @@ -364,6 +374,19 @@ (defgroup org-export-odt nil > :tag "Org Export ODT" > :group 'org-export) > > +(defcustom org-odt-with-forbidden-chars "" > + "String to replace forbidden XML characters. > +When set to t, forbidden characters are retained. > +When set to nil, an error is thrown. > +See `org-odt-forbidden-char-re' for the list of forbidden characters > +that cannot occur inside ODT documents. > + > +You may also consider export filters to perform more fine-grained > +replacements. See info node `(org)Advanced Export Configuration'." > + :package-version '(Org . "9.8") > + :type '(choice (const :tag "Strip forbidden characters" t) According to the docstring, the above tag should say "Leave forbidden characters as-is". See patch which slightly rewords the docstring too. 
> + (const :tag "Err when forbidden characters encountered" nil) > + (string :tag "Replacement string"))) > > ;;;; Debugging > > @@ -2892,6 +2915,24 @@ (defun org-odt--encode-tabs-and-spaces (line) > (format " <text:s text:c=\"%d\"/>" (1- (length s))))) > line)) > > +(defun org-odt--remove-forbidden (text _backend info) > + "Remove forbidden and discouraged characters from TEXT. > +INFO is the communication plist" > + (pcase (plist-get info :odt-with-forbidden-chars) Should we use pcase-exhaustive? > + ((and (pred stringp) rep) > + (prog1 (replace-regexp-in-string org-odt-forbidden-char-re rep text) > + (when (match-string 0 text) The replacement appears to work well on my machine, but there are unnecessary warnings. Run org-odt-export-to-odt on a buffer containing: --8<---------------cut here---------------start------------->8--- * foo bar --8<---------------cut here---------------end--------------->8--- the (match-string 0 text) form inside org-odt--remove-forbidden evals to "<?xml version=\"1.0\" " which causes the incorrect warning message "Warning (ox-odt): Replacing forbidden character '' with ''" Confusingly, `text' and the replacement text are string-equal, so it appears that no replacement has been made. I suspect that match-string and replace-regexp-in-string perhaps do not play well together. 
Try this out: (let* ((text "bar") (new (replace-regexp-in-string "r" "z" text))) new ; "baz", as expected (match-string 0 new) ; signals error (match-string 0 text)) ; signals error I get the following stack trace (for the first error): Debugger entered--Lisp error: (args-out-of-range "baz" 402 403) substring("baz" 402 403) (if string (substring string (match-beginning num) (match-end num)) (buffer-substring (match-beginning num) (match-end num))) (if (match-beginning num) (if string (substring string (match-beginning num) (match-end num)) (buffer-substring (match-beginning num) (match-end num)))) match-string(0 "baz") (let* ((text "bar") (new (replace-regexp-in-string "r" "z" text))) new (match-string 0 new) (match-string 0 text)) (progn (let* ((text "bar") (new (replace-regexp-in-string "r" "z" text))) new (match-string 0 new) (match-string 0 text))) (let ((print-level nil) (print-length nil)) (progn (let* ((text "bar") (new (replace-regexp-in-string "r" "z" text))) new (match-string 0 new) (match-string 0 text)))) (setq elisp--eval-defun-result (let ((print-level nil) (print-length nil)) (progn (let* ((text "bar") (new (replace-regexp-in-string "r" "z" text))) new (match-string 0 new) (match-string 0 text))))) elisp--eval-defun() #<subr eval-defun>(nil) edebug--eval-defun(#<subr eval-defun> nil) apply(edebug--eval-defun #<subr eval-defun> nil) eval-defun(nil) funcall-interactively(eval-defun nil) command-execute(eval-defun) Also with the replace-regexp-in-string design, there will only be one warning even with multiple forbidden characters. See patch below. > + (display-warning > + '(ox-odt ox-odt-with-forbidden-chars) > + (format "Replacing forbidden character '%s' with '%s'" > + (match-string 0 text) rep))))) > + (`nil > + (if (string-match org-odt-forbidden-char-re text) > + (error "Forbidden character '%s' found. 
See `org-odt-with-forbidden-chars'" > + (match-string 0 text)) > + text)) > + (_ text))) > + > (defun org-odt--encode-plain-text (text &optional no-whitespace-filling) > (dolist (pair '(("&" . "&") ("<" . "<") (">" . ">"))) > (setq text (replace-regexp-in-string (car pair) (cdr pair) text t t))) > -- > 2.47.1 [-- Warning: decoded text below may be mangled, UTF-8 assumed --] [-- Attachment #2: 0001-ox-odt-Avoid-putting-forbidden-characters-into-ODT-x.patch --] [-- Type: text/x-diff, Size: 4623 bytes --] From ce506caa0bffbd243a2aba384f75f7aaac7fdc4b Mon Sep 17 00:00:00 2001 From: Ihor Radchenko <yantar92@posteo.net> Date: Fri, 27 Dec 2024 10:21:02 +0000 Subject: [PATCH] ox-odt: Avoid putting forbidden characters into ODT xml * lisp/ox-odt.el (org-odt-with-forbidden-chars): New export option to control how to handle forbidden XML characters. (org-odt--remove-forbidden): New filter removing/replacing forbidden characters. Co-authored-by: Joseph Turner <joseph@breatheoutbreathe.in> Link: https://orgmode.org/list/87o711l4u4.fsf@christianmoe.com --- lisp/ox-odt.el | 51 +++++++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 50 insertions(+), 1 deletion(-) diff --git a/lisp/ox-odt.el b/lisp/ox-odt.el index ec81637ef..960bab286 100644 --- a/lisp/ox-odt.el +++ b/lisp/ox-odt.el @@ -94,7 +94,8 @@ (org-export-define-backend 'odt . (org-odt--translate-latex-fragments org-odt--translate-description-lists org-odt--translate-list-tables - org-odt--translate-image-links))) + org-odt--translate-image-links)) + (:filter-final-output . org-odt--remove-forbidden)) :menu-entry '(?o "Export to ODT" ((?o "As ODT file" org-odt-export-to-odt) @@ -108,6 +109,7 @@ (org-export-define-backend 'odt (:keywords "KEYWORDS" nil nil space) (:subtitle "SUBTITLE" nil nil parse) ;; Other variables. 
+ (:odt-with-forbidden-chars nil nil org-odt-with-forbidden-chars) (:odt-content-template-file nil nil org-odt-content-template-file) (:odt-display-outline-level nil nil org-odt-display-outline-level) (:odt-fontify-srcblocks nil nil org-odt-fontify-srcblocks) @@ -170,6 +172,14 @@ (defconst org-odt-special-string-regexps ("\\.\\.\\." . "…")) ; hellip "Regular expressions for special string conversion.") +(defconst org-odt-forbidden-char-re + (rx (not (in ?\N{U+9} ?\N{U+A} ?\N{U+D} + (?\N{U+20} . ?\N{U+D7FF}) + (?\N{U+E000} . ?\N{U+FFFD}) + (?\N{U+10000} . ?\N{U+10FFFF})))) + "Regexp matching forbidden XML1.0 characters. +https://www.w3.org/TR/REC-xml/#charsets") + (defconst org-odt-schema-dir-list (list (expand-file-name "./schema/" org-odt-data-dir)) "List of directories to search for OpenDocument schema files. @@ -364,6 +374,19 @@ (defgroup org-export-odt nil :tag "Org Export ODT" :group 'org-export) +(defcustom org-odt-with-forbidden-chars "" + "String to replace forbidden XML characters. +When set to t, forbidden characters are left as-is. +When set to nil, an error is thrown. +See `org-odt-forbidden-char-re' for the list of forbidden characters +that cannot occur inside ODT documents. + +You may also consider export filters to perform more fine-grained +replacements. See info node `(org)Advanced Export Configuration'." + :package-version '(Org . "9.8") + :type '(choice (const :tag "Leave forbidden characters as-is" t) + (const :tag "Err when forbidden characters encountered" nil) + (string :tag "Replacement string"))) ;;;; Debugging @@ -2892,6 +2915,32 @@ (defun org-odt--encode-tabs-and-spaces (line) (format " <text:s text:c=\"%d\"/>" (1- (length s))))) line)) +(defun org-odt--remove-forbidden (text _backend info) + "Remove forbidden and discouraged characters from TEXT. 
+INFO is the communication plist" + (pcase-exhaustive (plist-get info :odt-with-forbidden-chars) + ((and (pred stringp) rep) + (let ((replacements (make-hash-table :test 'equal))) + (with-temp-buffer + (insert text) + (goto-char (point-min)) + (while (re-search-forward org-odt-forbidden-char-re nil t) + (cl-incf (gethash (match-string 0) replacements 0)) + (replace-match rep)) + (cl-loop for forbidden being the hash-keys of replacements + using (hash-values count) + do (display-warning + '(ox-odt ox-odt-with-forbidden-chars) + (format "Replaced forbidden character '%s' with '%s' %d times" + forbidden rep count))) + (buffer-string)))) + (`nil + (if (string-match org-odt-forbidden-char-re text) + (error "Forbidden character '%s' found. See `org-odt-with-forbidden-chars'" + (match-string 0 text)) + text)) + ('t text))) + (defun org-odt--encode-plain-text (text &optional no-whitespace-filling) (dolist (pair '(("&" . "&amp;") ("<" . "&lt;") (">" . "&gt;"))) (setq text (replace-regexp-in-string (car pair) (cdr pair) text t t))) -- 2.46.0 [-- Attachment #3: Type: text/plain, Size: 21 bytes --] Thank you!! Joseph
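The match-data problem reported in this message can be sketched in isolation. The helper below is hypothetical (the name `my/replace-and-report` does not appear in the patch) and only illustrates one way around it, assuming the goal is to report the offending character in a warning: read the match right after `string-match`, against the same string that was searched, and only then call `replace-regexp-in-string`.

```elisp
;; Hypothetical helper, not part of the patch: report the first match and
;; return the replaced text without relying on match data afterwards.
(defun my/replace-and-report (regexp rep text)
  "Return (FIRST-MATCH . NEW-TEXT) for REGEXP replaced by REP in TEXT.
FIRST-MATCH is nil when REGEXP does not match."
  (let ((first-match (and (string-match regexp text)
                          ;; Read the match immediately, against the same
                          ;; string that was searched.
                          (match-string 0 text))))
    (cons first-match
          (replace-regexp-in-string regexp rep text t t))))

(my/replace-and-report "r" "z" "bar") ; => ("r" . "baz")
```

This still reports only the first forbidden character, which is the limitation the buffer-based loop in the attached patch addresses by counting every match.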
* Re: Form feed characters break odt export 2024-12-27 20:42 ` Joseph Turner @ 2024-12-28 8:32 ` Ihor Radchenko 2024-12-28 9:50 ` Joseph Turner 0 siblings, 1 reply; 18+ messages in thread From: Ihor Radchenko @ 2024-12-28 8:32 UTC (permalink / raw) To: Joseph Turner; +Cc: Christian Moe, Org Mode Mailing List, Bohong Huang [-- Attachment #1: Type: text/plain, Size: 402 bytes --] Joseph Turner <joseph@breatheoutbreathe.in> writes: > From ce506caa0bffbd243a2aba384f75f7aaac7fdc4b Mon Sep 17 00:00:00 2001 > From: Ihor Radchenko <yantar92@posteo.net> > Date: Fri, 27 Dec 2024 10:21:02 +0000 > Subject: [PATCH] ox-odt: Avoid putting forbidden characters into ODT xml Thanks for helping with the patch! I modified it further, adding ORG-NEWS entry announcing the new export option. [-- Attachment #2: v3-0001-ox-odt-Avoid-putting-forbidden-characters-into-OD.patch --] [-- Type: text/x-patch, Size: 6021 bytes --] From 89901da3a0d00598c5ac40cddb2f6dec7c7047cf Mon Sep 17 00:00:00 2001 Message-ID: <89901da3a0d00598c5ac40cddb2f6dec7c7047cf.1735374641.git.yantar92@posteo.net> From: Ihor Radchenko <yantar92@posteo.net> Date: Fri, 27 Dec 2024 10:21:02 +0000 Subject: [PATCH v3] ox-odt: Avoid putting forbidden characters into ODT xml * lisp/ox-odt.el (org-odt-with-forbidden-chars): New export option to control how to handle forbidden XML characters. (org-odt--remove-forbidden): New filter removing/replacing forbidden characters. * etc/ORG-NEWS (ox-odt: New export option ~org-odt-with-forbidden-chars~): Announce the new option. Co-authored-by: Joseph Turner <joseph@breatheoutbreathe.in> Link: https://orgmode.org/list/87o711l4u4.fsf@christianmoe.com --- etc/ORG-NEWS | 16 ++++++++++++++++ lisp/ox-odt.el | 51 +++++++++++++++++++++++++++++++++++++++++++++++++- 2 files changed, 66 insertions(+), 1 deletion(-) diff --git a/etc/ORG-NEWS b/etc/ORG-NEWS index d26813c983..a56e105481 100644 --- a/etc/ORG-NEWS +++ b/etc/ORG-NEWS @@ -182,6 +182,22 @@ now be pasted as an Org table using ~yank-media~. 
# adding new customizations, or changing the interpretation of the # existing customizations. +*** ox-odt: New export option ~org-odt-with-forbidden-chars~ + +The new export option controls how to deal with characters that are forbidden +inside ODT documents during export. + +The ODT documents must follow XML1.0 specification and cannot contain +certain unicode characters. For example, form feed characters like ^L +are disallowed. + +By default, =ox-odt= will strip such characters and display warning. +You may return to the previous behaviour by setting +~org-odt-with-forbidden-chars~ to t. + +Note that Emacs warnings can always be suppressed by clicking on ⛔ +symbol or by customizing ~warning-suppress-types~. + *** New option ~org-edit-keep-region~ Since Org 9.7, structure editing commands do not deactivate region diff --git a/lisp/ox-odt.el b/lisp/ox-odt.el index ec81637ef0..960bab286a 100644 --- a/lisp/ox-odt.el +++ b/lisp/ox-odt.el @@ -94,7 +94,8 @@ (org-export-define-backend 'odt . (org-odt--translate-latex-fragments org-odt--translate-description-lists org-odt--translate-list-tables - org-odt--translate-image-links))) + org-odt--translate-image-links)) + (:filter-final-output . org-odt--remove-forbidden)) :menu-entry '(?o "Export to ODT" ((?o "As ODT file" org-odt-export-to-odt) @@ -108,6 +109,7 @@ (org-export-define-backend 'odt (:keywords "KEYWORDS" nil nil space) (:subtitle "SUBTITLE" nil nil parse) ;; Other variables. + (:odt-with-forbidden-chars nil nil org-odt-with-forbidden-chars) (:odt-content-template-file nil nil org-odt-content-template-file) (:odt-display-outline-level nil nil org-odt-display-outline-level) (:odt-fontify-srcblocks nil nil org-odt-fontify-srcblocks) @@ -170,6 +172,14 @@ (defconst org-odt-special-string-regexps ("\\.\\.\\." . "…")) ; hellip "Regular expressions for special string conversion.") +(defconst org-odt-forbidden-char-re + (rx (not (in ?\N{U+9} ?\N{U+A} ?\N{U+D} + (?\N{U+20} . ?\N{U+D7FF}) + (?\N{U+E000} . 
?\N{U+FFFD}) + (?\N{U+10000} . ?\N{U+10FFFF})))) + "Regexp matching forbidden XML1.0 characters. +https://www.w3.org/TR/REC-xml/#charsets") + (defconst org-odt-schema-dir-list (list (expand-file-name "./schema/" org-odt-data-dir)) "List of directories to search for OpenDocument schema files. @@ -364,6 +374,19 @@ (defgroup org-export-odt nil :tag "Org Export ODT" :group 'org-export) +(defcustom org-odt-with-forbidden-chars "" + "String to replace forbidden XML characters. +When set to t, forbidden characters are left as-is. +When set to nil, an error is thrown. +See `org-odt-forbidden-char-re' for the list of forbidden characters +that cannot occur inside ODT documents. + +You may also consider export filters to perform more fine-grained +replacements. See info node `(org)Advanced Export Configuration'." + :package-version '(Org . "9.8") + :type '(choice (const :tag "Leave forbidden characters as-is" t) + (const :tag "Err when forbidden characters encountered" nil) + (string :tag "Replacement string"))) ;;;; Debugging @@ -2892,6 +2915,32 @@ (defun org-odt--encode-tabs-and-spaces (line) (format " <text:s text:c=\"%d\"/>" (1- (length s))))) line)) +(defun org-odt--remove-forbidden (text _backend info) + "Remove forbidden and discouraged characters from TEXT. 
+INFO is the communication plist" + (pcase-exhaustive (plist-get info :odt-with-forbidden-chars) + ((and (pred stringp) rep) + (let ((replacements (make-hash-table :test 'equal))) + (with-temp-buffer + (insert text) + (goto-char (point-min)) + (while (re-search-forward org-odt-forbidden-char-re nil t) + (cl-incf (gethash (match-string 0) replacements 0)) + (replace-match rep)) + (cl-loop for forbidden being the hash-keys of replacements + using (hash-values count) + do (display-warning + '(ox-odt ox-odt-with-forbidden-chars) + (format "Replaced forbidden character '%s' with '%s' %d times" + forbidden rep count))) + (buffer-string)))) + (`nil + (if (string-match org-odt-forbidden-char-re text) + (error "Forbidden character '%s' found. See `org-odt-with-forbidden-chars'" + (match-string 0 text)) + text)) + ('t text))) + (defun org-odt--encode-plain-text (text &optional no-whitespace-filling) (dolist (pair '(("&" . "&amp;") ("<" . "&lt;") (">" . "&gt;"))) (setq text (replace-regexp-in-string (car pair) (cdr pair) text t t))) -- 2.47.1 [-- Attachment #3: Type: text/plain, Size: 223 bytes --] -- Ihor Radchenko // yantar92, Org mode maintainer, Learn more about Org mode at <https://orgmode.org/>. Support Org development at <https://liberapay.com/org-mode>, or support my work at <https://liberapay.com/yantar92>
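For readers who want to try the option described in the ORG-NEWS entry above, a minimal init-file sketch follows. It assumes an Org release that already contains this patch (the defcustom is tagged for Org 9.8); the choice of a space as the replacement string is only an illustration, not a recommendation from the thread.

```elisp
;; Assumption: Org includes the patch above, so the option exists.
;; Replace each forbidden character with a space instead of deleting it.
;; Other values per the docstring: "" (default, strip), t (leave as-is),
;; nil (signal an error).
(setq org-odt-with-forbidden-chars " ")

;; Silence the warning that the filter emits for replaced characters.
;; The warning type is the one passed to `display-warning' in the patch.
(require 'warnings)
(add-to-list 'warning-suppress-types
             '(ox-odt ox-odt-with-forbidden-chars))
```

Suppressing the warning hides the per-character replacement counts, so it may be worth leaving it visible until the exported documents have been checked.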
* Re: Form feed characters break odt export
  2024-12-28  8:32 ` Ihor Radchenko
@ 2024-12-28  9:50 ` Joseph Turner
  2024-12-28 15:50 ` Ihor Radchenko
  0 siblings, 1 reply; 18+ messages in thread
From: Joseph Turner @ 2024-12-28 9:50 UTC (permalink / raw)
  To: Ihor Radchenko; +Cc: Christian Moe, Org Mode Mailing List, Bohong Huang

Ihor Radchenko <yantar92@posteo.net> writes:

> Thanks for helping with the patch!
> I modified it further, adding ORG-NEWS entry announcing the new export
> option.
>
> Subject: [PATCH v3] ox-odt: Avoid putting forbidden characters into ODT xml
> [...]

LGTM!  TIL about clicking on ⛔ and warning-suppress-types.

Thank you!

Joseph
* Re: Form feed characters break odt export
  2024-12-28  9:50 ` Joseph Turner
@ 2024-12-28 15:50 ` Ihor Radchenko
  0 siblings, 0 replies; 18+ messages in thread
From: Ihor Radchenko @ 2024-12-28 15:50 UTC (permalink / raw)
  To: Joseph Turner; +Cc: Christian Moe, Org Mode Mailing List, Bohong Huang

Joseph Turner <joseph@breatheoutbreathe.in> writes:

>> Subject: [PATCH v3] ox-odt: Avoid putting forbidden characters into ODT xml
>
> LGTM!  TIL about clicking on ⛔ and warning-suppress-types.  Thank you!

Thanks for checking!
Applied onto main.
https://git.savannah.gnu.org/cgit/emacs/org-mode.git/commit/?id=e16c9ed54f
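The docstring of the applied option points to export filters for more fine-grained replacements. A minimal sketch of such a filter follows. The function name is hypothetical, the space replacement is an arbitrary choice, and the sketch assumes the form feed ends up in a plain-text object of the parsed document; it has not been checked against the thread's original test file.

```elisp
(require 'ox)

;; Hypothetical filter name; the space replacement is an arbitrary choice.
(defun my/org-odt-form-feed-to-space (text backend _info)
  "Replace form feed characters in TEXT with a space for ODT export.
Return nil for other backends so that TEXT is left unchanged."
  (when (org-export-derived-backend-p backend 'odt)
    (replace-regexp-in-string "\f" " " text t t)))

(add-to-list 'org-export-filter-plain-text-functions
             #'my/org-odt-form-feed-to-space)
```

Because this runs on plain text before the final-output filter added by the patch, form feeds handled here would no longer be seen (or warned about) by `org-odt--remove-forbidden`; other forbidden characters would still be handled by the option.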
* Re: Form feed characters break odt export
  2024-12-24 11:04 ` Christian Moe
  2024-12-24 14:14 ` Ihor Radchenko
@ 2024-12-24 14:25 ` Max Nikulin
  2024-12-24 14:30 ` Ihor Radchenko
  1 sibling, 1 reply; 18+ messages in thread
From: Max Nikulin @ 2024-12-24 14:25 UTC (permalink / raw)
  To: emacs-orgmode

On 24/12/2024 18:04, Christian Moe wrote:
> I don't think it's specific to ODT or LibreOffice, it's the underlying
> XML 1.0 spec that "discourages" control characters and does not include
> #xC in the range of characters that XML processors must accept.

Pandoc retains "^L" in export to markdown, but replaces the line by a
space in .odt. I am curious if it is a dedicated output filter or just a
feature of XML writer.
* Re: Form feed characters break odt export
  2024-12-24 14:25 ` Max Nikulin
@ 2024-12-24 14:30 ` Ihor Radchenko
  0 siblings, 0 replies; 18+ messages in thread
From: Ihor Radchenko @ 2024-12-24 14:30 UTC (permalink / raw)
  To: Max Nikulin; +Cc: emacs-orgmode

Max Nikulin <manikulin@gmail.com> writes:

> On 24/12/2024 18:04, Christian Moe wrote:
>> I don't think it's specific to ODT or LibreOffice, it's the underlying
>> XML 1.0 spec that "discourages" control characters and does not include
>> #xC in the range of characters that XML processors must accept.
>
> Pandoc retains "^L" in export to markdown, but replaces the line by a
> space in .odt. I am curious if it is a dedicated output filter or just a
> feature of XML writer.

What about other control characters? Does pandoc also replace them with
space?