From: Joseph Turner
To: Ihor Radchenko
Cc: Christian Moe, Org Mode Mailing List, Bohong Huang
Subject: Re: Form feed characters break odt export
In-Reply-To: <87jzbpgocm.fsf@localhost> (Ihor Radchenko's message of "Tue, 24 Dec 2024 14:14:17 +0000")
References: <87ed21hkmi.fsf@breatheoutbreathe.in> <87ikrajoe4.fsf@localhost> <87o711l4u4.fsf@christianmoe.com> <87jzbpgocm.fsf@localhost>
Date: Wed, 25 Dec 2024 02:10:14 -0800
Message-ID: <878qs4oyyh.fsf@breatheoutbreathe.in>
MIME-Version: 1.0
Content-Type: text/plain
List-Id: "General discussions about Org-mode."

Ihor Radchenko writes:

> Christian Moe writes:
>
>> I don't think it's specific to ODT or LibreOffice, it's the underlying
>> XML 1.0 spec that "discourages" control characters and does not include
>> #xC in the range of characters that XML processors must accept.
>>
>> Spec: https://www.w3.org/TR/REC-xml/#charsets
>>
>> Some discussion:
>> https://stackoverflow.com/questions/404107/why-are-control-characters-illegal-in-xml-1-0
>
> Thanks!
> Then, we can simply remove the disallowed characters.
> See the attached tentative patch.
>
> From 5c9fdd9df32b87be9f81e037336332984bc3b16c Mon Sep 17 00:00:00 2001
> Message-ID: <5c9fdd9df32b87be9f81e037336332984bc3b16c.1735049605.git.yantar92@posteo.net>
> From: Ihor Radchenko
> Date: Tue, 24 Dec 2024 15:11:22 +0100
> Subject: [PATCH] ox-odt: Avoid putting forbidden characters into ODT xml
>
> * lisp/ox-odt.el (org-odt-forbidden-char-re):
> (org-odt-discouraged-char-re): New constants codifying characters that
> are prohibited in XML spec.
> (org-odt--remove-forbidden): New function removing the prohibited
> characters.
> (org-odt--encode-plain-text): Remove the prohibited characters.
> (org-odt-plain-text): Update comment.
>
> Reported-by: Joseph Turner
> Link: https://orgmode.org/list/87o711l4u4.fsf@christianmoe.com
> ---
>  lisp/ox-odt.el | 38 +++++++++++++++++++++++++++++++++++---
>  1 file changed, 35 insertions(+), 3 deletions(-)
>
> diff --git a/lisp/ox-odt.el b/lisp/ox-odt.el
> index ec81637ef0..61c8d4ec75 100644
> --- a/lisp/ox-odt.el
> +++ b/lisp/ox-odt.el
> @@ -170,6 +170,28 @@ (defconst org-odt-special-string-regexps
>      ("\\.\\.\\." . "&#x2026;"))  ; hellip
>    "Regular expressions for special string conversion.")
>  
> +(defconst org-odt-forbidden-char-re
> +  (rx (not (in ?\N{U+9} ?\N{U+A} ?\N{U+D}
> +               (?\N{U+20} . ?\N{U+D7FF})
> +               (?\N{U+E000} . ?\N{U+FFFD})
> +               (?\N{U+10000} . ?\N{U+10FFFF}))))
> +  "Regexp matching forbidden XML1.0 characters.
> +https://www.w3.org/TR/REC-xml/#charsets")
> +
> +(defconst org-odt-discouraged-char-re
> +  (rx (in (?\N{U+7F} . ?\N{U+84}) (?\N{U+86} . ?\N{U+9F})
> +          (?\N{U+FDD0} . ?\N{U+FDEF}) (?\N{U+1FFFE} . ?\N{U+1FFFF})
> +          (?\N{U+2FFFE} . ?\N{U+2FFFF}) (?\N{U+3FFFE} . ?\N{U+3FFFF})
> +          (?\N{U+4FFFE} . ?\N{U+4FFFF}) (?\N{U+5FFFE} . ?\N{U+5FFFF})
> +          (?\N{U+6FFFE} . ?\N{U+6FFFF}) (?\N{U+7FFFE} . ?\N{U+7FFFF})
> +          (?\N{U+8FFFE} . ?\N{U+8FFFF}) (?\N{U+9FFFE} . ?\N{U+9FFFF})
> +          (?\N{U+AFFFE} . ?\N{U+AFFFF}) (?\N{U+BFFFE} . ?\N{U+BFFFF})
> +          (?\N{U+CFFFE} . ?\N{U+CFFFF}) (?\N{U+DFFFE} . ?\N{U+DFFFF})
> +          (?\N{U+EFFFE} . ?\N{U+EFFFF}) (?\N{U+FFFFE} . ?\N{U+FFFFF})
> +          (?\N{U+10FFFE} . ?\N{U+10FFFF})))
> +  "Regexp matching discouraged XML1.0 characters.
> +https://www.w3.org/TR/REC-xml/#charsets")
> +
>  (defconst org-odt-schema-dir-list
>    (list (expand-file-name "./schema/" org-odt-data-dir))
>    "List of directories to search for OpenDocument schema files.
> @@ -2892,18 +2914,28 @@ (defun org-odt--encode-tabs-and-spaces (line)
>         (format " <text:s text:c=\"%d\"/>" (1- (length s)))))
>     line))
>  
> +(defun org-odt--remove-forbidden (text)
> +  "Remove forbidden and discouraged characters from TEXT.
> +https://www.w3.org/TR/REC-xml/#charsets"
> +  (replace-regexp-in-string
> +   org-odt-forbidden-char-re ""
> +   (replace-regexp-in-string
> +    org-odt-discouraged-char-re ""
> +    text)))
> +
>  (defun org-odt--encode-plain-text (text &optional no-whitespace-filling)
>    (dolist (pair '(("&" . "&amp;") ("<" . "&lt;") (">" . "&gt;")))
>      (setq text (replace-regexp-in-string (car pair) (cdr pair) text t t)))
> -  (if no-whitespace-filling text
> -    (org-odt--encode-tabs-and-spaces text)))
> +  (org-odt--remove-forbidden
> +   (if no-whitespace-filling text
> +     (org-odt--encode-tabs-and-spaces text))))
>  
>  (defun org-odt-plain-text (text info)
>    "Transcode a TEXT string from Org to ODT.
>  TEXT is the string to transcode.  INFO is a plist holding
>  contextual information."
>    (let ((output text))
> -    ;; Protect &, < and >.
> +    ;; Protect &, < and >, and remove forbidden characters.
>      (setq output (org-odt--encode-plain-text output t))
>      ;; Handle smart quotes.  Be sure to provide original string since
>      ;; OUTPUT may have been modified.
> --
> 2.47.1

Thanks, Ihor! Tested working on my machine.

Here's another potential solution to consider, which adds a defcustom
to let the user decide how to handle forbidden characters:

https://github.com/kjambunathan/org-mode-ox-odt/commit/07fde1e9b7cdda3e3ef8136f5b1d478499dfd780

Joseph
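
For anyone following along, here is a minimal standalone sketch of the
approach taken in the quoted patch: delete every character that falls
outside the XML 1.0 Char range. The `my-xml-` names are hypothetical and
exist only for illustration; the patch itself defines
`org-odt-forbidden-char-re` and `org-odt--remove-forbidden`, and it also
strips the "discouraged" ranges, which this sketch leaves out.

```elisp
;; Standalone illustration; the `my-xml-' names are hypothetical and
;; are not part of Org.
(require 'rx)

(defconst my-xml-forbidden-char-re
  ;; Same character set as in the quoted patch: anything that is NOT
  ;; tab, newline, carriage return, or in the listed ranges.
  (rx (not (in ?\N{U+9} ?\N{U+A} ?\N{U+D}
               (?\N{U+20} . ?\N{U+D7FF})
               (?\N{U+E000} . ?\N{U+FFFD})
               (?\N{U+10000} . ?\N{U+10FFFF}))))
  "Regexp matching characters outside the XML 1.0 Char range.")

(defun my-xml-remove-forbidden (text)
  "Return TEXT with characters forbidden by XML 1.0 removed."
  (replace-regexp-in-string my-xml-forbidden-char-re "" text t t))

;; The form feed (U+000C) is dropped; tab and newline are allowed
;; characters and stay in place.
(my-xml-remove-forbidden "page one\fpage two\tend\n")
;; => "page onepage two\tend\n"
```

Note that the form feed is deleted rather than replaced, so the two
words on either side of it end up joined. Whether deletion is the right
policy is what the defcustom in the linked commit leaves to the user.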