From mboxrd@z Thu Jan 1 00:00:00 1970
From: Ihor Radchenko
To: Christian Moe
Cc: Joseph Turner , Org Mode Mailing List , Bohong Huang
Subject: Re: Form feed characters break odt export
In-Reply-To: <87o711l4u4.fsf@christianmoe.com>
References: <87ed21hkmi.fsf@breatheoutbreathe.in> <87ikrajoe4.fsf@localhost> <87o711l4u4.fsf@christianmoe.com>
Date: Tue, 24 Dec 2024 14:14:17 +0000
Message-ID: <87jzbpgocm.fsf@localhost>
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="=-=-="
List-Id: "General discussions about Org-mode."

--=-=-=
Content-Type: text/plain

Christian Moe writes:

> I don't think it's specific to ODT or LibreOffice, it's the underlying
> XML 1.0 spec that "discourages" control characters and does not include
> #xC in the range of characters that XML processors must accept.
>
> Spec: https://www.w3.org/TR/REC-xml/#charsets
>
> Some discussion:
> https://stackoverflow.com/questions/404107/why-are-control-characters-illegal-in-xml-1-0

Thanks!
Then, we can simply remove the disallowed characters.
See the attached tentative patch.

--=-=-=
Content-Type: text/x-patch
Content-Disposition: inline;
 filename=0001-ox-odt-Avoid-putting-forbidden-characters-into-ODT-x.patch

>From 5c9fdd9df32b87be9f81e037336332984bc3b16c Mon Sep 17 00:00:00 2001
Message-ID: <5c9fdd9df32b87be9f81e037336332984bc3b16c.1735049605.git.yantar92@posteo.net>
From: Ihor Radchenko
Date: Tue, 24 Dec 2024 15:11:22 +0100
Subject: [PATCH] ox-odt: Avoid putting forbidden characters into ODT xml

* lisp/ox-odt.el (org-odt-forbidden-char-re):
(org-odt-discouraged-char-re): New constants codifying characters that
are prohibited in XML spec.
(org-odt--remove-forbidden): New function removing the prohibited
characters.
(org-odt--encode-plain-text): Remove the prohibited characters.
(org-odt-plain-text): Update comment.

Reported-by: Joseph Turner
Link: https://orgmode.org/list/87o711l4u4.fsf@christianmoe.com
---
 lisp/ox-odt.el | 38 +++++++++++++++++++++++++++++++++++---
 1 file changed, 35 insertions(+), 3 deletions(-)

diff --git a/lisp/ox-odt.el b/lisp/ox-odt.el
index ec81637ef0..61c8d4ec75 100644
--- a/lisp/ox-odt.el
+++ b/lisp/ox-odt.el
@@ -170,6 +170,28 @@ (defconst org-odt-special-string-regexps
     ("\\.\\.\\." . "…"))  ; hellip
   "Regular expressions for special string conversion.")
 
+(defconst org-odt-forbidden-char-re
+  (rx (not (in ?\N{U+9} ?\N{U+A} ?\N{U+D}
+               (?\N{U+20} . ?\N{U+D7FF})
+               (?\N{U+E000} . ?\N{U+FFFD})
+               (?\N{U+10000} . ?\N{U+10FFFF}))))
+  "Regexp matching forbidden XML1.0 characters.
+https://www.w3.org/TR/REC-xml/#charsets")
+
+(defconst org-odt-discouraged-char-re
+  (rx (in (?\N{U+7F} . ?\N{U+84}) (?\N{U+86} . ?\N{U+9F})
+          (?\N{U+FDD0} . ?\N{U+FDEF}) (?\N{U+1FFFE} . ?\N{U+1FFFF})
+          (?\N{U+2FFFE} . ?\N{U+2FFFF}) (?\N{U+3FFFE} . ?\N{U+3FFFF})
+          (?\N{U+4FFFE} . ?\N{U+4FFFF}) (?\N{U+5FFFE} . ?\N{U+5FFFF})
+          (?\N{U+6FFFE} . ?\N{U+6FFFF}) (?\N{U+7FFFE} . ?\N{U+7FFFF})
+          (?\N{U+8FFFE} . ?\N{U+8FFFF}) (?\N{U+9FFFE} . ?\N{U+9FFFF})
+          (?\N{U+AFFFE} . ?\N{U+AFFFF}) (?\N{U+BFFFE} . ?\N{U+BFFFF})
+          (?\N{U+CFFFE} . ?\N{U+CFFFF}) (?\N{U+DFFFE} . ?\N{U+DFFFF})
+          (?\N{U+EFFFE} . ?\N{U+EFFFF}) (?\N{U+FFFFE} . ?\N{U+FFFFF})
+          (?\N{U+10FFFE} . ?\N{U+10FFFF})))
+  "Regexp matching discouraged XML1.0 characters.
+https://www.w3.org/TR/REC-xml/#charsets")
+
 (defconst org-odt-schema-dir-list
   (list (expand-file-name "./schema/" org-odt-data-dir))
   "List of directories to search for OpenDocument schema files.
@@ -2892,18 +2914,28 @@ (defun org-odt--encode-tabs-and-spaces (line)
        (format " " (1- (length s)))))
    line))
 
+(defun org-odt--remove-forbidden (text)
+  "Remove forbidden and discouraged characters from TEXT.
+https://www.w3.org/TR/REC-xml/#charsets"
+  (replace-regexp-in-string
+   org-odt-forbidden-char-re ""
+   (replace-regexp-in-string
+    org-odt-discouraged-char-re ""
+    text)))
+
 (defun org-odt--encode-plain-text (text &optional no-whitespace-filling)
   (dolist (pair '(("&" . "&amp;") ("<" . "&lt;") (">" . "&gt;")))
     (setq text (replace-regexp-in-string (car pair) (cdr pair) text t t)))
-  (if no-whitespace-filling text
-    (org-odt--encode-tabs-and-spaces text)))
+  (org-odt--remove-forbidden
+   (if no-whitespace-filling text
+     (org-odt--encode-tabs-and-spaces text))))
 
 (defun org-odt-plain-text (text info)
   "Transcode a TEXT string from Org to ODT.
 TEXT is the string to transcode.  INFO is a plist holding
 contextual information."
   (let ((output text))
-    ;; Protect &, < and >.
+    ;; Protect &, < and >, and remove forbidden characters.
     (setq output (org-odt--encode-plain-text output t))
     ;; Handle smart quotes.  Be sure to provide original string since
     ;; OUTPUT may have been modified.
-- 
2.47.1

--=-=-=
Content-Type: text/plain

-- 
Ihor Radchenko // yantar92,
Org mode maintainer,
Learn more about Org mode at .
Support Org development at ,
or support my work at

--=-=-=--
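
For anyone who wants to try the approach without applying the patch, here is a minimal standalone sketch. The `my-xml-` names are hypothetical and exist only for this illustration; the rx form is the one used for org-odt-forbidden-char-re in the patch above. It only shows that a form feed (U+000C) falls outside the character ranges XML 1.0 allows and is therefore dropped, while tab and newline are kept.

```emacs-lisp
;; Illustration only: the `my-xml-' names are hypothetical and are not
;; part of ox-odt.el.
(require 'rx)

(defconst my-xml-forbidden-char-re
  ;; Negation of the XML 1.0 Char production:
  ;; #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
  (rx (not (in ?\N{U+9} ?\N{U+A} ?\N{U+D}
               (?\N{U+20} . ?\N{U+D7FF})
               (?\N{U+E000} . ?\N{U+FFFD})
               (?\N{U+10000} . ?\N{U+10FFFF}))))
  "Regexp matching any character outside the XML 1.0 Char production.")

(defun my-xml-strip-forbidden (text)
  "Return TEXT with characters forbidden by XML 1.0 removed."
  (replace-regexp-in-string my-xml-forbidden-char-re "" text))

;; Form feed is outside the allowed ranges, so it is removed.
(my-xml-strip-forbidden "page one\fpage two") ; => "page onepage two"

;; Tab and newline are explicitly allowed, so they are kept.
(my-xml-strip-forbidden "a\tb\n")             ; => "a\tb\n"
```

Note that this removes the character rather than replacing it, so a form feed used as a page separator disappears from the exported text without leaving a space behind.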