* Orgmode → ODT: Certain chars break export
From: Tory S. Anderson @ 2015-02-13 10:45 UTC
To: orgmode list

While we're on the topic of ODT export problems: I was in the process
of converting PDF to Text to Org to ODT/DocX and discovered that
certain characters seem to break exported odt documents, which fail
with a line and col number. So far the only one I know for sure is the
"\f" (Char: C-l (12, #o14, #xc)). Hopefully a single fix can handle
all such cases.

You probably don't need it, but I verified with the following file:
http://toryanderson.com/files/breakorg.org

Org-mode version 8.2.10 (8.2.10-32-gddaa1d-elpa)
* Re: Orgmode → ODT: Certain chars break export
From: Rasmus @ 2015-02-13 11:04 UTC
To: emacs-orgmode

torys.anderson@gmail.com (Tory S. Anderson) writes:

> While we're on the topic of ODT export problems: I was in the process
> of converting PDF to Text to Org to ODT/DocX and discovered that
> certain characters seem to break exported odt documents, which fail
> with a line and col number. So far the only one I know for sure is the
> "\f" (Char: C-l (12, #o14, #xc)). Hopefully a single fix can handle
> all such cases.
>
> You probably don't need it, but I verified with the following file:
> http://toryanderson.com/files/breakorg.org

The export is fine, but the produced XML is invalid since it contains an
illegal character. But how to resolve this? Should ox strip illegal
characters (if so, what are they)? If so, could they be used for entities?

—Rasmus

-- 
I hear there's rumors on the, uh, Internets. . .
* Re: Orgmode → ODT: Certain chars break export
From: Tory S. Anderson @ 2015-02-13 15:18 UTC
To: emacs-orgmode

From a user perspective just stripping the characters seems best to me,
but finding out what the characters are seems obnoxious. Neither a quick
search nor skimming the ODT doc specification[1][2] seems to give any
insight into a set of illegal characters. Does elisp have anything
similar to Java's "isWhitespace"[3] that could be used to check
character features?

Rasmus <rasmus@gmx.us> writes:

> torys.anderson@gmail.com (Tory S. Anderson) writes:
>
>> While we're on the topic of ODT export problems: I was in the process
>> of converting PDF to Text to Org to ODT/DocX and discovered that
>> certain characters seem to break exported odt documents, which fail
>> with a line and col number. So far the only one I know for sure is the
>> "\f" (Char: C-l (12, #o14, #xc)). Hopefully a single fix can handle
>> all such cases.
>>
>> You probably don't need it, but I verified with the following file:
>> http://toryanderson.com/files/breakorg.org
>
> The export is fine, but the produced XML is invalid since it contains an
> illegal character. But how to resolve this? Should ox strip illegal
> characters (if so, what are they)? If so, could they be used for entities?
>
> —Rasmus

Footnotes:
[1] https://www.oasis-open.org/committees/tc_home.php?wg_abbrev=office
[2] http://docs.oasis-open.org/office/v1.2/os/OpenDocument-v1.2-os-part1.html#__RefHeading__1415196_253892949
[3] http://www.fileformat.info/info/unicode/char/000c/index.htm
* Re: Orgmode → ODT: Certain chars break export
From: Rasmus @ 2015-02-13 16:07 UTC
To: emacs-orgmode

torys.anderson@gmail.com (Tory S. Anderson) writes:

> From a user perspective just stripping the characters seems best to
> me, but finding out what the characters are seems obnoxious.

But maybe there is a valid way to represent such characters in XML? At
the very least, entities must be replaced before stripping these...

> Neither a quick search nor skimming the ODT doc specification[1][2] seems
> to give any insight into a set of illegal characters. Does elisp have
> anything similar to Java's "isWhitespace"[3] that could be used to check
> character features?

It's an XML thing. When I tried to open content.xml with Firefox, it
also reported broken XML. But I also don't know which characters are
not supported by XML.

—Rasmus

-- 
This space is left intentionally blank
* Re: Orgmode → ODT: Certain chars break export
From: Tory S. Anderson @ 2015-02-13 16:41 UTC
To: Rasmus; +Cc: emacs-orgmode

Now that you have pointed to XML, there is a helpful wiki page; it even
mentions my specific character.[1] The primary source seems to be the
w3.org spec.[2]

Rasmus <rasmus@gmx.us> writes:

> torys.anderson@gmail.com (Tory S. Anderson) writes:
>
>> From a user perspective just stripping the characters seems best to
>> me, but finding out what the characters are seems obnoxious.
>
> But maybe there is a valid way to represent such characters in XML? At
> the very least, entities must be replaced before stripping these...
>
>> Neither a quick search nor skimming the ODT doc specification[1][2] seems
>> to give any insight into a set of illegal characters. Does elisp have
>> anything similar to Java's "isWhitespace"[3] that could be used to check
>> character features?
>
> It's an XML thing. When I tried to open content.xml with Firefox, it
> also reported broken XML. But I also don't know which characters are
> not supported by XML.
>
> —Rasmus

Footnotes:
[1] https://en.wikipedia.org/wiki/Valid_characters_in_XML#XML_1.1
[2] http://www.w3.org/TR/xml11/#charsets
* Re: Orgmode → ODT: Certain chars break export
From: Rasmus @ 2015-02-14 1:18 UTC
To: emacs-orgmode

torys.anderson@gmail.com (Tory S. Anderson) writes:

> There is a helpful wiki page now that you found XML; it even mentions
> my specific character.[1] The main source seems to be at the w3.org
> spec.[2]

I don't understand Unicode well enough to propose a solution. For now
you could use org-export-before-parsing-hook,
org-export-filter-final-output-functions, or maybe
org-export-filter-body-functions to solve the issue locally.

—Rasmus

-- 
Governments should be afraid of their people
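The local workaround suggested here, a filter on `org-export-filter-final-output-functions`, might be sketched as follows. This is only a minimal sketch under stated assumptions: the names `my/xml-strip-control-chars` and `my/org-odt-strip-control-chars` are hypothetical, only the C0 control characters that XML 1.0 forbids are handled (which covers the form feed reported in this thread), and it assumes the ODT back-end passes its generated XML through the final-output filters.

```elisp
(require 'ox)

(defun my/xml-strip-control-chars (text)
  "Return TEXT without the C0 control characters that XML 1.0 forbids.
Tab (#x9), newline (#xA) and carriage return (#xD) are kept.
Hypothetical helper, not part of Org."
  (replace-regexp-in-string
   (rx (any (#x0 . #x8) (#xB . #xC) (#xE . #x1F)))
   "" text t t))

(defun my/org-odt-strip-control-chars (text backend _info)
  "Export filter: clean TEXT when BACKEND is `odt' or derived from it.
Org calls filter functions with the string, the back-end symbol and
the communication channel.  Hypothetical name."
  (if (org-export-derived-backend-p backend 'odt)
      (my/xml-strip-control-chars text)
    text))

;; Register the filter; other back-ends are left untouched.
(add-to-list 'org-export-filter-final-output-functions
             #'my/org-odt-strip-control-chars)
```

Silently deleting characters can join words that a page break used to separate, so replacing them with a space or a newline may be the better choice depending on the input.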
* Re: Orgmode → ODT: Certain chars break export
From: Vaidheeswaran @ 2015-02-14 8:50 UTC
Cc: orgmode list

On Friday 13 February 2015 04:15 PM, Tory S. Anderson wrote:

> While we're on the topic of ODT export problems: I was in the process
> of converting PDF to Text to Org to ODT/DocX and discovered that
> certain characters seem to break exported odt documents, which fail
> with a line and col number. So far the only one I know for sure is the
> "\f" (Char: C-l (12, #o14, #xc)). Hopefully a single fix can handle
> all such cases.
>
> You probably don't need it, but I verified with the following file:
> http://toryanderson.com/files/breakorg.org
>
> Org-mode version 8.2.10 (8.2.10-32-gddaa1d-elpa)

I assume that you are using pdftotext. In that case, you can use the
following argument.

    -nopgbrk : don't insert page breaks between pages

That said, it is very difficult to say what the right action should be
when encountering ^L or other problematic characters. Much depends on
the context. Neither an outright removal nor a replacement with a
single SPC, a NEWLINE or a double NEWLINE may be satisfactory.

Specifically, in the pdftotext case above, I believe the best action
would be to M-x flush-lines that match ^L so that page headers are
stripped.

----------------------------------------------------------------

From the exporter side of things, the best that one could do is to
catch such exceptional cases and report them to the user for further
repair. That is, instead of waiting for LibreOffice to catch this
exception and leave the user in utter confusion, the export backend
could catch the error early in the export process and report a useful
message.

A variation of the following snippet can be used for catching the error
early.
(add-hook 'org-export-before-parsing-hook
          (lambda (backend)
            (when (eq backend 'odt)
              (goto-char (point-min))
              ;; Stop at the first character that XML 1.0 does not allow.
              (when (re-search-forward
                     (rx-to-string
                      '(or (in (#x0 . #x8))
                           (in (#xB . #xC))
                           (in (#xE . #x1F))
                           (in (#xD800 . #xDFFF))
                           (in (#xFFFE . #xFFFF))
                           (in (#x110000 . #x3FFFFF))))
                     nil t)
                (user-error "Input file has a problematic char [%s]."
                            (format "#x%x" (string-to-char (match-string 0))))))))

The following snippet could be used for outright removal of problematic
characters.

(add-hook 'org-export-before-parsing-hook
          (lambda (backend)
            (when (eq backend 'odt)
              (goto-char (point-min))
              ;; Loop so that every run of problematic characters is
              ;; removed, not only the first one.
              (while (re-search-forward
                      (rx-to-string
                       '(one-or-more
                         (or (in (#x0 . #x8))
                             (in (#xB . #xC))
                             (in (#xE . #x1F))
                             (in (#xD800 . #xDFFF))
                             (in (#xFFFE . #xFFFF))
                             (in (#x110000 . #x3FFFFF)))))
                      nil t)
                (replace-match "" t t)))))

----------------------------------------------------------------

Note to the developers:

1. xmltok.el has `xmltok-valid-char-p'.

2. From http://www.w3.org/TR/2008/REC-xml-20081126/#charsets

   /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
   Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

   Document authors are encouraged to avoid "compatibility
   characters", as defined in section 2.3 of [Unicode]. The characters
   defined in the following ranges are also discouraged. They are
   either control characters or permanently undefined Unicode
   characters:

   [#x7F-#x84], [#x86-#x9F], [#xFDD0-#xFDEF],
   [#x1FFFE-#x1FFFF], [#x2FFFE-#x2FFFF], [#x3FFFE-#x3FFFF],
   [#x4FFFE-#x4FFFF], [#x5FFFE-#x5FFFF], [#x6FFFE-#x6FFFF],
   [#x7FFFE-#x7FFFF], [#x8FFFE-#x8FFFF], [#x9FFFE-#x9FFFF],
   [#xAFFFE-#xAFFFF], [#xBFFFE-#xBFFFF], [#xCFFFE-#xCFFFF],
   [#xDFFFE-#xDFFFF], [#xEFFFE-#xEFFFF], [#xFFFFE-#xFFFFF],
   [#x10FFFE-#x10FFFF].

----------------------------------------------------------------
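The `Char` production quoted from the XML 1.0 specification can also be written directly as a predicate, which avoids maintaining the complementary ranges by hand. The following is a minimal sketch; `my/xml10-char-p` and `my/xml10-first-invalid-char` are hypothetical names, and this is not a description of how `xmltok-valid-char-p` is implemented.

```elisp
(defun my/xml10-char-p (c)
  "Return non-nil if character code C matches the XML 1.0 `Char' production.
Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD]
       | [#x10000-#x10FFFF].  Hypothetical helper."
  (or (memql c '(#x9 #xA #xD))
      (<= #x20 c #xD7FF)
      (<= #xE000 c #xFFFD)
      (<= #x10000 c #x10FFFF)))

(defun my/xml10-first-invalid-char (string)
  "Return (INDEX . CHAR) for the first character of STRING that XML 1.0
does not allow, or nil if every character is allowed."
  (let ((i 0)
        (n (length string))
        found)
    (while (and (not found) (< i n))
      (let ((c (aref string i)))
        (unless (my/xml10-char-p c)
          (setq found (cons i c))))
      (setq i (1+ i)))
    found))
```

For example, `(my/xml10-first-invalid-char "ab\fc")` returns `(2 . 12)`, that is, the form feed at index 2, which is the kind of detail an early error message could report to the user.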
* Re: Orgmode → ODT: Certain chars break export
From: Vaidheeswaran @ 2015-02-14 10:43 UTC
To: emacs-orgmode

On Saturday 14 February 2015 02:20 PM, Vaidheeswaran wrote:

> Specifically, in the pdftotext case above, I believe the best action
> would be to M-x flush-lines that match ^L so that page headers are
> stripped.

I was writing from memory. I should have said this instead:

The best action would be to flush the page headers 'surrounding' ^L and
to 'splice' the paragraph lines (that are split apart) at the page
breaks. Essentially, for a correct repair, human intervention is the
rule rather than the exception.