From mboxrd@z Thu Jan 1 00:00:00 1970
From: Ricardo Wurmus
Newsgroups: gmane.lisp.guile.bugs
Subject: bug#20339: sxml simple: sxml->xml mishandles namespaces?
Date: Mon, 04 Feb 2019 21:44:02 +0100
Message-ID: <87a7jbi8rx.fsf@elephly.net>
References: <20150415194714.GA30295@tuxteam.de> <87y45vln0f.fsf@pobox.com> <20160713132403.GA2349@tuxteam.de> <87furc1qeu.fsf@pobox.com>
In-reply-to: <87furc1qeu.fsf@pobox.com>
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="=-=-="
User-Agent: mu4e 1.0; emacs 26.1
Cc: 20339@debbugs.gnu.org
To: Andy Wingo

--=-=-=
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit

Hello!

I just looked at this again and I think I came up with something useful.
Here’s some context:

Andy Wingo writes:

> Hi :)
>
> On Wed 13 Jul 2016 15:24, tomas@tuxteam.de writes:
>
>> Referring to Oleg Kiseliov's paper [1], there are actually three
>> things involved:
>
> This summary is helpful, thanks.

>> What is missing? From my point of view:
>>
>>  - At xml->sxml time, the user doesn't know which namespaces
>>    are in the xml. So it would be nice if the XML parser
>>    could provide that.
>
> For some documents you do know, of course.
>
> And for larger perspective, I think that SSAX gives you all the tools
> you need to build specialist and very flexible XML parsers.  So to an
> extent solving the general problem isn't necessary -- we can always
> point people to SSAX.  But that's a bit rude ;) so if there are common
> patterns we should try to capture them in xml->sxml.  I see this bug as
> being a search for those patterns, but without the requirement of
> solving the problem in its most general form.
>
>>  - It would be super-nice if the XML parser could put that
>>    into the same nodes it found it, as described in [1]
>>    (i.e. in the (*NAMESPACES* ...) pseudo-attribute).
>>    This way we wouldn't have a global mapping, but one
>>    that resembles the original XML, even with the same
>>    prefixes. Less surprises overall. The round trip
>>    xml -> sxml -> xml would be (nearly) the identity.
>>
>>    With Ricardo's patch it would lump all the namespace
>>    declarations up in the top node, which formally is
>>    correct, but might scare XML people a bit :-)
>
> ACK.
>
>>  - At sxml->xml time there should be a way to somehow
>>    generate prefixes for "new" namespaces. I don't know
>>    at the moment how this would work, that depends on
>>    how the user is supposed to insert new nodes in the
>>    SXML. Does she specify the namespace? Both prefix
>>    (aka namespace-id, under my current assumption) *and*
>>    namespace? (note that the namespace-id/prefix alone
>>    wouldn't be sufficient).
>
> ACK.
>
> What do you think the next step is?  I am happy to wait FWIW, dunno if
> Ricardo has any feelings here.

Attached is a patch that does the requested things.

The parser procedures like FINISH-ELEMENT have access to all the
namespaces, so I changed the FINISH-ELEMENT procedure to return the list
of namespaces in addition to its SXML tree return value.

I changed name->sxml to use only the namespace aliases / abbreviations
instead of the namespace URIs.
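
To make the lookup concrete, here is a minimal standalone sketch.  It
assumes namespace entries of the shape (ABBREV URI . REST), as matched in
the attached patch; the names uri->abbrev and qualified-name, the example
data, and the fallback to the bare local part for an unknown URI are all
hypothetical and not part of (sxml simple).

```scheme
(use-modules (ice-9 match)
             (srfi srfi-1))

;; Hypothetical example data: each entry is (ABBREV URI), which matches
;; the (abbrev uri . rest) pattern with REST being the empty list.
(define atom-uri (string->symbol "http://www.w3.org/2005/Atom"))
(define example-namespaces
  (list (list 'atom atom-uri)
        (list 'xhtml (string->symbol "http://www.w3.org/1999/xhtml"))))

(define (uri->abbrev uri namespaces)
  "Return the abbreviation recorded for URI in NAMESPACES, or #f."
  ;; Linear search: the whole list is walked for every name.
  (and=> (find (match-lambda
                 ((abbrev ns-uri . _) (eq? ns-uri uri))
                 (_ #f))
               namespaces)
         first))

(define (qualified-name name namespaces)
  "Turn (URI . LOCAL-PART) into the symbol ABBREV:LOCAL-PART."
  (match name
    ((uri . local-part)
     (let ((abbrev (uri->abbrev uri namespaces)))
       (if abbrev
           (symbol-append abbrev (string->symbol ":") local-part)
           ;; Assumption: fall back to the bare local part when the
           ;; URI has no recorded abbreviation.
           local-part)))
    ;; Names without a namespace are plain symbols; keep them as-is.
    (_ name)))

(qualified-name (cons atom-uri 'feed) example-namespaces)
;; => atom:feed
```
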
(This is not very efficient because we need to traverse the list of
namespaces every time.  Maybe we could memoize this.  On the other hand,
the length of the namespaces list may not be large enough to affect
performance too much.)

In the end we get both namespace list and SXML tree from running the
parser.  Before wrapping this up in *TOP* we generate xmlns attributes
for all abbreviations and “patch” the first proper element’s attribute
list (i.e. we skip over a *PI* element if it exists).

The result is an SXML tree that begins with namespace declarations,
mapping abbreviations to URIs.  Within the SXML tree we’re only using
abbreviations, so there are no more invalid characters when converting
SXML to a string.

I would be happy if you could test this as I’m not 100% confident that
this is correct.  Here are questions I wasn’t able to answer
conclusively:

* Is the value for “namespaces” that’s passed in to the FINISH-ELEMENT
  procedure always the same?

* Will the second return value of the final call to FINISH-ELEMENT
  really always be the complete list of *all* namespaces that have been
  encountered?

* Are there valid XML documents for which the match patterns to inject
  namespace declarations would not apply?  (e.g. documents with a PI
  element and two separate XML trees)

--
Ricardo

--=-=-=
Content-Type: text/x-patch
Content-Disposition: inline; filename=0001-sxml-xml-sxml-Record-and-use-namespace-abbreviations.patch

>From 83ee9de18a0ecaa237eb73e1b75d0b21e3e8d321 Mon Sep 17 00:00:00 2001
From: Ricardo Wurmus
Date: Mon, 4 Feb 2019 21:39:06 +0100
Subject: [PATCH] sxml: xml->sxml: Record and use namespace abbreviations.

* module/sxml/simple.scm (xml->sxml): Add namespace declarations to
the attribute list of the first XML element.
[name->sxml]: Accept namespaces argument to look up abbreviation.
Return name with abbreviation prefix.
[parser]: Let FINISH-ELEMENT procedure return namespaces in addition to
SXML tree.
---
 module/sxml/simple.scm | 50 +++++++++++++++++++++++++++++++++---------
 1 file changed, 40 insertions(+), 10 deletions(-)

diff --git a/module/sxml/simple.scm b/module/sxml/simple.scm
index 703ad9137..52dd9af12 100644
--- a/module/sxml/simple.scm
+++ b/module/sxml/simple.scm
@@ -1,7 +1,8 @@
 ;;;; (sxml simple) -- a simple interface to the SSAX parser
 ;;;;
-;;;; Copyright (C) 2009, 2010, 2013 Free Software Foundation, Inc.
+;;;; Copyright (C) 2009, 2010, 2013, 2019 Free Software Foundation, Inc.
 ;;;; Modified 2004 by Andy Wingo .
+;;;; Modified 2019 by Ricardo Wurmus .
 ;;;; Originally written by Oleg Kiselyov as SXML-to-HTML.scm.
 ;;;;
 ;;;; This library is free software; you can redistribute it and/or
@@ -30,6 +31,7 @@
   #:use-module (sxml ssax)
   #:use-module (sxml transform)
   #:use-module (ice-9 match)
+  #:use-module (srfi srfi-1)
   #:use-module (srfi srfi-13)
   #:export (xml->sxml sxml->xml sxml->string))
 
@@ -123,10 +125,15 @@ port."
             (acons '*DEFAULT* default-entity-handler entities)
             entities))
 
-    (define (name->sxml name)
+    (define (name->sxml name namespaces)
       (match name
         ((prefix . local-part)
-         (symbol-append prefix (string->symbol ":") local-part))
+         (let ((abbrev (and=> (find (match-lambda
+                                      ((abbrev uri . rest)
+                                       (and (eq? uri prefix) abbrev)))
+                                    namespaces)
+                              first)))
+           (symbol-append abbrev (string->symbol ":") local-part)))
         (_ name)))
 
     (define (doctype-continuation seed)
@@ -152,14 +159,16 @@ port."
                          (ssax:reverse-collect-str seed)))
                (attrs (attlist-fold
                        (lambda (attr accum)
-                         (cons (list (name->sxml (car attr)) (cdr attr))
+                         (cons (list (name->sxml (car attr) namespaces)
+                                     (cdr attr))
                                accum))
                        '() attributes)))
-           (acons (name->sxml elem-gi)
-                  (if (null? attrs)
-                      seed
-                      (cons (cons '@ attrs) seed))
-                  parent-seed)))
+           (values (acons (name->sxml elem-gi namespaces)
+                          (if (null? attrs)
+                              seed
+                              (cons (cons '@ attrs) seed))
+                          parent-seed)
+                   namespaces)))
 
        CHAR-DATA-HANDLER ; fhere
        (lambda (string1 string2 seed)
@@ -212,7 +221,28 @@ port."
     (let* ((port (if (string? string-or-port)
                      (open-input-string string-or-port)
                      string-or-port))
-           (elements (reverse (parser port '()))))
+           (elements (call-with-values
+                         (lambda () (parser port '()))
+                       (lambda (elements namespaces)
+                         ;; Generate namespace declarations mapping
+                         ;; abbreviations to URLs.
+                         (let ((ns-declarations
+                                (filter-map (match-lambda
+                                              (('*DEFAULT* . _) #f)
+                                              ((abbrev uri . _)
+                                               (list (symbol-append 'xmlns: abbrev)
+                                                     (symbol->string uri))))
+                                            namespaces)))
+                           ;; Inject namespace declarations into the first
+                           ;; proper element.
+                           (match (reverse elements)
+                             (((and pi-elem ('*PI* . _))
+                               (tag ('@ . attrs) . children))
+                              `(,pi-elem (,tag (@ ,@ns-declarations ,@attrs)
+                                               ,@children)))
+                             (((tag ('@ . attrs) . children))
+                              `((,tag (@ ,@ns-declarations ,@attrs)
+                                      ,@children)))))))))
       `(*TOP* ,@elements)))
 
 (define check-name
-- 
2.20.1

--=-=-=--