Hello!

I just looked at this again and I think I came up with something useful.
Here’s some context:

Andy Wingo <wingo@pobox.com> writes:

> Hi :)
>
> On Wed 13 Jul 2016 15:24, tomas@tuxteam.de writes:
>
>> Referring to Oleg Kiseliov's paper [1], there are actually three
>> things involved:
>
> This summary is helpful, thanks.
>
>> What is missing? From my point of view:
>>
>>  - At xml->sxml time, the user doesn't know which namespaces
>>    are in the xml. So it would be nice if the XML parser
>>    could provide that.
>
> For some documents you do know, of course.
>
> And for larger perspective, I think that SSAX gives you all the tools
> you need to build specialist and very flexible XML parsers.  So to an
> extent solving the general problem isn't necessary -- we can always
> point people to SSAX.  But that's a bit rude ;) so if there are common
> patterns we should try to capture them in xml->sxml.  I see this bug as
> being a search for those patterns, but without the requirement of
> solving the problem in its most general form.
>
>>  - It would be super-nice if the XML parser could put that
>>    into the same nodes it found it, as described in [1]
>>    (i.e. in the (*NAMESPACES* ...) pseudo-attribute).
>>    This way we wouldn't have a global mapping, but one
>>    that resembles the original XML, even with the same
>>    prefixes. Less surprises overall. The round trip
>>    xml -> sxml -> xml would be (nearly) the identity.
>>
>>    With Ricardo's patch it would lump all the namespace
>>    declarations up in the top node, which formally is
>>    correct, but might scare XML people a bit :-)
>
> ACK.
>
>>  - At sxml->xml time there should be a way to somehow
>>    generate prefixes for "new" namespaces. I don't know
>>    at the moment how this would work, that depends on
>>    how the user is supposed to insert new nodes in the
>>    SXML. Does she specify the namespace? Both prefix
>>    (aka namespace-id, under my current assumption) *and*
>>    namespace? (note that the namespace-id/prefix alone
>>    wouldn't be sufficient).
>
> ACK.
>
> What do you think the next step is?  I am happy to wait FWIW, dunno if
> Ricardo has any feelings here.

Attached is a patch that does the requested things.  The parser
procedures like FINISH-ELEMENT have access to all the namespaces, so I
changed the FINISH-ELEMENT procedure to return the list of namespaces
in addition to its SXML tree return value.

I changed name->sxml to use only the namespace aliases / abbreviations
instead of the namespace URIs.  (This is not very efficient because we
need to traverse the list of namespaces every time.  Maybe we could
memoize this.  On the other hand, the length of the namespaces list may
not be large enough to affect performance too much.)
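
To make the lookup cost concrete, here is a minimal sketch of the idea
(not the code from the patch).  It assumes a simplified namespace list,
an alist of (prefix . uri) pairs with symbol prefixes and string URIs;
the real SSAX list has a richer shape.  The names uri->prefix and
make-memoized-lookup are made up for illustration.

```scheme
(use-modules (srfi srfi-1))   ; for `find'

;; Linear scan: return the first prefix bound to URI, or #f.
;; NAMESPACES is assumed to be an alist of (prefix . uri) pairs.
(define (uri->prefix namespaces uri)
  (let ((entry (find (lambda (ns) (string=? (cdr ns) uri))
                     namespaces)))
    (and entry (car entry))))

;; Hypothetical memoizing wrapper: scan the list at most once per
;; known URI and answer later queries from a hash table.
(define (make-memoized-lookup namespaces)
  (let ((cache (make-hash-table)))
    (lambda (uri)
      (or (hash-ref cache uri)
          (let ((prefix (uri->prefix namespaces uri)))
            (when prefix
              (hash-set! cache uri prefix))
            prefix)))))
```

A cache like this is only safe if the bindings never change during the
parse.  XML allows a prefix to be redeclared on a nested element, so
the memoized variant depends on the answer to the first question below.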

In the end we get both namespace list and SXML tree from running the
parser.  Before wrapping this up in *TOP* we generate xmlns attributes
for all abbreviations and “patch” the first proper element’s attribute
list (i.e. we skip over a *PI* element if it exists).
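
The patching step can be sketched roughly like this.  Again, this is an
illustration and not the patch itself: it assumes the same simplified
(prefix . uri) alist, and the procedure names are hypothetical.

```scheme
(use-modules (ice-9 match))

;; ((svg . "http://...")) -> ((xmlns:svg "http://..."))
(define (namespaces->attributes namespaces)
  (map (lambda (ns)
         (list (string->symbol
                (string-append "xmlns:" (symbol->string (car ns))))
               (cdr ns)))
       namespaces))

;; Prepend ATTRS to an element's attribute list, creating the
;; (@ ...) node if the element has none.
(define (add-attributes element attrs)
  (match element
    ((tag ('@ . old) . children)
     `(,tag (@ ,@attrs ,@old) ,@children))
    ((tag . children)
     `(,tag (@ ,@attrs) ,@children))))

;; Patch the first list-shaped node that is not a *PI* node; leave
;; everything before and after it untouched.
(define (declare-namespaces nodes namespaces)
  (let ((attrs (namespaces->attributes namespaces)))
    (let loop ((nodes nodes))
      (match nodes
        (() '())
        (((and pi ('*PI* . _)) . rest)
         (cons pi (loop rest)))
        (((? pair? first) . rest)
         (cons (add-attributes first attrs) rest))
        ((other . rest)               ; e.g. a stray string
         (cons other (loop rest)))))))

;; (declare-namespaces
;;  '((*PI* xml "version=\"1.0\"") (svg:svg (svg:rect)))
;;  '((svg . "http://www.w3.org/2000/svg")))
;; => ((*PI* xml "version=\"1.0\"")
;;     (svg:svg (@ (xmlns:svg "http://www.w3.org/2000/svg"))
;;              (svg:rect)))
```

Only *PI* nodes are skipped here; any other list-shaped node that came
first would be treated as the element to patch.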

The result is an SXML tree that begins with namespace declarations,
mapping abbreviations to URIs.  Within the SXML tree we’re only using
abbreviations, so there are no more invalid characters when converting
SXML to a string.

I would be happy if you could test this as I’m not 100% confident that
this is correct.  Here are questions I wasn’t able to answer
conclusively:

* Is the value for “namespaces” that’s passed in to the
  FINISH-ELEMENT procedure always the same?

* Will the second return value of the final call to FINISH-ELEMENT
  really always be the complete list of *all* namespaces that have been
  encountered?

* Are there valid XML documents for which the match patterns to inject
  namespace declarations would not apply?  (e.g. documents with a PI
  element and two separate XML trees)

--
Ricardo