unofficial mirror of emacs-devel@gnu.org 
* Drop toplevel XML-comments in libxml-parse-(xml|html)-region?
@ 2014-10-28 20:36 Ulf Jasper
  2014-11-11 16:28 ` Lars Magne Ingebrigtsen
  0 siblings, 1 reply; 9+ messages in thread
From: Ulf Jasper @ 2014-10-28 20:36 UTC (permalink / raw)
  To: emacs-devel

Hi,

parse_region from xml.c, which is called by `libxml-parse-xml-region'
and `libxml-parse-html-region', makes some effort to retain top-level
comments in xml documents.  If necessary it adds an artificial node at
the top of the parse tree.  As a consequence one has to check whether
the result contains the "top" node or not (see below for an example).
This behaviour is different from that of `xml-parse-region' (from
xml.el), which just discards the toplevel comments.

Can we make `libxml-parse-(xml|html)-region' consistent with
`xml-parse-region', i.e. can we drop the toplevel xml comments (and
simply call xmlDocGetRootElement)?

Ulf

----------------------------------------------------------------------
Example: Calling (libxml-parse-xml-region (point-min) (point-max)) on



<?xml version="1.0" encoding="UTF-8"?>
<foo>bar</foo>
<!--ignore me-->

results in

    (top nil (foo nil "bar") (comment nil "ignore me"))

while for



<?xml version="1.0" encoding="UTF-8"?>
<foo>bar</foo>

one gets

    (foo nil "bar")

without the artificial node "top".
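The check described above could be wrapped in a small helper so that callers always receive the root element. This is only a sketch: `my-xml-root-element' is a hypothetical name, not an existing Emacs function, and it assumes the tree shapes shown in the two examples.

```elisp
;; Hypothetical helper, not part of Emacs.  It assumes parse results of
;; the form (TAG ATTRIBUTES CHILDREN...) as shown in the examples above.
;; Caveat: a document whose real root element is named "top" cannot be
;; told apart from the artificial wrapper by this test.
(defun my-xml-root-element (tree)
  "Return the root element of the parse result TREE.
If TREE is wrapped in an artificial `top' node, return its first
child that is a list and not a `comment' node.  Otherwise return
TREE unchanged."
  (if (and (consp tree) (eq (car tree) 'top))
      (let ((children (cddr tree))
            root)
        ;; Walk the children until the first non-comment element.
        (while (and children (not root))
          (let ((child (car children)))
            (when (and (consp child)
                       (not (eq (car child) 'comment)))
              (setq root child)))
          (setq children (cdr children)))
        root)
    tree))
```

With this, both example results above reduce to the same `foo' element.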



* Re: Drop toplevel XML-comments in libxml-parse-(xml|html)-region?
  2014-10-28 20:36 Drop toplevel XML-comments in libxml-parse-(xml|html)-region? Ulf Jasper
@ 2014-11-11 16:28 ` Lars Magne Ingebrigtsen
       [not found]   ` <87tx25o8pu.fsf@web.de>
  2014-11-11 21:40   ` Stefan Monnier
  0 siblings, 2 replies; 9+ messages in thread
From: Lars Magne Ingebrigtsen @ 2014-11-11 16:28 UTC (permalink / raw)
  To: Ulf Jasper; +Cc: emacs-devel

Ulf Jasper <ulf.jasper@web.de> writes:

> parse_region from xml.c, which is called by `libxml-parse-xml-region'
> and `libxml-parse-html-region', makes some effort to retain top-level
> comments in xml documents.  If necessary it adds an artificial node at
> the top of the parse tree.  As a consequence one has to check whether
> the result contains the "top" node or not (see below for an example).
> This behaviour is different from that of `xml-parse-region' (from
> xml.el), which just discards the toplevel comments.
>
> Can we make `libxml-parse-(xml|html)-region' consistent with
> `xml-parse-region', i.e. can we drop the toplevel xml comments (and
> simply call xmlDocGetRootElement)?

I have no opinion on this, but this was added to the libxml code to make
it possible to re-generate XML documents as is, which is not possible
with the way `xml-parse-region' discards top-level comments.

So I don't know what the right fix here is.  On the one hand, it is
(perhaps) surprising that comments are preserved (at all, anywhere) in
the structure returned by the parser.  However, stashing data that is to
be further parsed by the HTML engine is a common feature that must be
preserved.

If we preserve comments further down in the DOM, then not preserving
them at the top level seems inconsistent.

But perhaps that inconsistency is fine?

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no




* Re: Drop toplevel XML-comments in libxml-parse-(xml|html)-region?
       [not found]   ` <87tx25o8pu.fsf@web.de>
@ 2014-11-11 19:13     ` Lars Magne Ingebrigtsen
  2014-11-11 19:29     ` Ulf Jasper
  1 sibling, 0 replies; 9+ messages in thread
From: Lars Magne Ingebrigtsen @ 2014-11-11 19:13 UTC (permalink / raw)
  To: Ulf Jasper; +Cc: emacs-devel

Ulf Jasper <ulf.jasper@web.de> writes:

> Out of interest:  Why preserve xml/html comments at all except for
> re-generating an XML document?

In HTML it's necessary because you often include things that "aren't
supposed" to be parsed in comments.

Like

<!--
<script>
foo()
</script>
-->

For the JS stuff, it doesn't really matter much to us since we don't
execute it, but when doing web scraping to get data from a page, you
sometimes see commented-out bits that have to be re-parsed as HTML to get
the bit you're looking for.
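
A rough sketch of that scraping step: collect the text of every comment node in a parse result, then feed each string back through the HTML parser. Both function names are hypothetical, and the tree shape (comment nil "text") is taken from the examples elsewhere in this thread.

```elisp
;; Hypothetical helpers, not part of Emacs.  They assume parse results
;; of the form (TAG ATTRIBUTES CHILDREN...), with comments represented
;; as (comment nil "text").
(defun my-dom-comment-strings (node)
  "Return the text of every `comment' node under NODE, in document order."
  (cond
   ((not (consp node)) nil)             ; text node: nothing to collect
   ((eq (car node) 'comment)
    (list (mapconcat #'identity (cddr node) "")))
   (t (apply #'append
             (mapcar #'my-dom-comment-strings (cddr node))))))

(defun my-html-parse-string (html)
  "Parse the string HTML with libxml and return the resulting tree.
Requires an Emacs built with libxml2 support."
  (with-temp-buffer
    (insert html)
    (libxml-parse-html-region (point-min) (point-max))))
```

Re-parsing all commented-out fragments of a tree DOM would then be (mapcar #'my-html-parse-string (my-dom-comment-strings DOM)).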

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no




* Re: Drop toplevel XML-comments in libxml-parse-(xml|html)-region?
       [not found]   ` <87tx25o8pu.fsf@web.de>
  2014-11-11 19:13     ` Lars Magne Ingebrigtsen
@ 2014-11-11 19:29     ` Ulf Jasper
  1 sibling, 0 replies; 9+ messages in thread
From: Ulf Jasper @ 2014-11-11 19:29 UTC (permalink / raw)
  To: emacs-devel

[Hit the wrong key: Reply instead of Follow-up]

Lars Magne Ingebrigtsen <larsi@gnus.org> writes:

> I have no opinion on this, but this was added to the libxml code to make
> it possible to re-generate XML documents as is, which is not possible
> with the way `xml-parse-region' discards top-level comments.
>
> So I don't know what the right fix here is.  On the one hand, it is
> (perhaps) surprising that comments are preserved (at all, anywhere) in
> the structure returned by the parser.  However, stashing data that is to
> be further parsed by the HTML engine is a common feature that must be
> preserved.
>
> If we preserve comments further down in the DOM, then not preserving
> them at the top level seems inconsistent.
>
> But perhaps that inconsistency is fine?

If comments are to be preserved then they should be preserved
everywhere.  Agreed.  So we leave libxml-parse-(xml|html)-region
unchanged.

Out of interest:  Why preserve xml/html comments at all except for
re-generating an XML document?




* Re: Drop toplevel XML-comments in libxml-parse-(xml|html)-region?
  2014-11-11 16:28 ` Lars Magne Ingebrigtsen
       [not found]   ` <87tx25o8pu.fsf@web.de>
@ 2014-11-11 21:40   ` Stefan Monnier
  2014-11-11 21:48     ` Lars Magne Ingebrigtsen
  1 sibling, 1 reply; 9+ messages in thread
From: Stefan Monnier @ 2014-11-11 21:40 UTC (permalink / raw)
  To: Lars Magne Ingebrigtsen; +Cc: Ulf Jasper, emacs-devel

> So I don't know what the right fix here is.

How 'bout:
- add an optional argument to request the result be stripped of its comments.
or
- add a function which takes a parse result and strips it of its comments.
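
The second option might look roughly like the following. It is only a sketch under stated assumptions: `my-xml-strip-comments' is a hypothetical name, and the code assumes parse results of the form (TAG ATTRIBUTES CHILDREN...) with comments represented as (comment nil "text").

```elisp
;; Hypothetical function, not part of Emacs.
;; Caveat: an element that is really named "comment" has the same
;; representation in the tree and would be removed as well.
(defun my-xml-strip-comments (node)
  "Return a copy of the parse result NODE without `comment' nodes.
Comments are removed at every level.  Text nodes (strings) are
returned unchanged."
  (if (not (consp node))
      node
    (let (children)
      (dolist (child (cddr node))
        (unless (and (consp child) (eq (car child) 'comment))
          (push (my-xml-strip-comments child) children)))
      ;; Rebuild the node: tag, attributes, filtered children.
      (append (list (car node) (cadr node))
              (nreverse children)))))
```

Note that this leaves an artificial `top' wrapper in place if the parser added one, so a caller would still have to unwrap it.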


        Stefan




* Re: Drop toplevel XML-comments in libxml-parse-(xml|html)-region?
  2014-11-11 21:40   ` Stefan Monnier
@ 2014-11-11 21:48     ` Lars Magne Ingebrigtsen
  2014-11-11 21:52       ` Lars Magne Ingebrigtsen
  0 siblings, 1 reply; 9+ messages in thread
From: Lars Magne Ingebrigtsen @ 2014-11-11 21:48 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: Ulf Jasper, emacs-devel

Stefan Monnier <monnier@iro.umontreal.ca> writes:

> How 'bout:
> - add an optional argument to request the result be stripped of its comments.
> or
> - add a function which takes a parse result and strips it of its comments.

Yes, that would make sense.

But the only problematic comment is the top-level one, because that
makes the structure different than if the comment wasn't there.  Perhaps
we could just cheat and push any top-level comments one step down in the
DOM?  I mean, it's gross, but I don't think anybody would actually
notice in real life.

We could even document that that's what it does, and everybody would
have to be satisfied.  >"?

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no




* Re: Drop toplevel XML-comments in libxml-parse-(xml|html)-region?
  2014-11-11 21:48     ` Lars Magne Ingebrigtsen
@ 2014-11-11 21:52       ` Lars Magne Ingebrigtsen
  2014-11-12 20:24         ` Ulf Jasper
  0 siblings, 1 reply; 9+ messages in thread
From: Lars Magne Ingebrigtsen @ 2014-11-11 21:52 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: Ulf Jasper, emacs-devel

Lars Magne Ingebrigtsen <larsi@gnus.org> writes:

> But the only problematic comment is the top-level one, because that
> makes the structure different than if the comment wasn't there.  Perhaps
> we could just cheat and push any top-level comments one step down in the
> DOM?  I mean, it's gross, but I don't think anybody would actually
> notice in real life.

It would mean that

<?xml version="1.0" encoding="UTF-8"?>
  <foo>bar</foo>
<!--ignore me-->

would turn into

<?xml version="1.0" encoding="UTF-8"?>
  <foo>bar</foo>
  <!--ignore me-->

where indentation represents where in the DOM the element appears.  Sort
of.
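
Expressed on the parse tree rather than on the XML text, the idea could be sketched as below. The function name is hypothetical and this is a Lisp-level illustration only, not the change to xml.c itself. It assumes the (TAG ATTRIBUTES CHILDREN...) tree shape with an artificial `top' wrapper.

```elisp
;; Hypothetical illustration, not part of Emacs.
(defun my-xml-push-down-top-comments (tree)
  "Move top-level comments in TREE into the root element.
Comments before the root element become its first children and
comments after it become its last children.  TREE is returned
unchanged if it has no `top' wrapper or no root element."
  (if (not (and (consp tree) (eq (car tree) 'top)))
      tree
    (let (before after root)
      (dolist (child (cddr tree))
        (cond
         ((and (consp child) (eq (car child) 'comment))
          (if root (push child after) (push child before)))
         ((and (consp child) (not root))
          (setq root child))))
      (if (not root)
          tree
        (append (list (car root) (cadr root))
                (nreverse before)
                (cddr root)
                (nreverse after))))))
```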

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no




* Re: Drop toplevel XML-comments in libxml-parse-(xml|html)-region?
  2014-11-11 21:52       ` Lars Magne Ingebrigtsen
@ 2014-11-12 20:24         ` Ulf Jasper
  2014-11-21 15:49           ` Ulf Jasper
  0 siblings, 1 reply; 9+ messages in thread
From: Ulf Jasper @ 2014-11-12 20:24 UTC (permalink / raw)
  To: Lars Magne Ingebrigtsen; +Cc: Stefan Monnier, emacs-devel

Lars Magne Ingebrigtsen <larsi@gnus.org> writes:

> Lars Magne Ingebrigtsen <larsi@gnus.org> writes:
>
>> But the only problematic comment is the top-level one, because that
>> makes the structure different than if the comment wasn't there.  Perhaps
>> we could just cheat and push any top-level comments one step down in the
>> DOM?  I mean, it's gross, but I don't think anybody would actually
>> notice in real life.
>
> It would mean that
>
> <?xml version="1.0" encoding="UTF-8"?>
>   <foo>bar</foo>
> <!--ignore me-->
>
> would turn into
>
> <?xml version="1.0" encoding="UTF-8"?>
>   <foo>bar</foo>
>   <!--ignore me-->
>
> where indentation represents where in the DOM the element appears.  Sort
> of.

In that case we would get

  (foo nil (comment nil "level 0")
           (comment nil "level 1")
           (bar nil (comment nil "level 2")))

instead of

  (top nil (comment nil "level 0")
           (foo nil (comment nil "level 1")
                    (bar nil (comment nil "level 2"))))

for this xml



<?xml version="1.0" encoding="UTF-8"?>
<!--level 0-->
<foo><!--level 1-->
  <bar><!--level 2-->
  </bar>
</foo>


That would work but seems a bit strange.  I think I would vote for
Stefan's idea to introduce an optional parameter which controls removal
of all comments.  That would result in

  (foo nil (bar nil))

which does not look so bad.



* Re: Drop toplevel XML-comments in libxml-parse-(xml|html)-region?
  2014-11-12 20:24         ` Ulf Jasper
@ 2014-11-21 15:49           ` Ulf Jasper
  0 siblings, 0 replies; 9+ messages in thread
From: Ulf Jasper @ 2014-11-21 15:49 UTC (permalink / raw)
  To: Lars Magne Ingebrigtsen; +Cc: Stefan Monnier, emacs-devel

Ulf Jasper <ulf.jasper@web.de> writes:

> I think I would vote for Stefan's idea to introduce an optional
> parameter which controls removal of all comments.

I added a new optional parameter 'discard-comments' to
'libxml-parse-(html|xml)-region'.  Just pushed it upstream.
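
A minimal usage sketch, assuming an Emacs built with libxml2 that includes this change, and assuming the new parameter is the fourth argument, after BASE-URL. The wrapper name `my-parse-xml-string' is hypothetical.

```elisp
;; Hypothetical wrapper, not part of Emacs.  The argument order
;; (START END BASE-URL DISCARD-COMMENTS) is an assumption; check the
;; docstring of `libxml-parse-xml-region' in your Emacs.
(defun my-parse-xml-string (xml &optional discard-comments)
  "Parse the string XML with libxml and return the parse tree.
If DISCARD-COMMENTS is non-nil, ask the parser to drop comments."
  (with-temp-buffer
    (insert xml)
    (libxml-parse-xml-region (point-min) (point-max)
                             nil discard-comments)))

;; Example call on the document from the start of the thread:
;; (my-parse-xml-string "<foo>bar</foo>\n<!--ignore me-->\n" t)
```

With comments discarded there is no reason for the parser to add the artificial `top' node, so the caller should get the root element directly.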

Ulf


