unofficial mirror of guile-devel@gnu.org 
* [guile-lib] html->sxml does not decode entities in attributes
@ 2025-02-04 20:58 Tomas Volf
  2025-02-05  4:58 ` Felix Lechner via Developers list for Guile, the GNU extensibility library
  2025-02-05 11:34 ` Dr. Arne Babenhauserheide
  0 siblings, 2 replies; 6+ messages in thread
From: Tomas Volf @ 2025-02-04 20:58 UTC
  To: guile-devel



Hello,

I think I found a bug in the htmlprag module in guile-lib.  When parsing
attributes, the values are not properly decoded:

--8<---------------cut here---------------start------------->8---
scheme@(guile-user)> ,use (htmlprag)
scheme@(guile-user)> (html->sxml "<hr aaa=\"bbb&quot;ccc'ddd\" />")
$1 = (*TOP* (hr (@ (aaa "bbb&quot;ccc'ddd"))))
scheme@(guile-user)> (html->sxml "<a href=\"a&amp;b\" />")
$2 = (*TOP* (a (@ (href "a&amp;b"))))
--8<---------------cut here---------------end--------------->8---

I think that $1 should be "bbb\"ccc'ddd" and $2 should be "a&b".

The annoying part is that this cannot really be changed now, because
people (me included) already have workarounds in place, and
automatically decoding now would lead to double decoding.

I see a few ways forward:

1. Document the current behavior and keep it as it is.
2. Add an argument #:decode-attributes, defaulting to #f, to the relevant
   procedures, so that people can opt into the fixed behavior.
3. Introduce a parameter %decode-attributes, so that people can opt into
   the fixed behavior.

I am sure there are also other approaches possible.
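
For illustration, option 3 might look roughly like the sketch below.  It
is not htmlprag code: %decode-attributes is only the name proposed above,
and decode-entities, attribute-value and the small entity table are
hypothetical helpers that cover just five named entities.

```scheme
;; Hypothetical sketch of option 3; none of this exists in htmlprag.

;; Opt-in switch, defaulting to the current (non-decoding) behavior.
(define %decode-attributes (make-parameter #f))

;; Deliberately tiny entity table, for illustration only.
(define %entities
  '(("amp" . "&") ("lt" . "<") ("gt" . ">") ("quot" . "\"") ("apos" . "'")))

;; Decode the known named entities in STR in one left-to-right pass.
;; Unknown entities and bare ampersands are copied through unchanged.
(define (decode-entities str)
  (let loop ((start 0) (acc '()))
    (let ((amp (string-index str #\& start)))
      (if (not amp)
          (string-concatenate-reverse (cons (substring str start) acc))
          (let* ((semi (string-index str #\; amp))
                 (repl (and semi
                            (assoc-ref %entities
                                       (substring str (+ amp 1) semi)))))
            (if repl
                (loop (+ semi 1)
                      (cons* repl (substring str start amp) acc))
                (loop (+ amp 1)
                      (cons (substring str start (+ amp 1)) acc))))))))

;; What an attribute scanner could call on each raw attribute value.
(define (attribute-value raw)
  (if (%decode-attributes)
      (decode-entities raw)
      raw))

(attribute-value "a&amp;b")                 ; => "a&amp;b" (unchanged)
(parameterize ((%decode-attributes #t))
  (attribute-value "bbb&quot;ccc'ddd"))     ; => "bbb\"ccc'ddd"
```

Because the parameter defaults to #f, existing callers that already decode
by hand would see no change unless they opt in with parameterize.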

Have a nice day,
Tomas

-- 
There are only two hard things in Computer Science:
cache invalidation, naming things and off-by-one errors.


* Re: [guile-lib] html->sxml does not decode entities in attributes
  2025-02-04 20:58 [guile-lib] html->sxml does not decode entities in attributes Tomas Volf
@ 2025-02-05  4:58 ` Felix Lechner via Developers list for Guile, the GNU extensibility library
  2025-02-05 11:38   ` Dr. Arne Babenhauserheide
  2025-02-05 18:03   ` Tomas Volf
  2025-02-05 11:34 ` Dr. Arne Babenhauserheide
  1 sibling, 2 replies; 6+ messages in thread
From: Felix Lechner via Developers list for Guile, the GNU extensibility library @ 2025-02-05  4:58 UTC
  To: guile-devel

Hi Tomas,

On Tue, Feb 04 2025, Tomas Volf wrote:

> automatically decoding now would lead to double decoding.

Will a second decoding step for HTML entities, which is the most likely
workaround, mess up strings like "a&b" or "bbb\"ccc'ddd" ?

Kind regards
Felix




* Re: [guile-lib] html->sxml does not decode entities in attributes
  2025-02-04 20:58 [guile-lib] html->sxml does not decode entities in attributes Tomas Volf
  2025-02-05  4:58 ` Felix Lechner via Developers list for Guile, the GNU extensibility library
@ 2025-02-05 11:34 ` Dr. Arne Babenhauserheide
  2025-02-05 17:57   ` Tomas Volf
  1 sibling, 1 reply; 6+ messages in thread
From: Dr. Arne Babenhauserheide @ 2025-02-05 11:34 UTC
  To: guile-devel


Tomas Volf <~@wolfsden.cz> writes:
> I think I found a bug in the htmlprag module in guile-lib.  When parsing
> attributes, the values are not properly decoded:

Thank you for the report!

> --8<---------------cut here---------------start------------->8---
> scheme@(guile-user)> ,use (htmlprag)
> scheme@(guile-user)> (html->sxml "<hr aaa=\"bbb&quot;ccc'ddd\" />")
> $1 = (*TOP* (hr (@ (aaa "bbb&quot;ccc'ddd"))))
> scheme@(guile-user)> (html->sxml "<a href=\"a&amp;b\" />")
> $2 = (*TOP* (a (@ (href "a&amp;b"))))
> --8<---------------cut here---------------end--------------->8---
>
> I think that $1 should be "bbb\"ccc'ddd" and $2 should be "a&b".

The reverse direction (sxml->html) does encode, so the round trip is
broken and this is definitely a bug:

> ,use (htmlprag)
> (html->sxml "<hr aaa=\"bbb&quot;ccc'ddd\" />")
$1 = (*TOP* (hr (@ (aaa "bbb&quot;ccc'ddd"))))
> (sxml->html '(*TOP* (hr (@ (aaa "bbb&quot;ccc'ddd")))))
$2 = "<hr aaa=\"bbb&quot;ccc'ddd\" />"
> (sxml->html '(*TOP* (hr (@ (aaa "bbb\"ccc'ddd")))))
$3 = "<hr aaa=\"bbb&quot;ccc'ddd\" />"

> (html->sxml (sxml->html '(*TOP* (hr (@ (aaa "bbb\"ccc'ddd"))))))
$4 = (*TOP* (hr (@ (aaa "bbb&quot;ccc'ddd"))))
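
The same check can be wrapped in a small predicate.  round-trips? is a
hypothetical helper, not part of htmlprag; it relies only on html->sxml
and sxml->html as they are used in the transcript above.

```scheme
(use-modules (htmlprag))

;; Hypothetical helper: does SXML survive serializing and re-parsing?
(define (round-trips? sxml)
  (equal? sxml (html->sxml (sxml->html sxml))))

;; Given the $4 result above, this should be #f for as long as
;; attribute values are not decoded on the way back in.
(round-trips? '(*TOP* (hr (@ (aaa "bbb\"ccc'ddd")))))
```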

> I see a few ways forward:
>
> 1. Document the current behavior and keep it as it is.
> 2. Add an argument #:decode-attributes, defaulting to #f, to the relevant
>    procedures, so that people can opt into the fixed behavior.
> 3. Introduce a parameter %decode-attributes, so that people can opt into
>    the fixed behavior.
>
> I am sure there are also other approaches possible.

Since htmlprag already uses parameters for customization
(%strict-tokenizer?), option 3 sounds best to me.

http://git.savannah.nongnu.org/gitweb/?p=guile-lib.git;a=blob;f=src/htmlprag.scm;h=79a7b2f33b0755474bfc015912c01bdf6c676a15;hb=HEAD#l44

(but I’m not the maintainer, so others may have a different opinion)

Can you create a patch?

Best wishes,
Arne
-- 
Being unpolitical
means being political
without noticing it.
draketo.de


* Re: [guile-lib] html->sxml does not decode entities in attributes
  2025-02-05  4:58 ` Felix Lechner via Developers list for Guile, the GNU extensibility library
@ 2025-02-05 11:38   ` Dr. Arne Babenhauserheide
  2025-02-05 18:03   ` Tomas Volf
  1 sibling, 0 replies; 6+ messages in thread
From: Dr. Arne Babenhauserheide @ 2025-02-05 11:38 UTC
  To: Felix Lechner via Developers list for Guile, the GNU extensibility library
  Cc: Felix Lechner


Hi Felix,

Felix Lechner via "Developers list for Guile, the GNU extensibility library" <guile-devel@gnu.org> writes:
> On Tue, Feb 04 2025, Tomas Volf wrote:
>> automatically decoding now would lead to double decoding.
>
> Will a second decoding step for HTML entities, which is the most likely
> workaround, mess up strings like "a&b" or "bbb\"ccc'ddd" ?

Yes, for example when an attribute is meant to carry data that is itself
entity-encoded: after a second decode, that data can no longer be
represented at all.

data-* attributes are nowadays used to carry actual data, which is read
via dataset, so attributes can contain arbitrary data:
https://developer.mozilla.org/en/docs/Web/HTML/Global_attributes/data-*

Best wishes,
Arne
-- 
Being unpolitical
means being political
without noticing it.
draketo.de


* Re: [guile-lib] html->sxml does not decode entities in attributes
  2025-02-05 11:34 ` Dr. Arne Babenhauserheide
@ 2025-02-05 17:57   ` Tomas Volf
  0 siblings, 0 replies; 6+ messages in thread
From: Tomas Volf @ 2025-02-05 17:57 UTC
  To: Dr. Arne Babenhauserheide; +Cc: guile-devel


"Dr. Arne Babenhauserheide" <arne_bab@web.de> writes:

>> I see a few ways forward:
>>
>> 1. Document the current behavior and keep it as it is.
>> 2. Add an argument #:decode-attributes, defaulting to #f, to the relevant
>>    procedures, so that people can opt into the fixed behavior.
>> 3. Introduce a parameter %decode-attributes, so that people can opt into
>>    the fixed behavior.
>>
>> I am sure there are also other approaches possible.
>
> Since htmlprag already uses parameters for customization
> (%strict-tokenizer?), option 3 sounds best to me.
>
> http://git.savannah.nongnu.org/gitweb/?p=guile-lib.git;a=blob;f=src/htmlprag.scm;h=79a7b2f33b0755474bfc015912c01bdf6c676a15;hb=HEAD#l44
>
> (but I’m not the maintainer, so others may have a different opinion)
>
> Can you create a patch?

Probably not.  I have spent 20 minutes staring at the file and do not
really have any idea where to start (ok, probably somewhere around
`scan-attr').  So I cannot really promise I will be able to work on this
(at least not soon), since I assume it will take me a long time to
figure out.

Tomas

-- 
There are only two hard things in Computer Science:
cache invalidation, naming things and off-by-one errors.


* Re: [guile-lib] html->sxml does not decode entities in attributes
  2025-02-05  4:58 ` Felix Lechner via Developers list for Guile, the GNU extensibility library
  2025-02-05 11:38   ` Dr. Arne Babenhauserheide
@ 2025-02-05 18:03   ` Tomas Volf
  1 sibling, 0 replies; 6+ messages in thread
From: Tomas Volf @ 2025-02-05 18:03 UTC
  To: Felix Lechner via Developers list for Guile, the GNU extensibility library
  Cc: Felix Lechner


Felix Lechner via "Developers list for Guile, the GNU extensibility
library" <guile-devel@gnu.org> writes:

> Hi Tomas,
>
> On Tue, Feb 04 2025, Tomas Volf wrote:
>
>> automatically decoding now would lead to double decoding.
>
> Will a second decoding step for HTML entities, which is the most likely
> workaround, mess up strings like "a&b" or "bbb\"ccc'ddd" ?

Sadly yes:

--8<---------------cut here---------------start------------->8---
scheme@(htmlprag)> (html->sxml "<a href=\"a&amp;amp;b\"></a>")
$15 = (*TOP* (a (@ (href "a&amp;amp;b"))))    ; Parsed value
scheme@(htmlprag)> (html->sxml "a&amp;amp;b")
$16 = (*TOP* "a&amp;b")                       ; First decode
scheme@(htmlprag)> (html->sxml "a&amp;b")     
$17 = (*TOP* "a&b")                           ; Second decode
--8<---------------cut here---------------end--------------->8---

So any fix needs to be opt-in.
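
The hazard can also be reproduced without the parser.  The sketch below
is a hypothetical stand-in for the kind of workaround mentioned earlier
in the thread: map-attribute-values and decode-amp are made-up names, and
decode-amp handles only &amp;.  Applying it once is what a workaround
does today; applying it twice is what would happen if the parser started
decoding as well.

```scheme
(use-modules (ice-9 match)
             (ice-9 regex))

;; Illustration-only decoder: replaces each "&amp;" with "&", nothing else.
(define (decode-amp str)
  (regexp-substitute/global #f "&amp;" str 'pre "&" 'post))

;; Apply PROC to every string attribute value in an SXML tree.
(define (map-attribute-values proc node)
  (match node
    ;; Attribute list: (@ (name "value") ...)
    (('@ attrs ...)
     (cons '@
           (map (match-lambda
                  ((name (? string? value)) (list name (proc value)))
                  (other other))
                attrs)))
    ;; Any other list: recurse into its children.
    ((? pair?)
     (map (lambda (child) (map-attribute-values proc child)) node))
    ;; Symbols and text nodes are left alone.
    (_ node)))

(define parsed '(*TOP* (a (@ (href "a&amp;amp;b")))))

(define once (map-attribute-values decode-amp parsed))
;; once  => (*TOP* (a (@ (href "a&amp;b"))))   the intended value

(define twice (map-attribute-values decode-amp once))
;; twice => (*TOP* (a (@ (href "a&b"))))       one level too many
```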

Tomas

-- 
There are only two hard things in Computer Science:
cache invalidation, naming things and off-by-one errors.
