* bug#75998: [guile-lib] html->sxml does not decode entities in attributes
@ 2025-02-01 20:10 Tomas Volf
2025-02-02 6:47 ` tomas
0 siblings, 1 reply; 11+ messages in thread
From: Tomas Volf @ 2025-02-01 20:10 UTC (permalink / raw)
To: 75998
Hello,
I think I found a bug in the htmlprag module in guile-lib. When parsing
attributes, the values are not properly decoded:
--8<---------------cut here---------------start------------->8---
scheme@(guile-user)> ,use (htmlprag)
scheme@(guile-user)> (html->sxml "<hr aaa=\"bbb&quot;ccc&apos;ddd\" />")
$1 = (*TOP* (hr (@ (aaa "bbb&quot;ccc&apos;ddd"))))
scheme@(guile-user)> (html->sxml "<a href=\"a&amp;b\" />")
$2 = (*TOP* (a (@ (href "a&amp;b"))))
--8<---------------cut here---------------end--------------->8---
I think that $1 should be "bbb\"ccc'ddd" and $2 should be "a&b".
The annoying part is that this cannot really be changed now, because
people (me included) already have workarounds in place, and
automatically decoding now would lead to double decoding.
I see a few ways forward:
1. Document the current behavior and keep it as it is.
2. Add an argument #:decode-attributes, defaulting to #f, to the relevant
   procedures, so that people can opt into the fixed behavior.
3. Introduce a parameter %decode-attributes, so that people can opt into
   the fixed behavior.
I am sure there are also other approaches possible.
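For illustration only, here is a rough sketch of the kind of
post-processing workaround mentioned above. It is not code from
guile-lib: the names `decode-entities' and `decode-attributes' are
hypothetical, and only the five predefined entities are handled. It
also shows why decoding twice is a problem.

```scheme
(use-modules (ice-9 regex) (ice-9 match))

;; Hypothetical table; covers only the five predefined entities.
(define %entities
  '(("amp" . "&") ("quot" . "\"") ("apos" . "'") ("lt" . "<") ("gt" . ">")))

;; Decode named entities in STR in one left-to-right pass.
;; Unknown entities are left untouched.
(define (decode-entities str)
  (regexp-substitute/global
   #f "&([a-z]+);" str
   'pre
   (lambda (m)
     (or (assoc-ref %entities (match:substring m 1))
         (match:substring m 0)))
   'post))

;; Walk an SXML tree and decode string values found in (@ ...) lists.
(define (decode-attributes tree)
  (match tree
    (('@ attrs ...)
     (cons '@ (map (match-lambda
                     ((name (? string? value))
                      (list name (decode-entities value)))
                     (other other))
                   attrs)))
    ((items ...) (map decode-attributes items))
    (other other)))

;; One pass gives the intended value ...
(decode-entities "&amp;lt;")                   ; => "&lt;"
;; ... a second pass (double decoding) changes it again.
(decode-entities (decode-entities "&amp;lt;")) ; => "<"
```

Running `decode-attributes' over the result of html->sxml is the sort of
pass that would start double decoding if the parser began decoding on
its own.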
Have a nice day,
Tomas
--
There are only two hard things in Computer Science:
cache invalidation, naming things and off-by-one errors.
* bug#75998: [guile-lib] html->sxml does not decode entities in attributes
2025-02-01 20:10 bug#75998: [guile-lib] html->sxml does not decode entities in attributes Tomas Volf
@ 2025-02-02 6:47 ` tomas
2025-02-02 9:57 ` Tomas Volf
0 siblings, 1 reply; 11+ messages in thread
From: tomas @ 2025-02-02 6:47 UTC (permalink / raw)
To: Tomas Volf; +Cc: 75998
On Sat, Feb 01, 2025 at 09:10:04PM +0100, Tomas Volf wrote:
>
> Hello,
>
> I think I found a bug in the htmlprag module in guile-lib. When parsing
> attributes, the values are not properly decoded:
>
> --8<---------------cut here---------------start------------->8---
> scheme@(guile-user)> ,use (htmlprag)
> scheme@(guile-user)> (html->sxml "<hr aaa=\"bbb&quot;ccc&apos;ddd\" />")
> $1 = (*TOP* (hr (@ (aaa "bbb&quot;ccc&apos;ddd"))))
> scheme@(guile-user)> (html->sxml "<a href=\"a&amp;b\" />")
> $2 = (*TOP* (a (@ (href "a&amp;b"))))
> --8<---------------cut here---------------end--------------->8---
>
> I think that $1 should be "bbb\"ccc'ddd" and $2 should be "a&b".
Ouch. Have you contacted Oleg Kiselyov about it? He's usually pretty
responsive and very friendly.
> The annoying part is that this cannot really be changed now, because
> people (me included) already have workarounds in place, and
> automatically decoding now would lead to double decoding.
>
> I see a few ways forward:
>
> 1. Document the current behavior and keep it as it is.
> 2. Add an argument #:decode-attributes, defaulting to #f, to the relevant
>    procedures, so that people can opt into the fixed behavior.
> 3. Introduce a parameter %decode-attributes, so that people can opt into
>    the fixed behavior.
>
> I am sure there are also other approaches possible.
If it were me, I'd take 2.
Cheers
--
tomás
* bug#75998: [guile-lib] html->sxml does not decode entities in attributes
2025-02-02 6:47 ` tomas
@ 2025-02-02 9:57 ` Tomas Volf
2025-02-02 21:48 ` David Pirotte
2025-02-03 14:30 ` Maxim Cournoyer
0 siblings, 2 replies; 11+ messages in thread
From: Tomas Volf @ 2025-02-02 9:57 UTC (permalink / raw)
To: tomas; +Cc: 75998
<tomas@tuxteam.de> writes:
> On Sat, Feb 01, 2025 at 09:10:04PM +0100, Tomas Volf wrote:
>>
>> Hello,
>>
>> I think I found a bug in the htmlprag module in guile-lib. When parsing
>> attributes, the values are not properly decoded:
>>
>> --8<---------------cut here---------------start------------->8---
>> scheme@(guile-user)> ,use (htmlprag)
>> scheme@(guile-user)> (html->sxml "<hr aaa=\"bbb&quot;ccc&apos;ddd\" />")
>> $1 = (*TOP* (hr (@ (aaa "bbb&quot;ccc&apos;ddd"))))
>> scheme@(guile-user)> (html->sxml "<a href=\"a&amp;b\" />")
>> $2 = (*TOP* (a (@ (href "a&amp;b"))))
>> --8<---------------cut here---------------end--------------->8---
>>
>> I think that $1 should be "bbb\"ccc'ddd" and $2 should be "a&b".
>
> Ouch. Have you contacted Oleg Kiselyov about it? He's usually pretty
> responsive and very friendly.
I did not. I did not find a "how to report bugs" section on guile-lib's
website, and on the (htmlprag) documentation section Oleg Kiselyov is
mentioned only in one sentence as a "Thanks".
I think I have managed to find his email in one Haskell paper of his, so
I will CC him on the bug report, as suggested.
Thanks,
Tomas
--
There are only two hard things in Computer Science:
cache invalidation, naming things and off-by-one errors.
* bug#75998: [guile-lib] html->sxml does not decode entities in attributes
2025-02-02 9:57 ` Tomas Volf
@ 2025-02-02 21:48 ` David Pirotte
2025-02-04 20:55 ` Tomas Volf
2025-02-03 14:30 ` Maxim Cournoyer
1 sibling, 1 reply; 11+ messages in thread
From: David Pirotte @ 2025-02-02 21:48 UTC (permalink / raw)
To: Tomas Volf; +Cc: 75998, tomas
Hello Tomas,
> I did not. I did not find a "how to report bugs" section on
> guile-lib's website
HACKING
INSTALL
NEWS
README http://git.savannah.nongnu.org/cgit/guile-lib.git/tree/README
all do mention, in their header [HACKING as an example]:
Guile-Lib - HACKING
===========================================
Please send Guile-Lib bug reports to
guile-devel@gnu.org
I'd recommend closing this bug report here, saying 'not a guile bug', and
reposting on guile-devel.
> and on the (htmlprag) documentation section Oleg Kiselyov is
> mentioned only in one sentence as a "Thanks". I think I have managed
> to find his email in one Haskell paper of his, so I will CC him on
> the bug report, as suggested.
Note that the version in guile-lib has been patched 'recently'; see
commit 84c420769, which I pushed on behalf of Maxim Cournoyer
<maxim.cournoyer@gmail.com>, who's the actual guile-lib maintainer.
David
* bug#75998: [guile-lib] html->sxml does not decode entities in attributes
2025-02-02 9:57 ` Tomas Volf
2025-02-02 21:48 ` David Pirotte
@ 2025-02-03 14:30 ` Maxim Cournoyer
2025-02-04 21:15 ` Tomas Volf
1 sibling, 1 reply; 11+ messages in thread
From: Maxim Cournoyer @ 2025-02-03 14:30 UTC (permalink / raw)
To: Tomas Volf; +Cc: 75998, tomas
Hi Tomas,
Thank you for reporting this issue.
Tomas Volf <~@wolfsden.cz> writes:
> <tomas@tuxteam.de> writes:
>
>> On Sat, Feb 01, 2025 at 09:10:04PM +0100, Tomas Volf wrote:
>>>
>>> Hello,
>>>
>>> I think I found a bug in the htmlprag module in guile-lib. When parsing
>>> attributes, the values are not properly decoded:
>>>
>>> --8<---------------cut here---------------start------------->8---
>>> scheme@(guile-user)> ,use (htmlprag)
>>> scheme@(guile-user)> (html->sxml "<hr aaa=\"bbb&quot;ccc&apos;ddd\" />")
>>> $1 = (*TOP* (hr (@ (aaa "bbb&quot;ccc&apos;ddd"))))
>>> scheme@(guile-user)> (html->sxml "<a href=\"a&amp;b\" />")
>>> $2 = (*TOP* (a (@ (href "a&amp;b"))))
>>> --8<---------------cut here---------------end--------------->8---
>>>
>>> I think that $1 should be "bbb\"ccc'ddd" and $2 should be "a&b".
>>
>> Ouch. Have you contacted Oleg Kiselyov about it? He's usually pretty
>> responsive and very friendly.
>
> I did not. I did not find a "how to report bugs" section on guile-lib's
> website, and on the (htmlprag) documentation section Oleg Kiselyov is
> mentioned only in one sentence as a "Thanks".
>
> I think I have managed to find his email in one Haskell paper of his, so
> I will CC him on the bug report, as suggested.
And also for contacting Oleg. I hope they can provide us with their
opinion on whether this is an actual bug or was designed that way. To
me, it's not clear whether html->sxml should alter the raw value of
attributes in any way. Users may have different use cases requiring them
to apply different transformations themselves? If we hard-code a decoding
scheme ourselves, then we force that choice onto users, no?
--
Thanks,
Maxim
* bug#75998: [guile-lib] html->sxml does not decode entities in attributes
2025-02-02 21:48 ` David Pirotte
@ 2025-02-04 20:55 ` Tomas Volf
0 siblings, 0 replies; 11+ messages in thread
From: Tomas Volf @ 2025-02-04 20:55 UTC (permalink / raw)
To: David Pirotte; +Cc: 75998, tomas
David Pirotte <david@altosw.be> writes:
> HACKING
> INSTALL
> NEWS
> README http://git.savannah.nongnu.org/cgit/guile-lib.git/tree/README
>
> all do mention, in their header [HACKING as an example]:
>
> Guile-Lib - HACKING
> ===========================================
>
> Please send Guile-Lib bug reports to
> guile-devel@gnu.org
>
> I'd recommend closing this bug report here, saying 'not a guile bug', and
> reposting on guile-devel.
Ah, I see. I admit I was checking only the website, and then I asked on
IRC. Will re-post on guile-devel as instructed.
Tomas
--
There are only two hard things in Computer Science:
cache invalidation, naming things and off-by-one errors.
* bug#75998: [guile-lib] html->sxml does not decode entities in attributes
2025-02-03 14:30 ` Maxim Cournoyer
@ 2025-02-04 21:15 ` Tomas Volf
2025-02-06 14:37 ` Maxim Cournoyer
0 siblings, 1 reply; 11+ messages in thread
From: Tomas Volf @ 2025-02-04 21:15 UTC (permalink / raw)
To: Maxim Cournoyer; +Cc: 75998, tomas
Maxim Cournoyer <maxim.cournoyer@gmail.com> writes:
> Hi Tomas,
>
> Thank you for reporting this issue.
>
> Tomas Volf <~@wolfsden.cz> writes:
>
>> <tomas@tuxteam.de> writes:
>>
>>> On Sat, Feb 01, 2025 at 09:10:04PM +0100, Tomas Volf wrote:
>>>>
>>>> Hello,
>>>>
>>>> I think I found a bug in the htmlprag module in guile-lib. When parsing
>>>> attributes, the values are not properly decoded:
>>>>
>>>> --8<---------------cut here---------------start------------->8---
>>>> scheme@(guile-user)> ,use (htmlprag)
>>>> scheme@(guile-user)> (html->sxml "<hr aaa=\"bbb&quot;ccc&apos;ddd\" />")
>>>> $1 = (*TOP* (hr (@ (aaa "bbb&quot;ccc&apos;ddd"))))
>>>> scheme@(guile-user)> (html->sxml "<a href=\"a&amp;b\" />")
>>>> $2 = (*TOP* (a (@ (href "a&amp;b"))))
>>>> --8<---------------cut here---------------end--------------->8---
>>>>
>>>> I think that $1 should be "bbb\"ccc'ddd" and $2 should be "a&b".
>>>
>>> Ouch. Have you contacted Oleg Kiselyov about it? He's usually pretty
>>> responsive and very friendly.
>>
>> I did not. I did not find a "how to report bugs" section on guile-lib's
>> website, and on the (htmlprag) documentation section Oleg Kiselyov is
>> mentioned only in one sentence as a "Thanks".
>>
>> I think I have managed to find his email in one Haskell paper of his, so
>> I will CC him on the bug report, as suggested.
>
> And also for contacting Oleg. I hope they can provide us with their
> opinion on whether this is an actual bug or was designed that way. To
> me, it's not clear whether html->sxml should alter the raw value of
> attributes in any way.
It already modifies the raw value for regular HTML text:
--8<---------------cut here---------------start------------->8---
scheme@(htmlprag)> (html->sxml "a&amp;b")
$10 = (*TOP* "a&b")
scheme@(htmlprag)> (sxml->html '(*TOP* "a&b"))
$13 = "a&amp;b"
--8<---------------cut here---------------end--------------->8---
I now noticed this also affects encoding:
--8<---------------cut here---------------start------------->8---
scheme@(htmlprag)> (sxml->html '(*TOP* (a (@ (href "a&b")))))
$12 = "<a href=\"a&b\"></a>"
--8<---------------cut here---------------end--------------->8---
I am not sure why attributes should be special here.
For what it is worth, (sxml simple) itself decodes even attributes:
--8<---------------cut here---------------start------------->8---
scheme@(htmlprag)> (xml->sxml "<a href=\"a&amp;b\"></a>")
$11 = (*TOP* (a (@ (href "a&b"))))
--8<---------------cut here---------------end--------------->8---
For comparison, Firefox seems to decode the attributes as well even in
HTML. That is actually how I discovered this issue, links I extracted
from <a href=".."> using html->sxml were not working until I ran a
decoding pass on them.
> Users may have different use cases requiring them to apply different
> transformations themselves?
I agree in the abstract, but do you have any specific use case in mind
where you would want to use the raw content of attributes (especially
since you already cannot get the raw content of text nodes)?
> If we hard-code a decoding scheme ourselves, then we force that choice
> onto users, no?
I agree we cannot hard-code or change it now due to compatibility
concerns, but adding #:decode-attributes to html->sxml,
#:encode-attributes to sxml->html and possibly %deencode-attributes?
parameter, in the spirit of %strict-tokenizer? would seem reasonable.
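To make the opt-in idea concrete, here is a minimal sketch under loud
assumptions: `attribute-value' and `decode-amp' are hypothetical
stand-ins (not htmlprag procedures), the decoder handles only &amp;, and
the parameter name is just one possible spelling. The point is only how
a keyword argument can default to a parameter so that both styles of
opting in work.

```scheme
(use-modules (ice-9 regex))

;; Hypothetical parameter; #f keeps the current (raw) behaviour.
(define %decode-attributes? (make-parameter #f))

;; Toy decoder for the sketch: handles only &amp;.
(define (decode-amp str)
  (regexp-substitute/global #f "&amp;" str 'pre "&" 'post))

;; Stand-in for the spot in a parser where an attribute value is
;; produced.  The keyword default is evaluated at call time, so it
;; picks up the parameter's current value.
(define* (attribute-value raw
                          #:key (decode-attributes (%decode-attributes?)))
  (if decode-attributes (decode-amp raw) raw))

;; Default: raw value, same as today.
(attribute-value "a&amp;b")
;; Opt in for a single call.
(attribute-value "a&amp;b" #:decode-attributes #t)
;; Opt in for a dynamic extent.
(parameterize ((%decode-attributes? #t))
  (attribute-value "a&amp;b"))
```

With this shape, existing callers see no change unless they pass the
keyword or bind the parameter.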
Tomas
--
There are only two hard things in Computer Science:
cache invalidation, naming things and off-by-one errors.
* bug#75998: [guile-lib] html->sxml does not decode entities in attributes
2025-02-04 21:15 ` Tomas Volf
@ 2025-02-06 14:37 ` Maxim Cournoyer
2025-02-06 22:34 ` David Pirotte
0 siblings, 1 reply; 11+ messages in thread
From: Maxim Cournoyer @ 2025-02-06 14:37 UTC (permalink / raw)
To: Tomas Volf; +Cc: 75998, tomas
Hi Tomas,
[...]
> It already modifies the raw value for regular HTML text:
>
> scheme@(htmlprag)> (html->sxml "a&amp;b")
> $10 = (*TOP* "a&b")
> scheme@(htmlprag)> (sxml->html '(*TOP* "a&b"))
> $13 = "a&amp;b"
>
>
> I now noticed this also affects encoding:
>
> scheme@(htmlprag)> (sxml->html '(*TOP* (a (@ (href "a&b")))))
> $12 = "<a href=\"a&b\"></a>"
>
>
> I am not sure why attributes should be special here.
>
> For what it is worth, (sxml simple) itself decodes even attributes:
>
> scheme@(htmlprag)> (xml->sxml "<a href=\"a&amp;b\"></a>")
> $11 = (*TOP* (a (@ (href "a&b"))))
>
> For comparison, Firefox seems to decode the attributes as well even in
> HTML. That is actually how I discovered this issue, links I extracted
> from <a href=".."> using html->sxml were not working until I ran a
> decoding pass on them.
Good points. Thanks for these.
>> Users may have different use cases requiring them to apply different
>> transformations themselves?
>
> I agree in the abstract, but do you have any specific use case in mind
> where you would want to use the raw content of attributes (especially
> since you already cannot get the raw content of text nodes)?
>> If we hard-code a decoding scheme ourselves, then we force that choice
>> onto users, no?
>
> I agree we cannot hard-code or change it now due to compatibility
> concerns, but adding #:decode-attributes to html->sxml,
> #:encode-attributes to sxml->html and possibly %deencode-attributes?
> parameter, in the spirit of %strict-tokenizer? would seem reasonable.
I see this situation and %strict-tokenizer as a bit different; the
htmlprag module was designed to be lenient, so being lenient could not
really be considered a bug :-). But this here could well be considered
a bug. So perhaps something we could do is fix this correctly and bump
at least the minor digit in our version. We're still in an unstable 0
version (the last one was 0.2.8.1), so technically we don't promise
stability yet (perhaps never, as this guile-lib project aims to be a lab
for components that could later be included in Guile). But we should
communicate this change well in the NEWS file.
--
Thanks,
Maxim
* bug#75998: [guile-lib] html->sxml does not decode entities in attributes
2025-02-06 14:37 ` Maxim Cournoyer
@ 2025-02-06 22:34 ` David Pirotte
2025-02-07 12:47 ` Maxim Cournoyer
0 siblings, 1 reply; 11+ messages in thread
From: David Pirotte @ 2025-02-06 22:34 UTC (permalink / raw)
To: Maxim Cournoyer; +Cc: 75998, Tomas Volf, tomas
Hi Maxim,
Tomas,
> But this here could well be considered a bug. So perhaps something
> we could do is fix this correctly and bump at least the minor digit
> in our version. We're still in an unstable 0 version (the last one was
> 0.2.8.1), so technically we don't promise stability yet (perhaps
> never, as this guile-lib project aims to be a lab for components that
> could later be included in Guile). But we should communicate this
> change well in the NEWS file.
+1 for
  a proper fix
  bump the version to 0.3.0
  well-written NEWS entry(ies)
    clearly state that the htmlprag module was fixed, in a
    way that users who locally applied their own
    workaround for the fixed problem/bug will have to review
    their code and adapt to this new version ...
David
* bug#75998: [guile-lib] html->sxml does not decode entities in attributes
2025-02-06 22:34 ` David Pirotte
@ 2025-02-07 12:47 ` Maxim Cournoyer
2025-02-09 11:50 ` Tomas Volf
0 siblings, 1 reply; 11+ messages in thread
From: Maxim Cournoyer @ 2025-02-07 12:47 UTC (permalink / raw)
To: David Pirotte; +Cc: 75998, Tomas Volf, tomas
Hi,
David Pirotte <david@altosw.be> writes:
> Hi Maxim,
> Tomas,
>
>> But this here could well be considered a bug. So perhaps something
>> we could do is fix this correctly and bump at least the minor digit
>> in our version. We're still in an unstable 0 version (the last one was
>> 0.2.8.1), so technically we don't promise stability yet (perhaps
>> never, as this guile-lib project aims to be a lab for components that
>> could later be included in Guile). But we should communicate this
>> change well in the NEWS file.
>
> +1 for
>
>   a proper fix
>   bump the version to 0.3.0
>   well-written NEWS entry(ies)
>     clearly state that the htmlprag module was fixed, in a
>     way that users who locally applied their own
>     workaround for the fixed problem/bug will have to review
>     their code and adapt to this new version ...
Thanks for weighing in.
Tomas, is it a fix you'd be interested in contributing? Otherwise, I'll
get to it but my hands are rather full at the moment :-).
--
Thanks,
Maxim
* bug#75998: [guile-lib] html->sxml does not decode entities in attributes
2025-02-07 12:47 ` Maxim Cournoyer
@ 2025-02-09 11:50 ` Tomas Volf
0 siblings, 0 replies; 11+ messages in thread
From: Tomas Volf @ 2025-02-09 11:50 UTC (permalink / raw)
To: Maxim Cournoyer; +Cc: 75998, tomas, David Pirotte
Hi,
Maxim Cournoyer <maxim.cournoyer@gmail.com> writes:
> Tomas, is it a fix you'd be interested in contributing? Otherwise, I'll
> get to it but my hands are rather full at the moment :-).
To quote myself from the other thread:
> Probably not. I have spent 20 minutes staring into the file and do not
> really have any idea where to start (ok, probably somewhere around
> `scan-attr'). So I cannot really promise I will be able to work on this
> (at least not soon), since I assume it will take me a long time to
> figure out.
So I do not have any immediate plans to start working on this. :/
Tomas
--
There are only two hard things in Computer Science:
cache invalidation, naming things and off-by-one errors.