all messages for Emacs-related lists mirrored at yhetil.org
* bug#63125: 30.0.50; [BUG] last argument of libxml2-parse-html-region has no effect?
@ 2023-04-27 16:19 Ruijie Yu via Bug reports for GNU Emacs, the Swiss army knife of text editors
  2023-04-27 17:08 ` Eli Zaretskii
  0 siblings, 1 reply; 8+ messages in thread
From: Ruijie Yu via Bug reports for GNU Emacs, the Swiss army knife of text editors @ 2023-04-27 16:19 UTC (permalink / raw)
  To: 63125

[I know I'm running a one-month old master.  I will try to reproduce
this issue again within a day with an up-to-date master unless someone
else does it first.  And -Q as well.]

I'm trying out the function `libxml-parse-html-region' as recommended
by a thread in help-gnu-emacs.  However, I discovered that the last
argument of this function does not help me normalize a relative url.

Reproducer:

Visit the attached toy html file.  I imagine that it is hosted at
"https://example.com/good/day".

Run this snippet:

    (pp (libxml-parse-html-region
         (point-min) (point-max)
         "https://example.com/good/day"))

Compare it with this snippet:

    (pp (libxml-parse-html-region
         (point-min) (point-max)))

What I get is this result for both snippets (which is shown twice, once
"pretty-printed", and once returned as a string):

--8<---------------cut here---------------start------------->8---
(html nil
      (body nil "\n    "
            (a
             ((href . "/hello"))
             "1")
            "\n    "
            (a
             ((href . "../world"))
             "2")
            "\n    "
            (a
             ((href . "good"))
             "3")
            "\n    "
            (a
             ((href . "morning/or/night"))
             "4")
            "\n  "))
--8<---------------cut here---------------end--------------->8---

Notice that the href values are not normalized: they are copied
verbatim from the original html file.

If I understand the docstring correctly, the last argument of
`libxml-parse-html-region', when specified as a url string, should be
used as the "base point" of resolving relative paths found within the
html document.  But the <a href=xxx> paths are not resolved at the
moment.
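
For a check that does not depend on the attached file, something like
the sketch below could be used.  It is only an illustration: the helper
name `my/base-url-affects-parse-p' and the inline HTML are made up for
this example, and it assumes an Emacs built with libxml2 support.

```elisp
;; Hypothetical helper, not part of Emacs.  Returns non-nil when passing
;; BASE-URL changes the tree that `libxml-parse-html-region' produces
;; for the string HTML, and nil when both parses are `equal'.
(defun my/base-url-affects-parse-p (html base-url)
  (unless (and (fboundp 'libxml-available-p) (libxml-available-p))
    (user-error "This Emacs was built without libxml2 support"))
  (with-temp-buffer
    (insert html)
    (not (equal (libxml-parse-html-region (point-min) (point-max))
                (libxml-parse-html-region (point-min) (point-max)
                                          base-url)))))

;; Stand-in input; the real hello.html may differ.
(my/base-url-affects-parse-p
 "<html><body><a href=\"../world\">2</a></body></html>"
 "https://example.com/good/day")
```

If the two parses are identical, as they are for me with the attached
file, the call returns nil.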

---

In GNU Emacs 30.0.50 (build 1, x86_64-pc-linux-gnu, GTK+ Version
 3.24.37, cairo version 1.17.8) of 2023-03-25 built on ruijie
Repository revision: db7e95531ac36ae842787b6c5f2859d0642c78cc
Repository branch: makepkg
System Description: Arch Linux

Configured using:
 'configure --prefix=/usr --sysconfdir=/etc --libexecdir=/usr/lib
 --localstatedir=/var --mandir=/usr/share/man --with-gameuser=:games
 --with-modules --without-libotf --without-m17n-flt --without-gconf
 --enable-link-time-optimization --with-native-compilation=yes
 --with-xinput2 --with-pgtk --without-xaw3d --with-sound=alsa
 --with-tree-sitter '--program-transform-name=s/\([ec]tags\)/\1.emacs/'
 'CFLAGS=-march=x86-64 -mtune=generic -O2 -pipe -fno-plt -fexceptions
 -Wp,-D_FORTIFY_SOURCE=2 -Wformat -Werror=format-security
 -fstack-clash-protection -fcf-protection'
 LDFLAGS=-Wl,-O1,--sort-common,--as-needed,-z,relro,-z,now'

Configured features:
ACL CAIRO DBUS FREETYPE GIF GLIB GMP GNUTLS GPM GSETTINGS HARFBUZZ JPEG
JSON LCMS2 LIBSYSTEMD LIBXML2 MODULES NATIVE_COMP NOTIFY INOTIFY PDUMPER
PGTK PNG RSVG SECCOMP SOUND SQLITE3 THREADS TIFF TOOLKIT_SCROLL_BARS
TREE_SITTER WEBP XIM GTK3 ZLIB

Important settings:
  value of $LANG: en_US.UTF-8
  value of $XMODIFIERS: @im=fcitx
  locale-coding-system: utf-8-unix

-- 
Best,


RY

[Please note that this mail might go to spam due to some
misconfiguration in my mail server -- still investigating.]






* bug#63125: 30.0.50; [BUG] last argument of libxml2-parse-html-region has no effect?
  2023-04-27 16:19 bug#63125: 30.0.50; [BUG] last argument of libxml2-parse-html-region has no effect? Ruijie Yu via Bug reports for GNU Emacs, the Swiss army knife of text editors
@ 2023-04-27 17:08 ` Eli Zaretskii
  2023-04-28  1:30   ` Ruijie Yu via Bug reports for GNU Emacs, the Swiss army knife of text editors
  0 siblings, 1 reply; 8+ messages in thread
From: Eli Zaretskii @ 2023-04-27 17:08 UTC (permalink / raw)
  To: Ruijie Yu; +Cc: 63125

> Date: Fri, 28 Apr 2023 00:19:22 +0800
> From:  Ruijie Yu via "Bug reports for GNU Emacs,
>  the Swiss army knife of text editors" <bug-gnu-emacs@gnu.org>
> 
> I'm trying out the function `libxml-parse-html-region' as recommended
> by a thread in help-gnu-emacs.  However, I discovered that the last
> argument of this function does not help me normalize a relative url.
> 
> Reproducer:
> 
> Visit the attached toy html file.  I imagine that it is hosted at
> "https://example.com/good/day".
> 
> Run this snippet:
> 
>     (pp (libxml-parse-html-region
>          (point-min) (point-max)
>          "https://example.com/good/day"))
> 
> Compare it with this snippet:
> 
>     (pp (libxml-parse-html-region
>          (point-min) (point-max)))
> 
> What I get is this result for both snippets (which is shown twice, once
> "pretty-printed", and once returned as a string):
> 
> --8<---------------cut here---------------start------------->8---
> (html nil
>       (body nil "\n    "
>             (a
>              ((href . "/hello"))
>              "1")
>             "\n    "
>             (a
>              ((href . "../world"))
>              "2")
>             "\n    "
>             (a
>              ((href . "good"))
>              "3")
>             "\n    "
>             (a
>              ((href . "morning/or/night"))
>              "4")
>             "\n  "))
> --8<---------------cut here---------------end--------------->8---
> 
> Notice that the href values are not normalized: they are copied
> verbatim from the original html file.
> 
> If I understand the docstring correctly, the last argument of
> `libxml-parse-html-region', when specified as a url string, should be
> used as the "base point" of resolving relative paths found within the
> html document.  But the <a href=xxx> paths are not resolved at the
> moment.

If you look at xml.c, you will see that we just call a libxml function
passing it this URL.  So if anything isn't as expected, the answer is
in libxml, not in Emacs.






* bug#63125: 30.0.50; [BUG] last argument of libxml2-parse-html-region has no effect?
  2023-04-27 17:08 ` Eli Zaretskii
@ 2023-04-28  1:30   ` Ruijie Yu via Bug reports for GNU Emacs, the Swiss army knife of text editors
  2023-04-28 10:18     ` Ruijie Yu via Bug reports for GNU Emacs, the Swiss army knife of text editors
  0 siblings, 1 reply; 8+ messages in thread
From: Ruijie Yu via Bug reports for GNU Emacs, the Swiss army knife of text editors @ 2023-04-28  1:30 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 63125



Eli Zaretskii <eliz@gnu.org> writes:

>> Date: Fri, 28 Apr 2023 00:19:22 +0800
>> From:  Ruijie Yu via "Bug reports for GNU Emacs,
>>  the Swiss army knife of text editors" <bug-gnu-emacs@gnu.org>
>> 
>> I'm trying out the function `libxml-parse-html-region' as recommended
>> by a thread in help-gnu-emacs.  However, I discovered that the last
>> argument of this function does not help me normalize a relative url.
>> 
>> Reproducer:
>> 
>> Visit the attached toy html file.  I imagine that it is hosted at
>> "https://example.com/good/day".
>> 
>> Run this snippet:
>> 
>>     (pp (libxml-parse-html-region
>>          (point-min) (point-max)
>>          "https://example.com/good/day"))
>> 
>> Compare it with this snippet:
>> 
>>     (pp (libxml-parse-html-region
>>          (point-min) (point-max)))
>> 
>> What I get is this result for both snippets (which is shown twice, once
>> "pretty-printed", and once returned as a string):
>> 
>> --8<---------------cut here---------------start------------->8---
>> (html nil
>>       (body nil "\n    "
>>             (a
>>              ((href . "/hello"))
>>              "1")
>>             "\n    "
>>             (a
>>              ((href . "../world"))
>>              "2")
>>             "\n    "
>>             (a
>>              ((href . "good"))
>>              "3")
>>             "\n    "
>>             (a
>>              ((href . "morning/or/night"))
>>              "4")
>>             "\n  "))
>> --8<---------------cut here---------------end--------------->8---
>> 
>> Notice that the href values are not normalized: they are copied
>> verbatim from the original html file.
>> 
>> If I understand the docstring correctly, the last argument of
>> `libxml-parse-html-region', when specified as a url string, should be
>> used as the "base point" of resolving relative paths found within the
>> html document.  But the <a href=xxx> paths are not resolved at the
>> moment.
>
> If you look at xml.c, you will see that we just call a libxml function
> passing it this URL.  So if anything isn't as expected, the answer is
> in libxml, not in Emacs.

Thank you for pointing that out.  I will take a look at its source in a
day or two.  I am also upgrading it from 2.10.3-2 to 2.10.4-2, and will
see if that changes anything.

If I end up deciding that it is a libxml2 bug, I'll file a bug there and
link to this bug.

For completeness, here attached is the toy html file that I forgot to
attach in my initial report.


[-- Attachment #2: hello.html --]
[-- Type: text/html, Size: 152 bytes --]



-- 
Best,


RY

[Please note that this mail might go to spam due to some
misconfiguration in my mail server -- still investigating.]


* bug#63125: 30.0.50; [BUG] last argument of libxml2-parse-html-region has no effect?
  2023-04-28  1:30   ` Ruijie Yu via Bug reports for GNU Emacs, the Swiss army knife of text editors
@ 2023-04-28 10:18     ` Ruijie Yu via Bug reports for GNU Emacs, the Swiss army knife of text editors
  2023-04-28 10:40       ` bug#63125: 30.0.50; [BUG] last argument of libxml-parse-html-region " Ruijie Yu via Bug reports for GNU Emacs, the Swiss army knife of text editors
  0 siblings, 1 reply; 8+ messages in thread
From: Ruijie Yu via Bug reports for GNU Emacs, the Swiss army knife of text editors @ 2023-04-28 10:18 UTC (permalink / raw)
  To: Ruijie Yu; +Cc: Eli Zaretskii, 63125


Ruijie Yu <ruijie@netyu.xyz> writes:
>>
>> If you look at xml.c, you will see that we just call a libxml function
>> passing it this URL.  So if anything isn't as expected, the answer is
>> in libxml, not in Emacs.
>
> Thank you for pointing that out.  I will take a look at its source in a
> day or two.  I am also upgrading it from 2.10.3-2 to 2.10.4-2, and will
> see if that changes anything.

No difference -- as expected.

> If I end up deciding that it is a libxml2 bug, I'll file a bug there and
> link to this bug.

I have filed an issue [1] in libxml2.  We'll see what they say about it.

FTR, [2] is the documentation of libxml2's htmlReadMemory()
function -- though it does not say much.

[1]: https://gitlab.gnome.org/GNOME/libxml2/-/issues/525
[2]: https://gnome.pages.gitlab.gnome.org/libxml2/devhelp/libxml2-HTMLparser.html#htmlReadMemory

-- 
Best,


RY

[Please note that this mail might go to spam due to some
misconfiguration in my mail server -- still investigating.]






* bug#63125: 30.0.50; [BUG] last argument of libxml-parse-html-region has no effect?
  2023-04-28 10:18     ` Ruijie Yu via Bug reports for GNU Emacs, the Swiss army knife of text editors
@ 2023-04-28 10:40       ` Ruijie Yu via Bug reports for GNU Emacs, the Swiss army knife of text editors
  2023-04-28 11:31         ` Eli Zaretskii
  0 siblings, 1 reply; 8+ messages in thread
From: Ruijie Yu via Bug reports for GNU Emacs, the Swiss army knife of text editors @ 2023-04-28 10:40 UTC (permalink / raw)
  To: Ruijie Yu; +Cc: Eli Zaretskii, 63125


Ruijie Yu <ruijie@netyu.xyz> writes:

> Ruijie Yu <ruijie@netyu.xyz> writes:
>>>
>>> If you look at xml.c, you will see that we just call a libxml function
>>> passing it this URL.  So if anything isn't as expected, the answer is
>>> in libxml, not in Emacs.
>>
>> Thank you for pointing that out.  I will take a look at its source in a
>> day or two.  I am also upgrading it from 2.10.3-2 to 2.10.4-2, and will
>> see if that changes anything.
>
> No difference -- as expected.
>
>> If I end up deciding that it is a libxml2 bug, I'll file a bug there and
>> link to this bug.
>
> I have filed an issue [1] in libxml2.  We'll see what they say about it.
>
> FTR, [2] is the documentation of libxml2's htmlReadMemory()
> function -- though it does not say much.
>
> [1]: https://gitlab.gnome.org/GNOME/libxml2/-/issues/525
> [2]: https://gnome.pages.gitlab.gnome.org/libxml2/devhelp/libxml2-HTMLparser.html#htmlReadMemory

I just got a response from one of libxml2's maintainers.

It seems that the docstring for `libxml-parse-html-region' is wrong:
this argument has never served the purpose of resolving relative URLs.
It was only used for error messages.  So I suggest that we modify the
docstring of this function and `libxml-parse-xml-region' to reflect this
fact.

I also don't know if, based on this new information, you want to mark
this parameter obsolete.  I see no immediate need, though.

Should I send a patch for the documentation change, or will you do it?

-- 
Best,


RY

[Please note that this mail might go to spam due to some
misconfiguration in my mail server -- still investigating.]






* bug#63125: 30.0.50; [BUG] last argument of libxml-parse-html-region has no effect?
  2023-04-28 10:40       ` bug#63125: 30.0.50; [BUG] last argument of libxml-parse-html-region " Ruijie Yu via Bug reports for GNU Emacs, the Swiss army knife of text editors
@ 2023-04-28 11:31         ` Eli Zaretskii
  2023-04-29  0:58           ` Ruijie Yu via Bug reports for GNU Emacs, the Swiss army knife of text editors
  0 siblings, 1 reply; 8+ messages in thread
From: Eli Zaretskii @ 2023-04-28 11:31 UTC (permalink / raw)
  To: Ruijie Yu; +Cc: 63125

> From: Ruijie Yu <ruijie@netyu.xyz>
> Cc: Eli Zaretskii <eliz@gnu.org>, 63125@debbugs.gnu.org
> Date: Fri, 28 Apr 2023 18:40:35 +0800
> 
> > I have filed an issue [1] in libxml2.  We'll see what they say about it.
> >
> > FTR, [2] is the documentation of libxml2's htmlReadMemory()
> > function -- though it does not say much.
> >
> > [1]: https://gitlab.gnome.org/GNOME/libxml2/-/issues/525
> > [2]: https://gnome.pages.gitlab.gnome.org/libxml2/devhelp/libxml2-HTMLparser.html#htmlReadMemory
> 
> I just got a response from one of libxml2's maintainers.
> 
> It seems that the docstring for `libxml-parse-html-region' is wrong:
> this argument has never served the purpose of resolving relative URLs.
> It was only used for error messages.  So I suggest that we modify the
> docstring of this function and `libxml-parse-xml-region' to reflect this
> fact.

The response doesn't say much.  What is this "base URL" argument used
for, and why is it named "base URL"?  What does it mean "used for error
messages"?  And where is the up-to-date and accurate documentation of
this function, which explains what this argument is for?

Without knowing all that, we cannot fix our documentation, let alone
code.






* bug#63125: 30.0.50; [BUG] last argument of libxml-parse-html-region has no effect?
  2023-04-28 11:31         ` Eli Zaretskii
@ 2023-04-29  0:58           ` Ruijie Yu via Bug reports for GNU Emacs, the Swiss army knife of text editors
  2023-04-29  6:40             ` Eli Zaretskii
  0 siblings, 1 reply; 8+ messages in thread
From: Ruijie Yu via Bug reports for GNU Emacs, the Swiss army knife of text editors @ 2023-04-29  0:58 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Lars Ingebrigtsen, 63125


Eli Zaretskii <eliz@gnu.org> writes:

>> From: Ruijie Yu <ruijie@netyu.xyz>
>> Cc: Eli Zaretskii <eliz@gnu.org>, 63125@debbugs.gnu.org
>> Date: Fri, 28 Apr 2023 18:40:35 +0800
>> 
>> > I have filed an issue [1] in libxml2.  We'll see what they say about it.
>> >
>> > FTR, [2] is the documentation of libxml2's htmlReadMemory()
>> > function -- though it does not say much.
>> >
>> > [1]: https://gitlab.gnome.org/GNOME/libxml2/-/issues/525
>> > [2]:
>> > https://gnome.pages.gitlab.gnome.org/libxml2/devhelp/libxml2-HTMLparser.html#htmlReadMemory
>> 
>> I just got a response from one of libxml2's maintainers.
>> 
>> It seems that the docstring for `libxml-parse-html-region' is wrong:
>> this argument has never served the purpose of resolving relative URLs.
>> It was only used for error messages.  So I suggest that we modify the
>> docstring of this function and `libxml-parse-xml-region' to reflect this
>> fact.
>
> The response doesn't say much.  What is this "base URL" argument used
> for, and why is it named "base URL"?  What does it mean "used for error
> messages"?  And where is the up-to-date and accurate documentation of
> this function, which explains what this argument is for?
>
> Without knowing all that, we cannot fix our documentation, let alone
> code.

The "base-url" is an argument to the Elisp function
`libxml-parse-html-region'.  I added Lars to the CC, who originally
introduced this function according to git-blame, and who may have a
better idea.

The following portion is my impression, but I'm happy to pass any
questions you still have to the libxml2 devs if you want (or you can
comment there directly in the linked issue on GNOME's GitLab instance).

-----

As you pointed out, the arguments of the Elisp function are passed,
with minimal transformations, to the libxml2 function
`htmlReadMemory()'.  This C function takes an argument `url', which is
the string `base-url', or the empty string if `base-url' is nil.

According to Nick (the libxml2 maintainer) and my interpretation, the
`url' parameter of the libxml2 function is simply stored inside the
`url' field of a `xmlDoc' struct, to be used when an error message needs
to be displayed.  So, the `url' parameter practically does nothing for
us, since we disable all libxml2-level warnings and errors in calling
`htmlReadMemory()'.
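
If relative links do need to be resolved, that probably has to happen
on the Lisp side, after parsing.  Below is a minimal sketch of one way
to do it, assuming the stock `dom.el' and `url-expand.el' libraries.
The name `my/dom-resolve-hrefs' is made up for this example and is not
an existing Emacs function; it only touches the href of <a> elements
and ignores any <base> element in the document.

```elisp
(require 'dom)
(require 'url-expand)

;; Hypothetical helper, not part of Emacs.
(defun my/dom-resolve-hrefs (dom base-url)
  "Destructively expand each <a href> in DOM against BASE-URL.
Return DOM.  <a> nodes without an href attribute are left alone."
  (dolist (node (dom-by-tag dom 'a))
    (let ((href (dom-attr node 'href)))
      (when href
        (dom-set-attribute node 'href
                           (url-expand-file-name href base-url)))))
  dom)

;; Example call on a hand-built tree shaped like the parser's output.
(my/dom-resolve-hrefs
 (list 'html nil
       (list 'body nil
             (list 'a (list (cons 'href "/hello")) "1")))
 "https://example.com/good/day")
```

In real use the first argument would be the return value of
`libxml-parse-html-region', called without the base-url argument.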

I put this URL [1] in the issue assuming that it is the documentation,
and Nick did not comment on it.  So this is probably the up-to-date,
albeit not very elaborate, documentation for the function.

[1]: https://gnome.pages.gitlab.gnome.org/libxml2/devhelp/libxml2-HTMLparser.html#htmlReadMemory

-- 
Best,


RY

[Please note that this mail might go to spam due to some
misconfiguration in my mail server -- still investigating.]






* bug#63125: 30.0.50; [BUG] last argument of libxml-parse-html-region has no effect?
  2023-04-29  0:58           ` Ruijie Yu via Bug reports for GNU Emacs, the Swiss army knife of text editors
@ 2023-04-29  6:40             ` Eli Zaretskii
  0 siblings, 0 replies; 8+ messages in thread
From: Eli Zaretskii @ 2023-04-29  6:40 UTC (permalink / raw)
  To: Ruijie Yu; +Cc: larsi, 63125-done

> From: Ruijie Yu <ruijie@netyu.xyz>
> Cc: 63125@debbugs.gnu.org, Lars Ingebrigtsen <larsi@gnus.org>
> Date: Sat, 29 Apr 2023 08:58:03 +0800
> 
> > The response doesn't say much.  What is this "base URL" argument used
> > for, and why is it named "base URL"?  What does it mean "used for error
> > messages"?  And where is the up-to-date and accurate documentation of
> > this function, which explains what this argument is for?
> >
> > Without knowing all that, we cannot fix our documentation, let alone
> > code.
> 
> The "base-url" is an argument to the Elisp function
> `libxml-parse-html-region'.  I added Lars to the CC, who originally
> introduced this function according to git-blame, and who may have a
> better idea.
> 
> The following portion is my impression, but I'm happy to pass any
> questions you still have to the libxml2 devs if you want (or you can
> comment there directly in the linked issue on GNOME's GitLab instance).
> 
> -----
> 
> As you pointed out, the arguments of the Elisp function are passed,
> with minimal transformations, to the libxml2 function
> `htmlReadMemory()'.  This C function takes an argument `url', which is
> the string `base-url', or the empty string if `base-url' is nil.
> 
> According to Nick (the libxml2 maintainer) and my interpretation, the
> `url' parameter of the libxml2 function is simply stored inside the
> `url' field of a `xmlDoc' struct, to be used when an error message needs
> to be displayed.  So, the `url' parameter practically does nothing for
> us, since we disable all libxml2-level warnings and errors in calling
> `htmlReadMemory()'.
> 
> I put this URL [1] in the issue assuming that it is the documentation,
> and Nick did not comment on it.  So this is probably the up-to-date,
> albeit not very elaborate, documentation for the function.
> 
> [1]: https://gnome.pages.gitlab.gnome.org/libxml2/devhelp/libxml2-HTMLparser.html#htmlReadMemory

Thanks.  So I've now updated our documentation with this information,
and I'm therefore closing the bug.




