retitle 17958 SHR: base handling broken (shr-parse-base, shr-expand-url)
tag 17958 + patch
thanks

>>>>> Ivan Shmakov writes:

[…]

 > However, I believe that the real culprit is shr-expand-url, which
 > mishandles the nil ‘uri’ case:

 > (mapcar (lambda (x) (shr-expand-url x "http://example.com/welcome/"))
 >         '("hello" "/world" nil))
 > ;; ⇒
 > ("http://example.com/welcome/hello"
 >  "http://example.com/world"
 >  "http://example.com")

 > My expectation for the last result would be the ‘base’ argument
 > unchanged (i. e., http://example.com/welcome/.)

 > Thus, I suggest changing shr-expand-url to return not the 0th element
 > of the (parsed) ‘base’ (see below), but the 3rd.

 >   (cond ((or (not url)
 >              (not base)
 >              (string-match "\\`[a-z]*:" url))
 >          ;; Absolute URL.
 >          (or url (car base)))

 > [1] https://tools.wmflabs.org/guc/?user=2001:db8:1337::cafe

	As it seems, there’s one more issue with SHR “base” handling.
	Namely, the <base> URI may itself be relative, and SHR fails to
	handle that properly.  As per [2]:

    To set the frozen base URL, resolve the value of the element's href
    content attribute relative to the Document's fallback base URL; if
    this is successful, set the frozen base URL to the resulting
    absolute URL, otherwise, set the frozen base URL to the fallback
    base URL.

	The SHR behavior doesn’t match the above.  Consider, e. g.:

(let ((shr-base (shr-parse-base "http://example.org/")))
  (shr-tag-base '((:href . "/relative")))
  shr-base)
;; ⇒ ("" "/" nil "/relative")

	With the patch MIMEd (which also fixes the issue described in my
	initial bug report), it instead gives what I deem to be the
	correct result:

(let ((shr-base (shr-parse-base "http://example.org/")))
  (shr-tag-base '((:href . "/relative")))
  shr-base)
;; ⇒ ("http://example.org" "/" "http" "http://example.org/relative")

	For proper compliance with the specification, SHR should also
	ignore all the <base> elements but the first one, but I guess
	that may be fixed separately.

	Relative URIs appear, e. g., on the Internet Archive’s Wayback
	Machine pages, when the original page uses the <base> element.

[2] http://www.w3.org/TR/html5/document-metadata.html#the-base-element

-- 
FSF associate member #7257  http://boycottsystemd.org/  … 3013 B6A0 230E 334A
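
	For illustration only, the “frozen base URL” rule quoted from [2]
	can be sketched outside of SHR roughly as follows.  This is not
	the code from the patch: it assumes the stock url library
	(url-expand-file-name from url-expand.el) is an acceptable
	stand-in for the resolution step, and my/frozen-base-url is a
	hypothetical name that does not exist in shr.el.

```elisp
(require 'url-expand)   ; provides `url-expand-file-name'

;; Hypothetical helper, not part of shr.el or of the patch.
(defun my/frozen-base-url (href fallback)
  "Return HREF resolved against FALLBACK, or FALLBACK if that fails.
HREF is the href value of a base element (a string or nil);
FALLBACK is the document's fallback base URL (a string)."
  (if (or (null href) (string= href ""))
      ;; No usable href: the fallback base URL is kept unchanged.
      fallback
    (condition-case nil
        ;; A relative HREF is resolved against FALLBACK; an absolute
        ;; one is returned by `url-expand-file-name' as given.
        (url-expand-file-name href fallback)
      ;; Resolution signalled an error: use the fallback base URL.
      (error fallback))))

(my/frozen-base-url "/relative" "http://example.org/")
(my/frozen-base-url nil "http://example.com/welcome/")
```

	The nil case mirrors the expectation stated for shr-expand-url
	above: with no URI to resolve, the base comes back unchanged
	rather than truncated to its scheme and host.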