To fetch URL, extract <title> element?

* To fetch URL, extract <title> element?
@ 2020-11-11  9:28 Jean Louis
  2020-11-11 15:27 ` Michael Heerdegen
  0 siblings, 1 reply; 6+ messages in thread
From: Jean Louis @ 2020-11-11  9:28 UTC (permalink / raw)
  To: GNU Emacs Help

What is the standard built-in way to fetch an http[s] URL?

I need to get the page as a string so that I can parse <title>, and I
am thinking of something like this:

(setq a
      (with-temp-buffer
	(url-retrieve "https://www.gnu.org" 'identity)
	(buffer-string)))


When researching `eww' I found this function, which is the chunk that
takes care of parsing and punycode in the URL. I do not find it
useful, as I cannot easily fetch a URL without thinking of those details.

;;;###autoload
(defun eww (url &optional arg buffer)
  "Fetch URL and render the page.
If the input doesn't look like an URL or a domain name, the
word(s) will be searched for via `eww-search-prefix'.

If called with a prefix ARG, use a new buffer instead of reusing
the default EWW buffer.

If BUFFER, the data to be rendered is in that buffer.  In that
case, this function doesn't actually fetch URL.  BUFFER will be
killed after rendering."
  (interactive
   (let ((uris (eww-suggested-uris)))
     (list (read-string (format-prompt "Enter URL or keywords"
                                       (and uris (car uris)))
                        nil 'eww-prompt-history uris)
           (prefix-numeric-value current-prefix-arg))))
  (setq url (eww--dwim-expand-url url))
  (pop-to-buffer-same-window
   (cond
    ((eq arg 4)
     (generate-new-buffer "*eww*"))
    ((eq major-mode 'eww-mode)
     (current-buffer))
    (t
     (get-buffer-create "*eww*"))))
  (eww-setup-buffer)
  ;; Check whether the domain only uses "Highly Restricted" Unicode
  ;; IDNA characters.  If not, transform to punycode to indicate that
  ;; there may be funny business going on.
  (let ((parsed (url-generic-parse-url url)))
    (when (url-host parsed)
      (unless (puny-highly-restrictive-domain-p (url-host parsed))
        (setf (url-host parsed) (puny-encode-domain (url-host parsed)))))
    ;; When the URL is on the form "http://a/../../../g", chop off all
    ;; the leading "/.."s.
    (when (url-filename parsed)
      (while (string-match "\\`/[.][.]/" (url-filename parsed))
        (setf (url-filename parsed) (substring (url-filename parsed) 3))))
    (setq url (url-recreate-url parsed)))
  (plist-put eww-data :url url)
  (plist-put eww-data :title "")
  (eww-update-header-line-format)
  (let ((inhibit-read-only t))
    (insert (format "Loading %s..." url))
    (goto-char (point-min)))
  (let ((url-mime-accept-string eww-accept-content-types))
    (if buffer
        (let ((eww-buffer (current-buffer)))
          (with-current-buffer buffer
            (eww-render nil url nil eww-buffer)))
      (eww-retrieve url #'eww-render
                    (list url nil (current-buffer))))))




* Re: To fetch URL, extract <title> element?
  2020-11-11  9:28 To fetch URL, extract <title> element? Jean Louis
@ 2020-11-11 15:27 ` Michael Heerdegen
  2020-11-11 18:04   ` Jean Louis
  0 siblings, 1 reply; 6+ messages in thread
From: Michael Heerdegen @ 2020-11-11 15:27 UTC (permalink / raw)
  To: help-gnu-emacs

Jean Louis <bugs@gnu.support> writes:

> What is the standard built-in way to fetch the http[s] URL?

`url-retrieve' sounds appropriate.


> I need to get string to parse <title> [...]

If I understand what you want correctly, eww seems to get the title with
`eww-tag-title'.
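
For context, eww works on a parsed DOM rather than on raw text, and
`eww-tag-title' is handed the <title> node while the page is rendered.
A minimal sketch of the same idea without eww follows. It assumes an
Emacs built with libxml2, and `my/html-title' is a hypothetical helper
name, not part of Emacs.

```elisp
(require 'dom)
(require 'subr-x)   ; for `string-trim' on older Emacs versions

(defun my/html-title (html)
  "Return the <title> text of the string HTML, or nil if there is none.
Hypothetical helper; returns nil when libxml2 support is missing."
  (when (fboundp 'libxml-parse-html-region)
    (with-temp-buffer
      (insert html)
      ;; Parse the whole buffer into a DOM and take the first <title> node.
      (let* ((dom (libxml-parse-html-region (point-min) (point-max)))
             (node (car (dom-by-tag dom 'title))))
        (when node
          (string-trim (dom-text node)))))))
```

Parsing the DOM avoids most of the regexp pitfalls around attributes
and whitespace inside the tag, at the cost of needing libxml2.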


> When researching `eww' I found this function, which is the chunk that
> takes care of parsing and punycode in the URL. I do not find it
> useful, as I cannot easily fetch a URL without thinking of those details.
> [...]

Now I couldn't parse this.


Michael.





* Re: To fetch URL, extract <title> element?
  2020-11-11 15:27 ` Michael Heerdegen
@ 2020-11-11 18:04   ` Jean Louis
  2020-11-12 12:56     ` Michael Heerdegen
  2020-11-12 14:49     ` Yuri Khan
  0 siblings, 2 replies; 6+ messages in thread
From: Jean Louis @ 2020-11-11 18:04 UTC (permalink / raw)
  To: Michael Heerdegen; +Cc: help-gnu-emacs

* Michael Heerdegen <michael_heerdegen@web.de> [2020-11-11 18:29]:
> Jean Louis <bugs@gnu.support> writes:
> 
> > What is the standard built-in way to fetch the http[s] URL?
> 
> `url-retrieve' sounds appropriate.

I am trying like this:

(defun wrs-fetch-title (url)
  (url-retrieve url #'wrs-get-title (list url)))

(defun wrs-get-title (status url)
  (message-any status))

(wrs-fetch-title "http://localhost")

At least I get status nil, but I do not know how to get the HTML
text. Yes, I am looking into eww but it is not enlightening.

> If I understand what you want correctly, eww seems to get the title with
> `eww-tag-title'

That somehow sounds easier to do. Getting the HTML, or any text, is the
first priority.

That will help Hyperscope to automatically update WWW links with
their titles, provided that the content type is HTML.





* Re: To fetch URL, extract <title> element?
  2020-11-11 18:04   ` Jean Louis
@ 2020-11-12 12:56     ` Michael Heerdegen
  2020-11-12 13:20       ` Jean Louis
  2020-11-12 14:49     ` Yuri Khan
  1 sibling, 1 reply; 6+ messages in thread
From: Michael Heerdegen @ 2020-11-12 12:56 UTC (permalink / raw)
  To: Jean Louis; +Cc: help-gnu-emacs

Jean Louis <bugs@gnu.support> writes:

> > If I understand what you want correctly, eww seems to get the title with
> > `eww-tag-title'
>
> That somehow sounds easier to do. To get HTML or any text is first
> priority.

I had also only looked at the eww code.  Maybe Lars wants to help more.

> That will help in Hyperscope to automatically update WWW links with
> their titles provided that content-type is HTML.

I'm curious: what exactly are you doing?  (I don't know Hyperscope, but
I see that it's easy to find information about it on the Internet.)


Regards,

Michael.




* Re: To fetch URL, extract <title> element?
  2020-11-12 12:56     ` Michael Heerdegen
@ 2020-11-12 13:20       ` Jean Louis
  0 siblings, 0 replies; 6+ messages in thread
From: Jean Louis @ 2020-11-12 13:20 UTC (permalink / raw)
  To: Michael Heerdegen; +Cc: help-gnu-emacs

* Michael Heerdegen <michael_heerdegen@web.de> [2020-11-12 15:57]:
> Jean Louis <bugs@gnu.support> writes:
> 
> > > If I understand what you want correctly, eww seems to get the title with
> > > `eww-tag-title'
> >
> > That somehow sounds easier to do. To get HTML or any text is first
> > priority.
> 
> I also only had looked at the eww code.  Maybe Lars wants to help
> more.

Some hyperlinks are captured by copy from any browser and inserted
into Emacs.

- As such, they do not have a title or annotation, but they need
  one. The title has to be fetched automatically. That is an expensive
  process; I would like to fetch only the headers.

- some WWW links expire, their status has to be updated from time to
  time

- then it becomes possible for user to mark hyperlinks and update
  titles for all of them
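
A HEAD request is one way to check a link's status and Content-Type
without downloading the body (the <title> itself still needs a GET,
since it lives in the body). A minimal sketch, assuming the built-in
url library; `my/response-head' and `my/url-head' are hypothetical
names, and the 10 second timeout is an arbitrary choice.

```elisp
(require 'url)
(require 'subr-x)   ; for `string-trim' on older Emacs versions

(defun my/response-head ()
  "Return (STATUS . CONTENT-TYPE) for the HTTP response in the current buffer.
STATUS is a number or nil; CONTENT-TYPE is a lowercase string or nil."
  (save-excursion
    (goto-char (point-min))
    (let ((status (and (looking-at "HTTP/[0-9.]+ +\\([0-9]+\\)")
                       (string-to-number (match-string 1))))
          ;; The header block ends at the first empty line.
          (end (save-excursion
                 (or (re-search-forward "^\r?$" nil t) (point-max))))
          (case-fold-search t))
      (cons status
            (and (re-search-forward "^content-type:[ \t]*\\([^;\r\n]+\\)" end t)
                 (downcase (string-trim (match-string 1))))))))

(defun my/url-head (url)
  "Send a HEAD request to URL and return (STATUS . CONTENT-TYPE), or nil."
  (let* ((url-request-method "HEAD")
         (buffer (url-retrieve-synchronously url t nil 10)))
    (when buffer
      (unwind-protect
          (with-current-buffer buffer (my/response-head))
        (kill-buffer buffer)))))
```

A link whose status is no longer 200 could then be flagged as expired,
and only links reporting an HTML content type need a full fetch.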

I do not know how to use url-retrieve, but I found out how to use it
synchronously, and for now this works, if not elegantly.

(defun hyperscope-url-to-string (url)
  "Fetch URL and return the response, headers included, as a string."
  ;; Fetch only once; the returned buffer holds headers and body.
  (let ((buffer (url-retrieve-synchronously url)))
    (with-current-buffer buffer
      (buffer-string))))

(defun hyperscope-fetch-title (url)
  "Return the title for URL, or URL itself if no <title> is found."
  (let ((string (hyperscope-url-to-string url)))
    (if (string-match "<title>\\(.*\\)</title>" string)
        ;; Group 1 is the text between the tags.
        (match-string 1 string)
      url)))

(defun hyperscope-fetch-title-for-url (id)
  (let* ((url (hlinks-link id))
	 (title-or-url (hyperscope-fetch-title url)))
    (hlink-update-name-1 title-or-url id)))

(defun hyperscope-update-url-title ()
  (interactive)
  (let ((id (tabulated-list-get-id)))
    (hyperscope-fetch-title-for-url id)))

> > That will help in Hyperscope to automatically update WWW links with
> > their titles provided that content-type is HTML.
> 
> I'm curious: what exactly are you doing?  (I don't know Hyperscope but
> see that it's easy to find infos about it in the Internet.)

It is a DKR, or Dynamic Knowledge Repository:
https://www.dougengelbart.org/content/view/190/163/
https://en.wikipedia.org/wiki/Dynamic_knowledge_repository

Hyperscope is a browsing tool that enables most of the viewing and
navigating features called for in Doug Engelbart's open hyperdocument
system framework (OHS) to support dynamic knowledge repositories
(DKRs) and rising Collective IQ.
https://www.dougengelbart.org/content/view/154/86/

This HyperScope for Emacs is similar to it. It may grow into a large
index, or it can be used only for bookmarking simple stuff. It is a
collection of hyperlinks to anything. Like the Emacs bookmarking
system, it can hyperlink to any file, and to a place in a file by
search or by line number. It is not stored as text, as it is
database-backed.

The emacs-libpq dynamic module for the PostgreSQL database is coming
soon to GNU ELPA. When it arrives, maybe I will get some productive
version out as well.

As a result it gives collective IQ, or easier access to pieces of
information that a group may need to accelerate its efficiency.


* Re: To fetch URL, extract <title> element?
  2020-11-11 18:04   ` Jean Louis
  2020-11-12 12:56     ` Michael Heerdegen
@ 2020-11-12 14:49     ` Yuri Khan
  1 sibling, 0 replies; 6+ messages in thread
From: Yuri Khan @ 2020-11-12 14:49 UTC (permalink / raw)
  To: Jean Louis; +Cc: Michael Heerdegen, help-gnu-emacs

On Thu, 12 Nov 2020 at 01:05, Jean Louis <bugs@gnu.support> wrote:
> > > What is the standard built-in way to fetch the http[s] URL?
> >
> > `url-retrieve' sounds appropriate.
>
> I am trying like this:
>
> (defun wrs-fetch-title (url)
>   (url-retrieve url #'wrs-get-title (list url)))
>
> (defun wrs-get-title (status url)
>   (message-any status))
>
> (wrs-fetch-title "http://localhost")
>
> At least I get status nil, but I do not know how to get the HTML
> text.

Have you read the docstring of ‘url-retrieve’?

    CALLBACK is called when the object has been completely retrieved, with
    the current buffer containing the object, and any MIME headers associated
    with it.[…]

So probably:

(defun wrs-fetch-title (url)
  (url-retrieve url #'wrs-get-title (list url)))

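(defun wrs-get-title (status url)
  (goto-char (point-min))
  (search-forward "\n\n")  ; skip HTTP headers
  ;; In Emacs regexps `\s' introduces a syntax class, so whitespace is
  ;; matched here with an explicit character class.
  (if (search-forward-regexp "<title\\(?:[ \t\r\n]+[^>]*\\)?>\\([^<]*\\)</title>"
                             nil 'noerror)
      (message "URL: %s Title: %s" url (match-string 1))))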

(wrs-fetch-title "https://gnu.org/")
⇒ URL: https://gnu.org/ Title: The GNU Operating System and the Free
Software Movement


(For demonstration purposes, I’m overlooking error handling and MIME
type checks. In a real program, you ought to first make sure you got a
successful status, then check that the response you got has a
‘Content-Type’ of either ‘text/html’ or ‘application/xhtml+xml’ (with
possible parameters such as ‘charset’), and only then look for
HTML-specific <title>…</title> tags.)
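
The checks described above can be sketched as follows. This is only
an illustration under assumptions: `my/buffer-html-title' and
`my/show-title' are hypothetical names, the header parsing is done by
hand with regexps, and errors are taken from the :error entry of the
STATUS plist that `url-retrieve' passes to its callback.

```elisp
(require 'url)
(require 'subr-x)   ; for `string-trim' on older Emacs versions

(defun my/buffer-html-title ()
  "Return the <title> of the HTTP response in the current buffer, or nil.
Return nil unless Content-Type is text/html or application/xhtml+xml."
  (save-excursion
    (goto-char (point-min))
    (let ((case-fold-search t)
          ;; Position just after the blank line that ends the headers.
          (headers-end (and (re-search-forward "^\r?\n" nil t) (point))))
      (when headers-end
        (goto-char (point-min))
        (when (re-search-forward
               "^content-type:[ \t]*\\(text/html\\|application/xhtml\\+xml\\)\\b"
               headers-end t)
          (goto-char headers-end)
          (when (re-search-forward
                 "<title\\(?:[ \t\r\n][^>]*\\)?>\\([^<]*\\)</title>" nil t)
            (string-trim (match-string 1))))))))

(defun my/show-title (status url)
  "`url-retrieve' callback: report the title of URL, or the error."
  (let ((err (plist-get status :error)))
    (if err
        (message "URL: %s failed: %S" url err)
      (message "URL: %s Title: %s" url
               (or (my/buffer-html-title) "(none)"))))
  ;; The response buffer is no longer needed once the title is reported.
  (kill-buffer (current-buffer)))

;; Usage (asynchronous; needs network access):
;; (url-retrieve "https://gnu.org/" #'my/show-title (list "https://gnu.org/"))
```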



