unofficial mirror of help-gnu-emacs@gnu.org
 help / color / mirror / Atom feed
* url-retrieve and encoding
@ 2024-02-10 19:31 tomas
  2024-02-10 19:41 ` Eli Zaretskii
  2024-02-10 20:51 ` Tim Landscheidt
  0 siblings, 2 replies; 8+ messages in thread
From: tomas @ 2024-02-10 19:31 UTC (permalink / raw)
  To: help-gnu-emacs

[-- Attachment #1: Type: text/plain, Size: 1414 bytes --]

Hello, Emacs experts

I'm trying to fetch a Web resource via https with Emacs.

IIUC, url-retrieve (and its synchronous friend) are the tools for
the job. They work nicely, but they leave me with a unibyte buffer
(confusingly, the line endings are just linefeeds: from the HTTP
specs I'd expected "\r\n").

Is there a canonical way to "make the buffer be UTF-8"? (Yes, I know:
you only know the encoding once the "Content-Type" header line
arrives, and at that point you have already read a bunch of bytes,
but the header is supposed to be ASCII anyway.)

What I've come up with is to take the buffer-substring from after
the first empty line to the end, do a "string-as-multibyte" on
that, and insert the result into a fresh buffer. But that feels
a bit... gross:

(I've chosen a Greek Wiktionary page because the results are more
visible):

    (defun fetch-one ()
      (let ((stuff ""))
        (with-current-buffer
            (url-retrieve-synchronously
              "https://el.wiktionary.org/wiki/μιλώντας")
          (goto-char (point-min))
          (re-search-forward "^\r?$")
          (forward-line)
          (setq stuff (buffer-substring (point) (point-max))))
        (pop-to-buffer
         (get-buffer-create "*results*"))
        (erase-buffer)
        (insert (string-as-multibyte stuff))))

What is the "right way" to do this?

Thanks for any ideas
-- 
tomás

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: url-retrieve and encoding
  2024-02-10 19:31 url-retrieve and encoding tomas
@ 2024-02-10 19:41 ` Eli Zaretskii
  2024-02-10 19:49   ` tomas
  2024-02-10 20:51 ` Tim Landscheidt
  1 sibling, 1 reply; 8+ messages in thread
From: Eli Zaretskii @ 2024-02-10 19:41 UTC (permalink / raw)
  To: help-gnu-emacs

> Date: Sat, 10 Feb 2024 20:31:02 +0100
> From: <tomas@tuxteam.de>
> 
> IIUC, url-retrieve (and its synchronous friend) are the tools for
> the job. They work nicely, but they leave me with a unibyte buffer
> (confusingly, the line endings are just linefeeds: from the HTTP
> specs I'd expected "\r\n")

(Maybe your Lisp program decoded the EOLs?)

> Is there a canonical way to "make the buffer be UTF-8?

Yes: decode-coding-region.

> What I've come up is to take the buffer-substring starting from
> after the first empty line to the end, do a "string-as-multibyte"
> with that and insert that into a fresh buffer. But that feels
> a bit... gross:

Indeed.  Why didn't you try decoding to begin with?
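[Editor's sketch of the approach suggested above: decode the response
body instead of reinterpreting its bytes. This assumes the body is
UTF-8; a robust version would first parse the charset out of the
"Content-Type" header. The URL is the one from the original post.]

    (defun fetch-one-decoded ()
      (let (body)
        (with-current-buffer
            (url-retrieve-synchronously
             "https://el.wiktionary.org/wiki/μιλώντας")
          (goto-char (point-min))
          ;; The HTTP headers end at the first empty line.
          (re-search-forward "^\r?$")
          (forward-line)
          ;; Decode the raw bytes of the body as UTF-8, yielding a
          ;; proper multibyte string.
          (setq body (decode-coding-string
                      (buffer-substring-no-properties (point) (point-max))
                      'utf-8)))
        (pop-to-buffer (get-buffer-create "*results*"))
        (erase-buffer)
        (insert body)))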



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: url-retrieve and encoding
  2024-02-10 19:41 ` Eli Zaretskii
@ 2024-02-10 19:49   ` tomas
  2024-02-11 17:49     ` tomas
  0 siblings, 1 reply; 8+ messages in thread
From: tomas @ 2024-02-10 19:49 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: help-gnu-emacs

[-- Attachment #1: Type: text/plain, Size: 1031 bytes --]

On Sat, Feb 10, 2024 at 09:41:15PM +0200, Eli Zaretskii wrote:
> > Date: Sat, 10 Feb 2024 20:31:02 +0100
> > From: <tomas@tuxteam.de>
> > 
> > IIUC, url-retrieve (and its synchronous friend) are the tools for
> > the job. They work nicely, but they leave me with a unibyte buffer
> > (confusingly, the line endings are just linefeeds: from the HTTP
> > specs I'd expected "\r\n")
> 
> (Maybe your Lisp program decoded the EOLs?)

Thanks, Eli

I posted "my" Lisp program in full (it is basically just
`url-retrieve-synchronously'). But perhaps the source has no
CRs? At least, "curl -I" shows some. Next mystery :)

> > Is there a canonical way to "make the buffer be UTF-8?
> 
> Yes: decode-coding-region.

Ahhh -- thanks a bunch for this one! How could I have missed it.

> > (...) But that feels
> > a bit... gross:
> 
> Indeed.  Why didn't you try decoding to begin with?

Lack of knowledge it seems. My doc-fu let me down. Thanks
for setting me off in the right direction.

Cheers
-- 
t

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: url-retrieve and encoding
  2024-02-10 19:31 url-retrieve and encoding tomas
  2024-02-10 19:41 ` Eli Zaretskii
@ 2024-02-10 20:51 ` Tim Landscheidt
  2024-02-11  6:30   ` tomas
  1 sibling, 1 reply; 8+ messages in thread
From: Tim Landscheidt @ 2024-02-10 20:51 UTC (permalink / raw)
  To: tomas; +Cc: help-gnu-emacs

tomas@tuxteam.de wrote:

> I'm trying to fetch a Web resource via https with Emacs.

> IIUC, url-retrieve (and its synchronous friend) are the tools for
> the job. They work nicely, but they leave me with a unibyte buffer
> (confusingly, the line endings are just linefeeds: from the HTTP
> specs I'd expected "\r\n")

> […]

I don't know if it is the correct way, but I have been using
(the undocumented) url-insert-file-contents for a while now
and it has worked for me very well.  I have not tested any
edge cases, though.
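
[Editor's sketch of the suggestion above, assuming the same Wiktionary
URL as in the original post. `url-insert-file-contents' (from
url-handlers.el) fetches a URL and inserts the body, already decoded
and with the HTTP headers stripped, into the current buffer.]

    (require 'url-handlers)  ; provides `url-insert-file-contents'

    (with-current-buffer (get-buffer-create "*results*")
      (erase-buffer)
      ;; Inserts the decoded response body at point.
      (url-insert-file-contents
       "https://el.wiktionary.org/wiki/μιλώντας")
      (pop-to-buffer (current-buffer)))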

Tim



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: url-retrieve and encoding
  2024-02-10 20:51 ` Tim Landscheidt
@ 2024-02-11  6:30   ` tomas
  0 siblings, 0 replies; 8+ messages in thread
From: tomas @ 2024-02-11  6:30 UTC (permalink / raw)
  To: Tim Landscheidt; +Cc: help-gnu-emacs

[-- Attachment #1: Type: text/plain, Size: 710 bytes --]

On Sat, Feb 10, 2024 at 08:51:54PM +0000, Tim Landscheidt wrote:
> tomas@tuxteam.de wrote:
> 
> > I'm trying to fetch a Web resource via https with Emacs.
> 
> > IIUC, url-retrieve (and its synchronous friend) are the tools for
> > the job. They work nicely, but they leave me with a unibyte buffer
> > (confusingly, the line endings are just linefeeds: from the HTTP
> > specs I'd expected "\r\n")
> 
> > […]
> 
> I don't know if it is the correct way, but I have been using
> (the undocumented) url-insert-file-contents for a while now
> and it has worked for me very well.  I have not tested any
> edge cases, though.

Thanks, Tim for the hint. I'll give it a spin :-)

Cheers
-- 
t

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: url-retrieve and encoding
  2024-02-10 19:49   ` tomas
@ 2024-02-11 17:49     ` tomas
  2024-02-11 19:21       ` Eli Zaretskii
  0 siblings, 1 reply; 8+ messages in thread
From: tomas @ 2024-02-11 17:49 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: help-gnu-emacs

[-- Attachment #1: Type: text/plain, Size: 1024 bytes --]

On Sat, Feb 10, 2024 at 08:49:08PM +0100, tomas@tuxteam.de wrote:
> On Sat, Feb 10, 2024 at 09:41:15PM +0200, Eli Zaretskii wrote:

[...]

> > Yes: decode-coding-region.
> 
> Ahhh -- thanks a bunch for this one! How could I have missed it.
> 
> > > (...) But that feels
> > > a bit... gross:
> > 
> > Indeed.  Why didn't you try decoding to begin with?

OK, now I can answer this question more precisely: actually, I'd
been there already and was confused that the function did... nothing.

Now at least I know why: the buffer is unibyte. Its content /is/
UTF-8. So if I let the last argument default to nil (i.e. decode
in place), it replaces the region with its UTF-8 byte sequences --
an identity operation (unless there is erroneous UTF-8 around). If
I give it a multibyte buffer as the last argument, things look much
better.
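
[Editor's sketch of the variant described above: decoding out of the
unibyte response buffer into a fresh multibyte buffer via the
DESTINATION argument of `decode-coding-region'. The URL is a
placeholder.]

    (let ((src (url-retrieve-synchronously "https://example.org/"))
          (dst (get-buffer-create "*decoded*")))
      (with-current-buffer dst (erase-buffer))
      (with-current-buffer src
        (goto-char (point-min))
        ;; Skip past the HTTP headers.
        (re-search-forward "^\r?$")
        (forward-line)
        ;; Decode the body as UTF-8 into the multibyte buffer DST.
        (decode-coding-region (point) (point-max) 'utf-8 dst))
      (pop-to-buffer dst))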

But then... I can do things "in buffer" by simply invoking
(toggle-enable-multibyte-characters t). At least, it seems to
work. But... is it a good idea?

Cheers & thanks
-- 
tomás

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: url-retrieve and encoding
  2024-02-11 17:49     ` tomas
@ 2024-02-11 19:21       ` Eli Zaretskii
  2024-02-12  5:30         ` tomas
  0 siblings, 1 reply; 8+ messages in thread
From: Eli Zaretskii @ 2024-02-11 19:21 UTC (permalink / raw)
  To: help-gnu-emacs

> Date: Sun, 11 Feb 2024 18:49:25 +0100
> From: tomas@tuxteam.de
> Cc: help-gnu-emacs@gnu.org
> 
> > > Yes: decode-coding-region.
> > 
> > Ahhh -- thanks a bunch for this one! How could I have missed it.
> > 
> > > > (...) But that feels
> > > > a bit... gross:
> > > 
> > > Indeed.  Why didn't you try decoding to begin with?
> 
> OK, now I can answer this question more precisely: actually, I'd
> been there already and was confused that the function did... nothing.
> 
> Now at least I know why: the buffer is unibyte.

The solution is (quite obviously) not to do that in-place.

Alternatively, you could make the buffer multibyte in advance, but
that's tricky, so I don't recommend that.

> Its content /is/ utf-8.

That's not really 100% accurate, although it's close.  If the unibyte
buffer includes byte sequences that are not valid UTF-8, decoding does
change the byte stream in those places.

> But then... I can do things "in buffer" by simply invoking
> (toggle-enable-multibyte-characters t). At least, it seems to
> work. But... is it a good idea?

No.  Always call the decode function, never play with
multi-uni-byteness, because the latter will eventually surprise (or
bite) you.



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: url-retrieve and encoding
  2024-02-11 19:21       ` Eli Zaretskii
@ 2024-02-12  5:30         ` tomas
  0 siblings, 0 replies; 8+ messages in thread
From: tomas @ 2024-02-12  5:30 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: help-gnu-emacs

[-- Attachment #1: Type: text/plain, Size: 1827 bytes --]

On Sun, Feb 11, 2024 at 09:21:39PM +0200, Eli Zaretskii wrote:
> > Date: Sun, 11 Feb 2024 18:49:25 +0100
> > From: tomas@tuxteam.de
> > Cc: help-gnu-emacs@gnu.org
> > 
> > > > Yes: decode-coding-region.
> > > 
> > > Ahhh -- thanks a bunch for this one! How could I have missed it.
> > > 
> > > > > (...) But that feels
> > > > > a bit... gross:
> > > > 
> > > > Indeed.  Why didn't you try decoding to begin with?
> > 
> > OK, now I can answer this question more precisely: actually, I'd
> > been there already and was confused that the function did... nothing.
> > 
> > Now at least I know why: the buffer is unibyte.
> 
> The solution is (quite obviously) not to do that in-place.

I guessed so, thanks for the clarification.

> Alternatively, you could make the buffer multibyte in advance, but
> that's tricky, so I don't recommend that.

If url-retrieve had a "callback interface", as processes have with
their filters, then one could arrange for the decoding to happen
there. Actually, that's what's going on in the background, I guess.

> > Its content /is/ utf-8.
> 
> That's not really 100% accurate, although it's close.  If the unibyte
> buffer includes byte sequences that are not valid UTF-8, decoding does
> change the byte stream in those places.

Of course, you are right. The HTTP headers /state/ it to be utf-8. It's
like trusting the label on the bottle :-)

> > But then... I can do things "in buffer" by simply invoking
> > (toggle-enable-multibyte-characters t). At least, it seems to
> > work. But... is it a good idea?
> 
> No.  Always call the decode function, never play with
> multi-uni-byteness, because the latter will eventually surprise (or
> bite) you.

I guessed so. Thanks for your patience (and for helping me learn).

Cheers
-- 
t

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2024-02-12  5:30 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-02-10 19:31 url-retrieve and encoding tomas
2024-02-10 19:41 ` Eli Zaretskii
2024-02-10 19:49   ` tomas
2024-02-11 17:49     ` tomas
2024-02-11 19:21       ` Eli Zaretskii
2024-02-12  5:30         ` tomas
2024-02-10 20:51 ` Tim Landscheidt
2024-02-11  6:30   ` tomas

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).