* url-retrieve and encoding
@ 2024-02-10 19:31 tomas
2024-02-10 19:41 ` Eli Zaretskii
2024-02-10 20:51 ` Tim Landscheidt
0 siblings, 2 replies; 8+ messages in thread
From: tomas @ 2024-02-10 19:31 UTC (permalink / raw)
To: help-gnu-emacs
[-- Attachment #1: Type: text/plain, Size: 1414 bytes --]
Hello, Emacs experts
I'm trying to fetch a Web resource via https with Emacs.
IIUC, url-retrieve (and its synchronous friend) are the tools for
the job. They work nicely, but they leave me with a unibyte buffer
(confusingly, the line endings are just linefeeds: from the HTTP
specs I'd expected "\r\n").
Is there a canonical way to make the buffer UTF-8? (Yes, I know:
you only know the encoding once the "Content-Type" header line
arrives, and at that point you have already read a bunch of bytes,
but the header is supposed to be ASCII anyway.)
What I've come up with is to take the buffer-substring starting
from after the first empty line to the end, do a
"string-as-multibyte" with that and insert it into a fresh buffer.
But that feels a bit... gross:
(I've chosen a Greek wiktionary page because the results are more
visible):
(defun fetch-one ()
  (let ((stuff ""))
    (with-current-buffer
        (url-retrieve-synchronously
         "https://el.wiktionary.org/wiki/μιλώντας")
      (goto-char (point-min))
      (re-search-forward "^\r?$")
      (forward-line)
      (setq stuff (buffer-substring (point) (point-max))))
    (pop-to-buffer
     (get-buffer-create "*results*"))
    (erase-buffer)
    (insert (string-as-multibyte stuff))))
What is the "right way" to do this?
Thanks for any ideas
--
tomás
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 195 bytes --]
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: url-retrieve and encoding
2024-02-10 19:31 url-retrieve and encoding tomas
@ 2024-02-10 19:41 ` Eli Zaretskii
2024-02-10 19:49 ` tomas
2024-02-10 20:51 ` Tim Landscheidt
1 sibling, 1 reply; 8+ messages in thread
From: Eli Zaretskii @ 2024-02-10 19:41 UTC (permalink / raw)
To: help-gnu-emacs
> Date: Sat, 10 Feb 2024 20:31:02 +0100
> From: <tomas@tuxteam.de>
>
> IIUC, url-retrieve (and its synchronous friend) are the tools for
> the job. They work nicely, but they leave me with a unibyte buffer
> (confusingly, the line endings are just linefeeds: from the HTTP
> specs I'd expected "\r\n")
(Maybe your Lisp program decoded the EOLs?)
> Is there a canonical way to make the buffer UTF-8?
Yes: decode-coding-region.
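In your function that could be as little as replacing the
string-as-multibyte call; untested sketch (decode-coding-string is
the string counterpart of decode-coding-region, and I'm assuming the
page really is UTF-8):

```elisp
;; Instead of (insert (string-as-multibyte stuff)), decode the raw
;; body bytes explicitly before inserting them:
(insert (decode-coding-string stuff 'utf-8))
```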
> What I've come up with is to take the buffer-substring starting from
> after the first empty line to the end, do a "string-as-multibyte"
> with that and insert that into a fresh buffer. But that feels
> a bit... gross:
Indeed. Why didn't you try decoding to begin with?
* Re: url-retrieve and encoding
2024-02-10 19:41 ` Eli Zaretskii
@ 2024-02-10 19:49 ` tomas
2024-02-11 17:49 ` tomas
0 siblings, 1 reply; 8+ messages in thread
From: tomas @ 2024-02-10 19:49 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: help-gnu-emacs
[-- Attachment #1: Type: text/plain, Size: 1031 bytes --]
On Sat, Feb 10, 2024 at 09:41:15PM +0200, Eli Zaretskii wrote:
> > Date: Sat, 10 Feb 2024 20:31:02 +0100
> > From: <tomas@tuxteam.de>
> >
> > IIUC, url-retrieve (and its synchronous friend) are the tools for
> > the job. They work nicely, but they leave me with a unibyte buffer
> > (confusingly, the line endings are just linefeeds: from the HTTP
> > specs I'd expected "\r\n")
>
> (Maybe your Lisp program decoded the EOLs?)
Thanks, Eli
I posted "my" Lisp program in whole (it is basically just
`url-retrieve-synchronously'). But perhaps the source has no
CRs? At least, "curl -I" shows some. Next mystery :)
> > Is there a canonical way to make the buffer UTF-8?
>
> Yes: decode-coding-region.
Ahhh -- thanks a bunch for this one! How could I have missed it.
> > (...) But that feels
> > a bit... gross:
>
> Indeed. Why didn't you try decoding to begin with?
Lack of knowledge, it seems. My doc-fu let me down. Thanks
for setting me off in the right direction.
Cheers
--
t
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 195 bytes --]
* Re: url-retrieve and encoding
2024-02-10 19:31 url-retrieve and encoding tomas
2024-02-10 19:41 ` Eli Zaretskii
@ 2024-02-10 20:51 ` Tim Landscheidt
2024-02-11 6:30 ` tomas
1 sibling, 1 reply; 8+ messages in thread
From: Tim Landscheidt @ 2024-02-10 20:51 UTC (permalink / raw)
To: tomas; +Cc: help-gnu-emacs
tomas@tuxteam.de wrote:
> I'm trying to fetch a Web resource via https with Emacs.
> IIUC, url-retrieve (and its synchronous friend) are the tools for
> the job. They work nicely, but they leave me with a unibyte buffer
> (confusingly, the line endings are just linefeeds: from the HTTP
> specs I'd expected "\r\n")
> […]
I don't know if it is the correct way, but I have been using
(the undocumented) url-insert-file-contents for a while now
and it has worked for me very well. I have not tested any
edge cases, though.
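Roughly like this (untested sketch; the buffer name is just for
illustration):

```elisp
;; url-insert-file-contents fetches the URL and inserts the body --
;; headers stripped, charset from the response honored -- into the
;; current buffer.
(require 'url-handlers)
(with-current-buffer (get-buffer-create "*results*")
  (erase-buffer)
  (url-insert-file-contents
   "https://el.wiktionary.org/wiki/μιλώντας"))
```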
Tim
* Re: url-retrieve and encoding
2024-02-10 20:51 ` Tim Landscheidt
@ 2024-02-11 6:30 ` tomas
0 siblings, 0 replies; 8+ messages in thread
From: tomas @ 2024-02-11 6:30 UTC (permalink / raw)
To: Tim Landscheidt; +Cc: help-gnu-emacs
[-- Attachment #1: Type: text/plain, Size: 710 bytes --]
On Sat, Feb 10, 2024 at 08:51:54PM +0000, Tim Landscheidt wrote:
> tomas@tuxteam.de wrote:
>
> > I'm trying to fetch a Web resource via https with Emacs.
>
> > IIUC, url-retrieve (and its synchronous friend) are the tools for
> > the job. They work nicely, but they leave me with a unibyte buffer
> > (confusingly, the line endings are just linefeeds: from the HTTP
> > specs I'd expected "\r\n")
>
> > […]
>
> I don't know if it is the correct way, but I have been using
> (the undocumented) url-insert-file-contents for a while now
> and it has worked for me very well. I have not tested any
> edge cases, though.
Thanks, Tim, for the hint. I'll give it a spin :-)
Cheers
--
t
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 195 bytes --]
* Re: url-retrieve and encoding
2024-02-10 19:49 ` tomas
@ 2024-02-11 17:49 ` tomas
2024-02-11 19:21 ` Eli Zaretskii
0 siblings, 1 reply; 8+ messages in thread
From: tomas @ 2024-02-11 17:49 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: help-gnu-emacs
[-- Attachment #1: Type: text/plain, Size: 1024 bytes --]
On Sat, Feb 10, 2024 at 08:49:08PM +0100, tomas@tuxteam.de wrote:
> On Sat, Feb 10, 2024 at 09:41:15PM +0200, Eli Zaretskii wrote:
[...]
> > Yes: decode-coding-region.
>
> Ahhh -- thanks a bunch for this one! How could I have missed it.
>
> > > (...) But that feels
> > > a bit... gross:
> >
> > Indeed. Why didn't you try decoding to begin with?
OK, now I can answer this question more precisely: actually, I'd
been there already and was confused that the function did... nothing.
Now at least I know why: the buffer is unibyte. Its content /is/
UTF-8. So if I set the last argument to nil (i.e. decode in place),
it replaces the region with its UTF-8 byte sequences -- an identity
operation (unless there is erroneous UTF-8 around). If I give it
a multibyte buffer as the last argument, things look much better.
But then... I can do things "in buffer" by simply invoking
(toggle-enable-multibyte-characters t). At least, it seems to
work. But... is it a good idea?
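For concreteness, the destination-buffer variant that does seem to
work (sketch; *results* is just an illustrative name):

```elisp
;; Decode from the unibyte response buffer into a separate multibyte
;; buffer via decode-coding-region's DESTINATION argument.
(let ((results (get-buffer-create "*results*")))
  (with-current-buffer
      (url-retrieve-synchronously
       "https://el.wiktionary.org/wiki/μιλώντας")
    (goto-char (point-min))
    (re-search-forward "^\r?$")   ; end of the HTTP headers
    (forward-line)
    (decode-coding-region (point) (point-max) 'utf-8 results))
  (pop-to-buffer results))
```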
Cheers & thanks
--
tomás
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 195 bytes --]
* Re: url-retrieve and encoding
2024-02-11 17:49 ` tomas
@ 2024-02-11 19:21 ` Eli Zaretskii
2024-02-12 5:30 ` tomas
0 siblings, 1 reply; 8+ messages in thread
From: Eli Zaretskii @ 2024-02-11 19:21 UTC (permalink / raw)
To: help-gnu-emacs
> Date: Sun, 11 Feb 2024 18:49:25 +0100
> From: tomas@tuxteam.de
> Cc: help-gnu-emacs@gnu.org
>
> > > Yes: decode-coding-region.
> >
> > Ahhh -- thanks a bunch for this one! How could I have missed it.
> >
> > > > (...) But that feels
> > > > a bit... gross:
> > >
> > > Indeed. Why didn't you try decoding to begin with?
>
> OK, now I can answer this question more precisely: actually, I'd
> been there already and was confused that the function did... nothing.
>
> Now at least I know why: the buffer is unibyte.
The solution is (quite obviously) not to do that in-place.
Alternatively, you could make the buffer multibyte in advance, but
that's tricky, so I don't recommend that.
> Its content /is/ utf-8.
That's not really 100% accurate, although it's close. If the unibyte
buffer includes byte sequences that are not valid UTF-8, decoding does
change the byte stream in those places.
> But then... I can do things "in buffer" by simply invoking
> (toggle-enable-multibyte-characters t). At least, it seems to
> work. But... is it a good idea?
No. Always call the decode function, never play with
multi-uni-byteness, because the latter will eventually surprise (or
bite) you.
* Re: url-retrieve and encoding
2024-02-11 19:21 ` Eli Zaretskii
@ 2024-02-12 5:30 ` tomas
0 siblings, 0 replies; 8+ messages in thread
From: tomas @ 2024-02-12 5:30 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: help-gnu-emacs
[-- Attachment #1: Type: text/plain, Size: 1827 bytes --]
On Sun, Feb 11, 2024 at 09:21:39PM +0200, Eli Zaretskii wrote:
> > Date: Sun, 11 Feb 2024 18:49:25 +0100
> > From: tomas@tuxteam.de
> > Cc: help-gnu-emacs@gnu.org
> >
> > > > Yes: decode-coding-region.
> > >
> > > Ahhh -- thanks a bunch for this one! How could I have missed it.
> > >
> > > > > (...) But that feels
> > > > > a bit... gross:
> > > >
> > > > Indeed. Why didn't you try decoding to begin with?
> >
> > OK, now I can answer this question more precisely: actually, I'd
> > been there already and was confused that the function did... nothing.
> >
> > Now at least I know why: the buffer is unibyte.
>
> The solution is (quite obviously) not to do that in-place.
I guessed so, thanks for the clarification.
> Alternatively, you could make the buffer multibyte in advance, but
> that's tricky, so I don't recommend that.
If url-retrieve had a "callback interface", as processes have with
their filters, then one could arrange for the decoding to happen
there. Actually, that's what's going on in the background, I guess.
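For what it's worth, url-retrieve does take a CALLBACK argument
(called once in the response buffer when the whole response has
arrived, not per chunk like a process filter), so the decoding can
go there; untested sketch:

```elisp
;; Asynchronous variant: url-retrieve calls CALLBACK in the unibyte
;; response buffer once retrieval finishes; decode at that point.
(url-retrieve
 "https://el.wiktionary.org/wiki/μιλώντας"
 (lambda (_status)
   (goto-char (point-min))
   (re-search-forward "^\r?$")   ; end of the HTTP headers
   (forward-line)
   (let ((results (get-buffer-create "*results*")))
     (decode-coding-region (point) (point-max) 'utf-8 results)
     (pop-to-buffer results))))
```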
> > Its content /is/ utf-8.
>
> That's not really 100% accurate, although it's close. If the unibyte
> buffer includes byte sequences that are not valid UTF-8, decoding does
> change the byte stream in those places.
Of course, you are right. The HTTP headers /state/ it to be utf-8. It's
like trusting the label on the bottle :-)
> > But then... I can do things "in buffer" by simply invoking
> > (toggle-enable-multibyte-characters t). At least, it seems to
> > work. But... is it a good idea?
>
> No. Always call the decode function, never play with
> multi-uni-byteness, because the latter will eventually surprise (or
> bite) you.
I guessed so. Thanks for your patience (and for helping me learn).
Cheers
--
t
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 195 bytes --]