* Decoding URLs input
From: Jean Louis @ 2021-07-03  9:40 UTC
To: Help GNU Emacs

Hello,

As I am developing a Double Opt-In CGI script served by Emacs, I am
unsure whether this function is the correct one to use for the encoded
strings that come from URL GET requests, like
http://www.example.com/?message=Hello%20There

  (rfc2231-decode-encoded-string "Hello%20there") ⇒ "Hello there"

If anybody knows or has clues, let me know. In other programming
languages I have not had to think about RFCs, and I don't know which
RFC applies here.

Jean

Take action in Free Software Foundation campaigns:
https://www.fsf.org/campaigns

In support of Richard M. Stallman
https://stallmansupport.org/
* Re: Decoding URLs input
From: Jean Louis @ 2021-07-03  9:56 UTC
To: Jean Louis; +Cc: Help GNU Emacs

Is it maybe (url-unhex-string query-string)? I have started using that
function, but I am unsure.

-- 
Jean

Take action in Free Software Foundation campaigns:
https://www.fsf.org/campaigns

In support of Richard M. Stallman
https://stallmansupport.org/
* Re: Decoding URLs input
From: Yuri Khan @ 2021-07-03 11:10 UTC
To: Jean Louis; +Cc: Help GNU Emacs

On Sat, 3 Jul 2021 at 16:41, Jean Louis <bugs@gnu.support> wrote:

> As I am developing a Double Opt-In CGI script served by Emacs, I am
> unsure whether this function is the correct one to use for the encoded
> strings that come from URL GET requests, like
> http://www.example.com/?message=Hello%20There
>
>   (rfc2231-decode-encoded-string "Hello%20there") ⇒ "Hello there"
>
> If anybody knows or has clues, let me know. In other programming
> languages I have not had to think about RFCs, and I don't know which
> RFC applies here.

Why not look at the RFC referenced in the function's name to see
whether it is relevant to your task?

https://datatracker.ietf.org/doc/html/rfc2231

It talks about encoding MIME headers, which is not what you’re dealing
with; and its encoded strings look like
<encoding>'<locale>'<percent-encoded-string>, which is not what you
have.

What you are dealing with is a URL, specifically, its query string
part. These are described in RFC 3986, and its percent-encoding scheme
in sections 2.1 and 2.5.

(url-unhex-string …) will do half the work for you: it will decode
percent-encoded sequences into bytes. By convention, in URLs,
characters are UTF-8-encoded before percent-encoding (see RFC 3986
§ 2.5), so you’ll need to use:

  (decode-coding-string (url-unhex-string s) 'utf-8)

to get a fully decoded text string.
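For illustration, a quick sanity check of those two steps in a scratch
buffer (this assumes the `url-util' library, which defines
`url-unhex-string'; the café string is just a made-up test value):

  (require 'url-util)

  ;; Percent-decoding alone is enough for plain ASCII:
  (url-unhex-string "Hello%20There")
  ;; ⇒ "Hello There"

  ;; Non-ASCII arrives as percent-encoded UTF-8 bytes, so decode the
  ;; bytes explicitly afterwards:
  (decode-coding-string (url-unhex-string "caf%C3%A9") 'utf-8)
  ;; ⇒ "café"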
* Re: Decoding URLs input
From: Jean Louis @ 2021-07-03 12:04 UTC
To: Yuri Khan; +Cc: Help GNU Emacs

I appreciate this tip and will test it.

On July 3, 2021 11:10:47 AM UTC, Yuri Khan <yuri.v.khan@gmail.com> wrote:
> On Sat, 3 Jul 2021 at 16:41, Jean Louis <bugs@gnu.support> wrote:
>
>> As I am developing a Double Opt-In CGI script served by Emacs, I am
>> unsure whether this function is the correct one to use for the encoded
>> strings that come from URL GET requests, like
>> http://www.example.com/?message=Hello%20There
>>
>>   (rfc2231-decode-encoded-string "Hello%20there") ⇒ "Hello there"
>>
>> If anybody knows or has clues, let me know. In other programming
>> languages I have not had to think about RFCs, and I don't know which
>> RFC applies here.
>
> Why not look at the RFC referenced in the function's name to see
> whether it is relevant to your task?
>
> https://datatracker.ietf.org/doc/html/rfc2231
>
> It talks about encoding MIME headers, which is not what you’re dealing
> with; and its encoded strings look like
> <encoding>'<locale>'<percent-encoded-string>, which is not what you
> have.
>
> What you are dealing with is a URL, specifically, its query string
> part. These are described in RFC 3986, and its percent-encoding scheme
> in sections 2.1 and 2.5.
>
> (url-unhex-string …) will do half the work for you: it will decode
> percent-encoded sequences into bytes. By convention, in URLs,
> characters are UTF-8-encoded before percent-encoding (see RFC 3986
> § 2.5), so you’ll need to use:
>
>   (decode-coding-string (url-unhex-string s) 'utf-8)
>
> to get a fully decoded text string.

Jean
* Re: Decoding URLs input
From: Jean Louis @ 2021-07-03 19:17 UTC
To: Yuri Khan; +Cc: Help GNU Emacs

* Yuri Khan <yuri.v.khan@gmail.com> [2021-07-03 14:12]:
> What you are dealing with is a URL, specifically, its query string
> part. These are described in RFC 3986, and its percent-encoding scheme
> in sections 2.1 and 2.5.
>
> (url-unhex-string …) will do half the work for you: it will decode
> percent-encoded sequences into bytes. By convention, in URLs,
> characters are UTF-8-encoded before percent-encoding (see RFC 3986
> § 2.5), so you’ll need to use:
>
>   (decode-coding-string (url-unhex-string s) 'utf-8)
>
> to get a fully decoded text string.

That is correct, and I have implemented it now. Until now it worked
without `decode-coding-string' because I had completely forgotten about
UTF-8. When I noticed that spaces are replaced with a plus `+' I
started digging more. It is not the first time I deal with this, and
each time I stumble over the UTF-8 handling; this time you were one
step ahead of me, as I had not yet hit the problem and could not see
what was missing.

From the docstring of `url-unhex-string' I did not expect it to give
just bytes back; IMHO that should be described there, though maybe it
is assumed that the programmer knows it.

The docstring is poor; it says: "Remove %XX embedded spaces, etc in a
URL." -- with "remove" I don't expect conversion of UTF-8 into bytes.
I guess now it is clear.

I am now solving the issue that spaces are converted to a plus sign,
and that I may have to convert the + signs before decoding:

  (decode-coding-string (url-unhex-string "Hello+There") 'utf-8)

but maybe not before; maybe I leave it and convert later.

A problem I have encountered is that the library subr.el does not
provide the feature 'subr -- I think I filed a report about it, but
without acknowledgment and without seeing it filed under my email, so I
wait. Because of that I cannot use `string-replace' in the CGI script:
it asks for that file, but I cannot `require' it, as it is not
"provided", so I would have to add that line myself, and I would really
rather not fiddle with the main Emacs files on the server.

-- 
Jean

Take action in Free Software Foundation campaigns:
https://www.fsf.org/campaigns

In support of Richard M. Stallman
https://stallmansupport.org/
* Re: Decoding URLs input
From: Yuri Khan @ 2021-07-03 20:16 UTC
To: Yuri Khan, Help GNU Emacs

On Sun, 4 Jul 2021 at 02:20, Jean Louis <bugs@gnu.support> wrote:

> From the docstring of `url-unhex-string' I did not expect it to give
> just bytes back; IMHO that should be described there, though maybe it
> is assumed that the programmer knows it.

I just fed it some percent-encoded sequences that I knew would result
in invalid UTF-8 when decoded. If it were doing a full decode, I
expected it to signal an error. It didn’t.

> The docstring is poor; it says: "Remove %XX embedded spaces, etc in a
> URL." -- with "remove" I don't expect conversion of UTF-8 into bytes.

Yeah, that is bad. If I see “remove %xx” in a docstring, I expect
(string= (f "Hello%20World") "HelloWorld").

> I am now solving the issue that spaces are converted to a plus sign,
> and that I may have to convert the + signs before decoding:
>
>   (decode-coding-string (url-unhex-string "Hello+There") 'utf-8)
>
> but maybe not before; maybe I leave it and convert later.

You have to replace them before percent-decoding. If you try it after
percent-decoding, you will not be able to distinguish a + that encodes
a space from a + that you just decoded from %2B. Luckily, spaces never
occur in a valid encoded query string; if they did and had some
meaning, you’d have to decode + *at the same time* as %xx.

Here, have some test cases:

  "Hello+There%7DWorld"   → "Hello There}World"
  "Hello%2BThere%7DWorld" → "Hello+There}World"

By the way, you’re in for some unspecified amount of pain by trying to
implement a web application without a framework. (And by a framework I
mean a library that would give you well-tested means to encode/decode
URL parts, HTTP headers, gzipped request/response bodies, base64,
quoted-printable, application/x-www-form-urlencoded,
multipart/form-data, json, …) CGI is not nearly as simple as it
initially appears to be when you read a hello-cgi-world tutorial.
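Putting the two steps together in the order described above, a minimal
sketch of a decoder for a single query-string value (the function name
is made up; `subst-char-in-string' comes from subr.el and avoids
depending on the newer `string-replace'):

  (require 'url-util)   ; for `url-unhex-string'

  (defun my-url-decode-query-value (value)
    "Decode one application/x-www-form-urlencoded VALUE.
  Replace + with a space first, so a literal plus sign (sent as %2B)
  still decodes to +; then percent-decode to bytes and decode the
  bytes as UTF-8."
    (decode-coding-string
     (url-unhex-string (subst-char-in-string ?+ ?\s value))
     'utf-8))

  ;; The test cases above:
  (my-url-decode-query-value "Hello+There%7DWorld")    ; ⇒ "Hello There}World"
  (my-url-decode-query-value "Hello%2BThere%7DWorld")  ; ⇒ "Hello+There}World"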
* Re: Decoding URLs input
From: Jean Louis @ 2021-07-03 22:18 UTC
To: Yuri Khan; +Cc: Help GNU Emacs

* Yuri Khan <yuri.v.khan@gmail.com> [2021-07-03 23:17]:
> I just fed it some percent-encoded sequences that I knew would result
> in invalid UTF-8 when decoded. If it were doing a full decode, I
> expected it to signal an error. It didn’t.
>
> > The docstring is poor; it says: "Remove %XX embedded spaces, etc in a
> > URL." -- with "remove" I don't expect conversion of UTF-8 into bytes.
>
> Yeah, that is bad. If I see “remove %xx” in a docstring, I expect
> (string= (f "Hello%20World") "HelloWorld").

Yes. Could you maybe correct that docstring?

> > I am now solving the issue that spaces are converted to a plus sign,
> > and that I may have to convert the + signs before decoding:
> >
> >   (decode-coding-string (url-unhex-string "Hello+There") 'utf-8)
> >
> > but maybe not before; maybe I leave it and convert later.
>
> You have to replace them before percent-decoding. If you try it after
> percent-decoding, you will not be able to distinguish a + that encodes
> a space from a + that you just decoded from %2B. Luckily, spaces never
> occur in a valid encoded query string; if they did and had some
> meaning, you’d have to decode + *at the same time* as %xx.

Exactly. I just had not yet done that analysis, so thanks for your
quick one! Now it is clear that I have to do it.

> By the way, you’re in for some unspecified amount of pain by trying to
> implement a web application without a framework. (And by a framework I
> mean a library that would give you well-tested means to encode/decode
> URL parts, HTTP headers, gzipped request/response bodies, base64,
> quoted-printable, application/x-www-form-urlencoded,
> multipart/form-data, json, …) CGI is not nearly as simple as it
> initially appears to be when you read a hello-cgi-world tutorial.

Definitely not as simple, though for this specific need it can stay
very compact. There also exist simple Emacs CGI libraries, though
nowhere near as comprehensive as what you describe.

The Double Opt-In already works, with cosmetic errors, and will soon be
perfected with this information. Its job is: to receive a subscription
request and redirect to the subscription confirmation page, which in
turn could redirect to a sales page, be the sales page, or be some
other page; to send an email asking the subscriber to confirm; to
receive the confirmation and dispatch the Emacs hash to the
administrator; to receive unsubscribe requests without any hesitation
and dispatch them to the administrator; and to offer the visitor the
chance to subscribe again.

I am designing it to work offline, just as I have been doing for
years. The database is not online; no people's data should ever be
released online. This is for business-secrecy purposes -- one can see
that databases leak all the time on raidforums.com. Practically it
works well and generates relations. I consider it one of the most
important scripts. The old Perl form script I long ago converted to
Common Lisp.

For my specific need:

- encode/decode URL parts: resolved, as I only receive a URL and
  dispatch simpler confirmation URLs;

- HTTP headers: I just use these and nothing more so far:

    (defun rcd-cgi-headers (&optional content-type)
      "Print basic HTTP headers for HTML."
      (let ((content-type (or content-type "text/html")))
        (princ (format "Content-type: %s\n\n" content-type))))

    (defun rcd-cgi-redirect (url)
      "Redirect to URL."
      (princ (concat "Location: " url "\n\n"))
      (unless (eq major-mode 'emacs-lisp-mode)
        (kill-emacs 0)))

- gzipped request/response bodies: hmm, I do not think I will get
  gzipped requests, though I have no idea right now. Responses
  definitely not, as the only response is either an error or a redirect
  to a page. I will keep the redirect pages in the URL itself so that
  the script stands totally on its own, without much hard-coded inside.

- I would like the request to be encrypted as a single line. I have
  used the Tiny Encryption Algorithm in Perl and it worked well. Do you
  know of any single-line encryption for Emacs? Maybe I can use
  OpenSSL; I need a stream cipher.
  https://en.wikipedia.org/wiki/Tiny_Encryption_Algorithm#Versions
  I don't know how to use this one, but it may be what I could use:
  https://github.com/skeeto/emacs-chacha20
  For the subscribe-request URLs it is best to have just one encrypted
  string, like doi.cgi?njadsjnasnfdkjsbfbsfbhj, so that nothing is
  shown to the user. That is how I have been doing it before with
  Perl. I may simply use an external program and encrypt the URL
  requests with something like:

    echo '(mid 1 eid 2 cid 3 tile "Subscribe to business")' \
      | openssl enc -ChaCha20 -e -k some -pbkdf2 -base64 -A
    U2FsdGVkX192ic8hOU15mR6zjoYK/rpRA/NkgHohy6eO2A+W8EHuopAigBcc57wKR/sxMYqPV1ESYEY523DS/h0=

  as the idea is to keep the script free of hard coding. It should only
  authorize email addresses that the server is in charge of. Maybe I
  can do that with the gnutls- functions? I just don't know how. That
  would be better, as it avoids external dependencies.

- base64: functions exist in Emacs already. Any problem? I will not
  need it.

- quoted-printable: these also exist in Emacs.

- application/x-www-form-urlencoded: yes, definitely, as above.

- multipart/form-data, json: json functions are there in Emacs, though
  I think I will not need them.

-- 
Jean

Take action in Free Software Foundation campaigns:
https://www.fsf.org/campaigns

In support of Richard M. Stallman
https://stallmansupport.org/
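For what it is worth, a minimal sketch of driving the openssl pipeline
shown above from Emacs Lisp, assuming openssl is on PATH. The function
name is made up, and passing the passphrase on the command line is for
illustration only (it shows up in the process list):

  (require 'subr-x)   ; for `string-trim'

  (defun my-doi-encrypt-token (plaintext passphrase)
    "Encrypt PLAINTEXT with openssl ChaCha20 and return a base64 token.
  Mirrors: openssl enc -ChaCha20 -e -k PASSPHRASE -pbkdf2 -base64 -A"
    (with-temp-buffer
      (insert plaintext)
      ;; Send the buffer contents to openssl and replace them with its
      ;; output, then return the trimmed base64 token.
      (call-process-region (point-min) (point-max)
                           "openssl" t t nil
                           "enc" "-ChaCha20" "-e"
                           "-k" passphrase "-pbkdf2" "-base64" "-A")
      (string-trim (buffer-string))))

Decryption would be the same call with "-d" in place of "-e".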
* Re: Decoding URLs input
From: Eli Zaretskii @ 2021-07-04 4:22 UTC
To: help-gnu-emacs

> From: Yuri Khan <yuri.v.khan@gmail.com>
> Date: Sun, 4 Jul 2021 03:16:37 +0700
>
> I just fed it some percent-encoded sequences that I knew would result
> in invalid UTF-8 when decoded. If it were doing a full decode, I
> expected it to signal an error. It didn’t.

That's not a reliable sign that a function returns unibyte strings.
Most Emacs APIs that decode strings don't signal errors if they
encounter invalid sequences; instead, they decode those into raw bytes.
The design principle is to support raw bytes in strings and let the
application deal with them if they are not expected.
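To illustrate the principle with the functions from this thread (my own
quick example; worth re-checking in a live Emacs):

  ;; 0xFF can never occur in valid UTF-8, yet decoding does not signal;
  ;; the byte is kept as a raw byte in the resulting multibyte string.
  (decode-coding-string (url-unhex-string "%FF") 'utf-8)
  ;; ⇒ "\377"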