Eli Zaretskii <eliz@gnu.org> writes:

>> Re appropriate encoding: correct me if I'm wrong (internet), but among
>> the Emacs coding systems, it'd be latin-1.
>
> That depends on what the other end expects.  Does it expect latin-1 in
> this case?

From the point of view of Emacs, I'd say yes: the other end, meaning the
proxy service, expects latin-1. From the service's point of view, it
only speaks byte sequences and doesn't interpret any fields as text [1].
This continues after proxying has commenced; incoming byte sequences are
forwarded verbatim as opaque payloads.

> Does emitting the single byte \330 produce the correct result in this
> case?  Then by all means please use
>
>    (encode-coding-string address 'latin-1)

It does indeed produce the correct result [2], and I've updated the
patch to reflect this. I wasn't sure whether you wanted me to replace
all the vectors in the tests with strings and/or annotate them with
comments explaining the protocol, so I just left them as is for now.

My main concern (based on sheer ignorance) was possible side effects of
encode-coding-string setting the variable last-coding-system-used to
latin-1. I investigated a little by stepping through the subsequent
send_process() call and found that the latin-1 value appears to be
short-lived: it's quickly reassigned to binary. I tried to demonstrate
this in the attached log of my debug session (which also shows that no
conversion is performed). Please pardon my sad debugging skills.

>> Re program on the other end: this would be any program offering a proxy
>> service that speaks the same protocol. Popular ones include tor and ssh.
>> [...]
>
> And those expect Latin-1 encoding in this case?

I'd say yes, insofar as these programs are examples of a proxy service
of the sort mentioned in the first answer above.

Thanks again


[1] Although, in the case of SOCKS 4A/5, non-numeric addresses, i.e.,
domain names, should probably be expressed in characters a resolver can
understand, like the Punycode ASCII subset.

[2] There is one tiny difference in behavior from the previous iteration
of this patch, but it's not worth anyone's time, so I'll just note it
here for the record: when called in the manner shown in the patch,
encode-coding-string silently replaces multibyte characters with spaces.

The only edge case I could think of in which accidentally passing a
multibyte character might be harder to debug than a normal typo would be
when hitting an address like
ec2-13-56-13-123.us-west-1.compute.amazonaws.com and mistakenly passing
13.256.13.123 (as "\15\u0100\15\173"), which would be routed to
13.32.13.123 (flickr/cloudflare), since the substituted space is byte 32.

One way to avoid this would be with validation like that performed by
unibyte-string or, alternatively, by purposefully violating the protocol
and sending, say, "\15\15{" instead of "\15 \15{" (and thereby
triggering an error response from the service). All in all, this seems
unlikely enough not to warrant special attention.
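
For the record, here is a minimal sketch of the first option, in case
it's useful. The helper name my-socks-octets-to-bytes is made up for
illustration and isn't part of the patch. It assumes the numeric address
arrives as a list of integers and simply leans on unibyte-string
rejecting anything outside 0..255:

```elisp
;; Hypothetical helper for illustration only; not part of the patch.
(defun my-socks-octets-to-bytes (octets)
  "Return OCTETS, a list of integers, as a unibyte string.
Signal an error if any element falls outside the range 0..255."
  (apply #'unibyte-string octets))

;; A valid address yields a four-byte unibyte string.
(my-socks-octets-to-bytes '(13 56 13 123))

;; The typo from the example above is caught up front instead of
;; being silently turned into a space.
(condition-case err
    (my-socks-octets-to-bytes '(13 256 13 123))
  (error (message "rejected: %S" err)))
```

With something like this the 13.256.13.123 typo would fail before
anything is sent, rather than being routed somewhere unexpected.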