unofficial mirror of bug-gnu-emacs@gnu.org 
 help / color / mirror / code / Atom feed
* bug#49449: 28: TLS connection never gets to "open" stage
@ 2021-07-06 19:12 Mattias Engdegård
  2021-07-07 19:57 ` Lars Ingebrigtsen
  0 siblings, 1 reply; 26+ messages in thread
From: Mattias Engdegård @ 2021-07-06 19:12 UTC (permalink / raw)
  To: 49449; +Cc: Lars Ingebrigtsen

It seems that the process/TLS apparatus can get into a state where it is unable to ever call the process sentinel "open\n" event; specifically, an async `url-https` does not make forward progress until the connection times out.

This has been observed on macOS in Emacs 28 now and then, and it looks like there is an invariant violation somewhere. To recap:

The "open\n" event is sent to the sentinel in either of two places:

(A) In finish_after_tls_connection, after having successfully called nsm-verify-connection and the condition

  (fd_callback_info[p->outfd].flags & NON_BLOCKING_CONNECT_FD) == 0

being satisfied (process.c:3277).

(B) In wait_reading_process_output, after the descriptor being found writable by `select` and the condition

    NILP (p->gnutls_boot_parameters) && !p->gnutls_p

being satisfied (process.c:5900).

There seems to be a gap in the logic, however: it is perfectly possible for the condition in (A) to fail because the descriptor is still marked nonblocking at that point, and for (B) to fail because gnutls_p=true was set already in gnutls_try_handshake.

Lars, it looks like you wrote at least part of the original logic. Can you see what is going on? It is somewhat complex.

For reference, I'm using the reproduction recipe below; it may or may not exhibit the problem in your particular setup. I'm using gnutls 3.6.15.

(defun busy-wait (s)
  (let ((t0 (current-time)))
    (while (< (time-to-seconds (time-since t0)) s) nil)))

(progn
  (url-http
   #s(url "https" nil nil "elpa.gnu.org" nil "/packages/archive-contents" nil nil t silent t t)
   (lambda (status) (message "callback: status = %S" status))
   '(nil) nil 'tls)
  (busy-wait 1.0))







^ permalink raw reply	[flat|nested] 26+ messages in thread

* bug#49449: 28: TLS connection never gets to "open" stage
  2021-07-06 19:12 bug#49449: 28: TLS connection never gets to "open" stage Mattias Engdegård
@ 2021-07-07 19:57 ` Lars Ingebrigtsen
  2021-07-08  7:59   ` Mattias Engdegård
  0 siblings, 1 reply; 26+ messages in thread
From: Lars Ingebrigtsen @ 2021-07-07 19:57 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: 49449

Mattias Engdegård <mattiase@acm.org> writes:

> (B) In wait_reading_process_output, after the descriptor being found writable by `select` and the condition
>
>     NILP (p->gnutls_boot_parameters) && !p->gnutls_p
>
> being satisfied (process.c:5900).
>
> There seems to be a gap in the logic, however: it is perfectly
> possible for the condition in (A) to fail because the descriptor is
> still marked nonblocking at that point, and for (B) to fail because
> gnutls_p=true was set already in gnutls_try_handshake.
>
> Lars, it looks like you wrote at least part of the original logic. Can
> you see what is going on? It is somewhat complex.

Yes, it's grown somewhat organically.  :-/

> For reference, I'm using the reproduction recipe below; it may or may not exhibit the problem in your particular setup. I'm using gnutls 3.6.15.
>
> (defun busy-wait (s)
>   (let ((t0 (current-time)))
>     (while (< (time-to-seconds (time-since t0)) s) nil)))
>
> (progn
>   (url-http
>    #s(url "https" nil nil "elpa.gnu.org" nil "/packages/archive-contents" nil nil t silent t t)
>    (lambda (status) (message "callback: status = %S" status))
>    '(nil) nil 'tls)
>   (busy-wait 1.0))

I'm not able to reproduce this on Debian/bullseye, but on Macos I get

callback: status = (:error (error connection-failed "connect" :host "elpa.gnu.o\
rg" :service 443))

after a while.  There's been several reports in the last week of TLS not
working on Macos.  Has Apple pushed something new, or...  did something
else happen lately in this area on Macos?

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no





^ permalink raw reply	[flat|nested] 26+ messages in thread

* bug#49449: 28: TLS connection never gets to "open" stage
  2021-07-07 19:57 ` Lars Ingebrigtsen
@ 2021-07-08  7:59   ` Mattias Engdegård
  2021-07-08 12:54     ` Lars Ingebrigtsen
  0 siblings, 1 reply; 26+ messages in thread
From: Mattias Engdegård @ 2021-07-08  7:59 UTC (permalink / raw)
  To: Lars Ingebrigtsen; +Cc: 49449

7 juli 2021 kl. 21.57 skrev Lars Ingebrigtsen <larsi@gnus.org>:

> Yes, it's grown somewhat organically.  :-/

Let me first say that the state of the code is not your fault! It's a product, as you say, from organic growth, and it does need a rewrite.

> I'm not able to reproduce this on Debian/bullseye, but on Macos I get
> 
> callback: status = (:error (error connection-failed "connect" :host "elpa.gnu.o\
> rg" :service 443))

Yes, that is my observation too. Obviously the busy-wait part is essential: removing it makes the problem go away.
Essentially, the busy-wait postpones the call to wait_reading_process_output so that when it is eventually called, gnutls_handshake succeeds on the first try instead of first returning GNUTLS_E_AGAIN, which brings us onto a different code path.

> There's been several reports in the last week of TLS not
> working on Macos.  Has Apple pushed something new, or...  did something
> else happen lately in this area on Macos?

No, I've been harassed by this bug for quite some time but only now decided to dig deeper. Most likely it's just a matter of different timing that the process/TLS system doesn't cope with.

First, when the `url-http` call returns we have a Lisp_Process with

 gnutls_p = true
 gnutls_boot_parameters = non-nil
 gnutls_initstage = GNUTLS_STAGE_HANDSHAKE_TRIED (8)

and its file descriptor has a corresponding fd_callback_data with
 flags = FOR_WRITE | NON_BLOCKING_CONNECT_FD

because the asynchronous connect call has not yet been completed.

In the GOOD case (without busy-wait), `wait_reading_process_output` gets called right away (because Emacs has nothing else to do) and gnutls_try_handshake initially fails with E_AGAIN but p->outfd becomes writable so `delete_write_fd` is called to zero the fd_callback_data flags, and when the handshake eventually succeeds, the sentinel is called with the "open\n" event.

In the BAD case (with busy-wait), the TLS handshake succeeds right away while the descriptor flags still has NON_BLOCKING_CONNECT_FD set, so the sentinel isn't called.

Does this jog any memories?






^ permalink raw reply	[flat|nested] 26+ messages in thread

* bug#49449: 28: TLS connection never gets to "open" stage
  2021-07-08  7:59   ` Mattias Engdegård
@ 2021-07-08 12:54     ` Lars Ingebrigtsen
  2021-07-08 16:47       ` Mattias Engdegård
  0 siblings, 1 reply; 26+ messages in thread
From: Lars Ingebrigtsen @ 2021-07-08 12:54 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: 49449

Mattias Engdegård <mattiase@acm.org> writes:

> In the BAD case (with busy-wait), the TLS handshake succeeds right
> away while the descriptor flags still has NON_BLOCKING_CONNECT_FD set,
> so the sentinel isn't called.
>
> Does this jog any memories?

Sorry, nope.  :-/

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no





^ permalink raw reply	[flat|nested] 26+ messages in thread

* bug#49449: 28: TLS connection never gets to "open" stage
  2021-07-08 12:54     ` Lars Ingebrigtsen
@ 2021-07-08 16:47       ` Mattias Engdegård
  2021-07-10 16:27         ` Lars Ingebrigtsen
  0 siblings, 1 reply; 26+ messages in thread
From: Mattias Engdegård @ 2021-07-08 16:47 UTC (permalink / raw)
  To: Lars Ingebrigtsen; +Cc: 49449

[-- Attachment #1: Type: text/plain, Size: 434 bytes --]

8 juli 2021 kl. 14.54 skrev Lars Ingebrigtsen <larsi@gnus.org>:

> Sorry, nope.  :-/

I was thinking about 9c74f2fea6bf which introduced the condition of the wait mask, and maybe ac6e085cf6b2 where async TLS negotiation was added.

A hack like the one below would "solve" the problem by pretending that the first handshake attempt always fails with E_AGAIN, but it would be better to do it in a slightly more principled way.


[-- Attachment #2: tls-handshake-hack.diff --]
[-- Type: application/octet-stream, Size: 984 bytes --]

diff --git a/src/gnutls.c b/src/gnutls.c
index 22e7f2cbc1..47ac679c5c 100644
--- a/src/gnutls.c
+++ b/src/gnutls.c
@@ -623,13 +623,23 @@ gnutls_try_handshake (struct Lisp_Process *proc)
   if (non_blocking)
     proc->gnutls_p = true;
 
-  while ((ret = gnutls_handshake (state)) < 0)
+  if (non_blocking && proc->gnutls_handshakes_tried < 1)
     {
-      if (emacs_gnutls_handle_error (state, ret) == 0) /* fatal */
-	break;
+      /* HACK: don't succeed the first time for nonblocking connections,
+	 because the logic doesn't allow it (bug#49449).  */
+      ret = GNUTLS_E_AGAIN;
       maybe_quit ();
-      if (non_blocking && ret != GNUTLS_E_INTERRUPTED)
-	break;
+    }
+  else
+    {
+      while ((ret = gnutls_handshake (state)) < 0)
+	{
+	  if (emacs_gnutls_handle_error (state, ret) == 0) /* fatal */
+	    break;
+	  maybe_quit ();
+	  if (non_blocking && ret != GNUTLS_E_INTERRUPTED)
+	    break;
+	}
     }
 
   proc->gnutls_initstage = GNUTLS_STAGE_HANDSHAKE_TRIED;

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* bug#49449: 28: TLS connection never gets to "open" stage
  2021-07-08 16:47       ` Mattias Engdegård
@ 2021-07-10 16:27         ` Lars Ingebrigtsen
  2021-07-10 16:51           ` Mattias Engdegård
  0 siblings, 1 reply; 26+ messages in thread
From: Lars Ingebrigtsen @ 2021-07-10 16:27 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: 49449

Mattias Engdegård <mattiase@acm.org> writes:

> A hack like the one below would "solve" the problem by pretending that
> the first handshake attempt always fails with E_AGAIN, but it would be
> better to do it in a slightly more principled way.

Hm...  does reverting 234bf1b6363a3d5db8e73c422d87a0bf1aa4b2e3 help any
here?

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no





^ permalink raw reply	[flat|nested] 26+ messages in thread

* bug#49449: 28: TLS connection never gets to "open" stage
  2021-07-10 16:27         ` Lars Ingebrigtsen
@ 2021-07-10 16:51           ` Mattias Engdegård
  2021-07-10 16:57             ` Lars Ingebrigtsen
  0 siblings, 1 reply; 26+ messages in thread
From: Mattias Engdegård @ 2021-07-10 16:51 UTC (permalink / raw)
  To: Lars Ingebrigtsen; +Cc: 49449

10 juli 2021 kl. 18.27 skrev Lars Ingebrigtsen <larsi@gnus.org>:

> Hm...  does reverting 234bf1b6363a3d5db8e73c422d87a0bf1aa4b2e3 help any
> here?

No, sorry. (Was worth a shot, though!)






^ permalink raw reply	[flat|nested] 26+ messages in thread

* bug#49449: 28: TLS connection never gets to "open" stage
  2021-07-10 16:51           ` Mattias Engdegård
@ 2021-07-10 16:57             ` Lars Ingebrigtsen
  2021-07-10 17:07               ` Mattias Engdegård
  2021-07-10 18:23               ` Mattias Engdegård
  0 siblings, 2 replies; 26+ messages in thread
From: Lars Ingebrigtsen @ 2021-07-10 16:57 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: 49449

Mattias Engdegård <mattiase@acm.org> writes:

> 10 juli 2021 kl. 18.27 skrev Lars Ingebrigtsen <larsi@gnus.org>:
>
>> Hm...  does reverting 234bf1b6363a3d5db8e73c422d87a0bf1aa4b2e3 help any
>> here?
>
> No, sorry. (Was worth a shot, though!)

Darn!  Well, back to your patch:

-  while ((ret = gnutls_handshake (state)) < 0)
+  if (non_blocking && proc->gnutls_handshakes_tried < 1)
     {
-      if (emacs_gnutls_handle_error (state, ret) == 0) /* fatal */
-	break;
+      /* HACK: don't succeed the first time for nonblocking connections,
+	 because the logic doesn't allow it (bug#49449).  */
+      ret = GNUTLS_E_AGAIN;

I'm not super enthusiastic about this -- it'll artificially make
connections slower?  The problem only seems to occur on Macos, so
it'd be interesting to find out why it doesn't happen on Linux...

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no





^ permalink raw reply	[flat|nested] 26+ messages in thread

* bug#49449: 28: TLS connection never gets to "open" stage
  2021-07-10 16:57             ` Lars Ingebrigtsen
@ 2021-07-10 17:07               ` Mattias Engdegård
  2021-07-10 18:23               ` Mattias Engdegård
  1 sibling, 0 replies; 26+ messages in thread
From: Mattias Engdegård @ 2021-07-10 17:07 UTC (permalink / raw)
  To: Lars Ingebrigtsen; +Cc: 49449

10 juli 2021 kl. 18.57 skrev Lars Ingebrigtsen <larsi@gnus.org>:

> I'm not super enthusiastic about this -- it'll artificially make
> connections slower?  The problem only seems to occur on Macos, so
> it'd be interesting to find out why it doesn't happen on Linux...

I'm not fond about it either; it's just a hack. I'm not sure if it makes connections slower or not (if so, it's 100 ms).

We really should seek a sound understanding of the bug so that a proper solution can be found. However the patch is important in that it does seem to work (and consistently so) and could assist us in developing a theory.






^ permalink raw reply	[flat|nested] 26+ messages in thread

* bug#49449: 28: TLS connection never gets to "open" stage
  2021-07-10 16:57             ` Lars Ingebrigtsen
  2021-07-10 17:07               ` Mattias Engdegård
@ 2021-07-10 18:23               ` Mattias Engdegård
  2021-07-10 18:54                 ` Eli Zaretskii
                                   ` (2 more replies)
  1 sibling, 3 replies; 26+ messages in thread
From: Mattias Engdegård @ 2021-07-10 18:23 UTC (permalink / raw)
  To: Lars Ingebrigtsen; +Cc: 49449

>  The problem only seems to occur on Macos, so
> it'd be interesting to find out why it doesn't happen on Linux...

Now I've had access to a Linux machine, and it seems that it works because it has getaddrinfo_a (async DNS lookup) so the path taken is different. If HAVE_GETADDRINFO_A is #undef'ed, it seems to behave like macOS and fail in the same way if busy-waiting for one second after the url-http call. Can you confirm?






^ permalink raw reply	[flat|nested] 26+ messages in thread

* bug#49449: 28: TLS connection never gets to "open" stage
  2021-07-10 18:23               ` Mattias Engdegård
@ 2021-07-10 18:54                 ` Eli Zaretskii
  2021-07-10 19:22                   ` Mattias Engdegård
  2021-07-10 20:05                 ` Mattias Engdegård
  2021-07-11 11:29                 ` Lars Ingebrigtsen
  2 siblings, 1 reply; 26+ messages in thread
From: Eli Zaretskii @ 2021-07-10 18:54 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: 49449, larsi

> From: Mattias Engdegård <mattiase@acm.org>
> Date: Sat, 10 Jul 2021 20:23:33 +0200
> Cc: 49449@debbugs.gnu.org
> 
> >  The problem only seems to occur on Macos, so
> > it'd be interesting to find out why it doesn't happen on Linux...
> 
> Now I've had access to a Linux machine, and it seems that it works because it has getaddrinfo_a (async DNS lookup) so the path taken is different. If HAVE_GETADDRINFO_A is #undef'ed, it seems to behave like macOS and fail in the same way if busy-waiting for one second after the url-http call. Can you confirm?

The busy-wait loop can delay the sentinel call, but do you understand
why the sentinel isn't called once the loop is over?





^ permalink raw reply	[flat|nested] 26+ messages in thread

* bug#49449: 28: TLS connection never gets to "open" stage
  2021-07-10 18:54                 ` Eli Zaretskii
@ 2021-07-10 19:22                   ` Mattias Engdegård
  2021-07-10 19:31                     ` Eli Zaretskii
  0 siblings, 1 reply; 26+ messages in thread
From: Mattias Engdegård @ 2021-07-10 19:22 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 49449, larsi

10 juli 2021 kl. 20.54 skrev Eli Zaretskii <eliz@gnu.org>:

> The busy-wait loop can delay the sentinel call, but do you understand
> why the sentinel isn't called once the loop is over?

Sort of; see previous discussion. In short: once the busy-wait loop is over, the TLS handshake succeeds immediately (because sufficient time has passed) but the logic isn't set up for this path and enters a state where the sentinel cannot be called.






^ permalink raw reply	[flat|nested] 26+ messages in thread

* bug#49449: 28: TLS connection never gets to "open" stage
  2021-07-10 19:22                   ` Mattias Engdegård
@ 2021-07-10 19:31                     ` Eli Zaretskii
  2021-07-10 19:44                       ` Mattias Engdegård
  0 siblings, 1 reply; 26+ messages in thread
From: Eli Zaretskii @ 2021-07-10 19:31 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: 49449, larsi

> From: Mattias Engdegård <mattiase@acm.org>
> Date: Sat, 10 Jul 2021 21:22:09 +0200
> Cc: larsi@gnus.org, 49449@debbugs.gnu.org
> 
> 10 juli 2021 kl. 20.54 skrev Eli Zaretskii <eliz@gnu.org>:
> 
> > The busy-wait loop can delay the sentinel call, but do you understand
> > why the sentinel isn't called once the loop is over?
> 
> Sort of; see previous discussion. In short: once the busy-wait loop is over, the TLS handshake succeeds immediately (because sufficient time has passed) but the logic isn't set up for this path and enters a state where the sentinel cannot be called.

That's my question: why cannot the sentinel be called in this case?
what prevents it from being called?  Apologies if I missed the
explanation, and could you please repeat it?





^ permalink raw reply	[flat|nested] 26+ messages in thread

* bug#49449: 28: TLS connection never gets to "open" stage
  2021-07-10 19:31                     ` Eli Zaretskii
@ 2021-07-10 19:44                       ` Mattias Engdegård
  2021-07-11  6:49                         ` Eli Zaretskii
  0 siblings, 1 reply; 26+ messages in thread
From: Mattias Engdegård @ 2021-07-10 19:44 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 49449, larsi

10 juli 2021 kl. 21.31 skrev Eli Zaretskii <eliz@gnu.org>:

> That's my question: why cannot the sentinel be called in this case?
> what prevents it from being called?

In the failing case, wait_reading_process_output calls gnutls_try_handshake early on, which succeeds and this leads to finish_after_tls_connection being called. Here, we have the condition

  else if ((fd_callback_info[p->outfd].flags & NON_BLOCKING_CONNECT_FD) == 0)

which gates further progress, but this condition is false because the flags have NON_BLOCKING_CONNECT_FD set.

In the successful case, the first call to gnutls_try_handshake from wait_reading_process_output fails because things haven't had the time to be set up yet. This leads to a select being called on the socket for writing (since it's in a nonblocking connect), and when ready, the NON_BLOCKING_CONNECT_FD bit is cleared from the flags.

This is a simplified view. The state is clearly more complex and things need to be done in the proper order.






^ permalink raw reply	[flat|nested] 26+ messages in thread

* bug#49449: 28: TLS connection never gets to "open" stage
  2021-07-10 18:23               ` Mattias Engdegård
  2021-07-10 18:54                 ` Eli Zaretskii
@ 2021-07-10 20:05                 ` Mattias Engdegård
  2021-07-11 11:31                   ` Lars Ingebrigtsen
  2021-07-11 11:29                 ` Lars Ingebrigtsen
  2 siblings, 1 reply; 26+ messages in thread
From: Mattias Engdegård @ 2021-07-10 20:05 UTC (permalink / raw)
  To: Lars Ingebrigtsen, Eli Zaretskii; +Cc: 49449

[-- Attachment #1: Type: text/plain, Size: 189 bytes --]

Here is a slightly more principled solution. It blocks attempts at TLS handshaking until the nonblocking connect has actually been established, since it's pointless to go on beforehand.


[-- Attachment #2: tls-connect.diff --]
[-- Type: application/octet-stream, Size: 626 bytes --]

diff --git a/src/process.c b/src/process.c
index b8c3e4ecfb..c3186eed75 100644
--- a/src/process.c
+++ b/src/process.c
@@ -5232,7 +5232,10 @@ wait_reading_process_output (intmax_t time_limit, int nsecs, int read_kbd,
 #ifdef HAVE_GNUTLS
 		/* Continue TLS negotiation. */
 		if (p->gnutls_initstage == GNUTLS_STAGE_HANDSHAKE_TRIED
-		    && p->is_non_blocking_client)
+		    && p->is_non_blocking_client
+		    /* Don't proceed until we have established a connection. */
+		    && !(fd_callback_info[p->outfd].flags
+			 & NON_BLOCKING_CONNECT_FD))
 		  {
 		    gnutls_try_handshake (p);
 		    p->gnutls_handshakes_tried++;

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* bug#49449: 28: TLS connection never gets to "open" stage
  2021-07-10 19:44                       ` Mattias Engdegård
@ 2021-07-11  6:49                         ` Eli Zaretskii
  2021-07-11  7:42                           ` Mattias Engdegård
  0 siblings, 1 reply; 26+ messages in thread
From: Eli Zaretskii @ 2021-07-11  6:49 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: 49449, larsi

> From: Mattias Engdegård <mattiase@acm.org>
> Date: Sat, 10 Jul 2021 21:44:21 +0200
> Cc: larsi@gnus.org, 49449@debbugs.gnu.org
> 
> 10 juli 2021 kl. 21.31 skrev Eli Zaretskii <eliz@gnu.org>:
> 
> > That's my question: why cannot the sentinel be called in this case?
> > what prevents it from being called?
> 
> In the failing case, wait_reading_process_output calls gnutls_try_handshake early on, which succeeds and this leads to finish_after_tls_connection being called. Here, we have the condition
> 
>   else if ((fd_callback_info[p->outfd].flags & NON_BLOCKING_CONNECT_FD) == 0)
> 
> which gates further progress, but this condition is false because the flags have NON_BLOCKING_CONNECT_FD set.

Thanks.  A potentially silly question: why not reset the
NON_BLOCKING_CONNECT_FD bit before we call finish_after_tls_connection
from that place?





^ permalink raw reply	[flat|nested] 26+ messages in thread

* bug#49449: 28: TLS connection never gets to "open" stage
  2021-07-11  6:49                         ` Eli Zaretskii
@ 2021-07-11  7:42                           ` Mattias Engdegård
  2021-07-11 10:14                             ` Eli Zaretskii
  0 siblings, 1 reply; 26+ messages in thread
From: Mattias Engdegård @ 2021-07-11  7:42 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 49449, larsi

11 juli 2021 kl. 08.49 skrev Eli Zaretskii <eliz@gnu.org>:

> why not reset the
> NON_BLOCKING_CONNECT_FD bit before we call finish_after_tls_connection
> from that place?

That's tantamount to jamming a metal screwdriver into the condition on line 3277, which was indeed the first thing I tried. Unfortunately it doesn't work -- after the sentinel is called, no further progress is made, probably because we (by lying) haven't set up the connection properly.






^ permalink raw reply	[flat|nested] 26+ messages in thread

* bug#49449: 28: TLS connection never gets to "open" stage
  2021-07-11  7:42                           ` Mattias Engdegård
@ 2021-07-11 10:14                             ` Eli Zaretskii
  2021-07-11 14:26                               ` Mattias Engdegård
  0 siblings, 1 reply; 26+ messages in thread
From: Eli Zaretskii @ 2021-07-11 10:14 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: 49449, larsi

> Feedback-ID:mattiase@acm.or
> From: Mattias Engdegård <mattiase@acm.org>
> Date: Sun, 11 Jul 2021 09:42:34 +0200
> Cc: larsi@gnus.org, 49449@debbugs.gnu.org
> 
> 11 juli 2021 kl. 08.49 skrev Eli Zaretskii <eliz@gnu.org>:
> 
> > why not reset the
> > NON_BLOCKING_CONNECT_FD bit before we call finish_after_tls_connection
> > from that place?
> 
> That's tantamount to jamming a metal screwdriver into the condition on line 3277, which was indeed the first thing I tried. Unfortunately it doesn't work -- after the sentinel is called, no further progress is made, probably because we (by lying) haven't set up the connection properly.

Did you succeed in understanding what else has to happen before that
flag could be safely reset?

And anyway, if those conditions are not yet set, I wonder why are we
calling finish_after_tls_connection at that place?





^ permalink raw reply	[flat|nested] 26+ messages in thread

* bug#49449: 28: TLS connection never gets to "open" stage
  2021-07-10 18:23               ` Mattias Engdegård
  2021-07-10 18:54                 ` Eli Zaretskii
  2021-07-10 20:05                 ` Mattias Engdegård
@ 2021-07-11 11:29                 ` Lars Ingebrigtsen
  2021-07-11 14:28                   ` Mattias Engdegård
  2 siblings, 1 reply; 26+ messages in thread
From: Lars Ingebrigtsen @ 2021-07-11 11:29 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: 49449

Mattias Engdegård <mattiase@acm.org> writes:

> If HAVE_GETADDRINFO_A is #undef'ed, it seems to behave like
> macOS and fail in the same way if busy-waiting for one second after
> the url-http call. Can you confirm?

With the test case and HAVE_GETADDRINFO_A undeffed, I'm still not able
to reproduce the problem on Debian/bullseye.

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no





^ permalink raw reply	[flat|nested] 26+ messages in thread

* bug#49449: 28: TLS connection never gets to "open" stage
  2021-07-10 20:05                 ` Mattias Engdegård
@ 2021-07-11 11:31                   ` Lars Ingebrigtsen
  0 siblings, 0 replies; 26+ messages in thread
From: Lars Ingebrigtsen @ 2021-07-11 11:31 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: 49449

Mattias Engdegård <mattiase@acm.org> writes:

> Here is a slightly more principled solution. It blocks attempts at TLS
> handshaking until the nonblocking connect has actually been
> established, since it's pointless to go on beforehand.

[...]

> -		    && p->is_non_blocking_client)
> +		    && p->is_non_blocking_client
> +		    /* Don't proceed until we have established a connection. */
> +		    && !(fd_callback_info[p->outfd].flags
> +			 & NON_BLOCKING_CONNECT_FD))

Yeah, I think that's a sensible change (but I haven't tried it).

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no





^ permalink raw reply	[flat|nested] 26+ messages in thread

* bug#49449: 28: TLS connection never gets to "open" stage
  2021-07-11 10:14                             ` Eli Zaretskii
@ 2021-07-11 14:26                               ` Mattias Engdegård
  2021-07-11 15:01                                 ` Eli Zaretskii
  0 siblings, 1 reply; 26+ messages in thread
From: Mattias Engdegård @ 2021-07-11 14:26 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 49449, larsi

[-- Attachment #1: Type: text/plain, Size: 612 bytes --]

11 juli 2021 kl. 12.14 skrev Eli Zaretskii <eliz@gnu.org>:

> Did you succeed in understanding what else has to happen before that
> flag could be safely reset?

Yes. The TCP connection needs to be established, the socket descriptor removed from write monitoring (because it is now connected) and added to read monitoring (so that we can get incoming traffic).

This suggests a second solution: remove the condition on line 3277 on the grounds that since the TLS handshake succeeded, we are evidently connected; then remove the write monitoring and add the read monitoring before calling the sentinel:


[-- Attachment #2: alt.diff --]
[-- Type: application/octet-stream, Size: 649 bytes --]

diff --git a/src/process.c b/src/process.c
index b8c3e4ecfb..7d2a9e9a62 100644
--- a/src/process.c
+++ b/src/process.c
@@ -3274,8 +3274,13 @@ finish_after_tls_connection (Lisp_Object proc)
       pset_status (p, Qfailed);
       deactivate_process (proc);
     }
-  else if ((fd_callback_info[p->outfd].flags & NON_BLOCKING_CONNECT_FD) == 0)
+  else
     {
+      delete_write_fd (p->outfd);
+
+      if (0 <= p->infd && !EQ (p->filter, Qt) && !EQ (p->command, Qt))
+	add_process_read_fd (p->infd);
+
       /* If we cleared the connection wait mask before we did the TLS
 	 setup, then we have to say that the process is finally "open"
 	 here. */

[-- Attachment #3: Type: text/plain, Size: 992 bytes --]


> And anyway, if those conditions are not yet set, I wonder why are we
> calling finish_after_tls_connection at that place?

There's no harm in calling `gnutls_handshake`; it will just return E_AGAIN if the connection hasn't been established. On the other hand there's little point in doing so until we have a connection.

Which suggests a third solution: do the handshake right away after establishing the connection. That would go into the code somewhere before line 5900, which right now is a condition that I don't quite understand. I think Lars wrote it but apparently forgot all about it (happens to everyone once in a while).

I still favour the less intrusive patch posted previously (adding a condition at line 5235) since it avoids duplication; there is already far too much of that in the code (everything seems to be done in at least two places). The code is obviously in the need of restructuring, but we shouldn't conflate that effort with fixing this specific bug.


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* bug#49449: 28: TLS connection never gets to "open" stage
  2021-07-11 11:29                 ` Lars Ingebrigtsen
@ 2021-07-11 14:28                   ` Mattias Engdegård
  0 siblings, 0 replies; 26+ messages in thread
From: Mattias Engdegård @ 2021-07-11 14:28 UTC (permalink / raw)
  To: Lars Ingebrigtsen; +Cc: 49449

11 juli 2021 kl. 13.29 skrev Lars Ingebrigtsen <larsi@gnus.org>:

> With the test case and HAVE_GETADDRINFO_A undeffed, I'm still not able
> to reproduce the problem on Debian/bullseye.

Hard to say what's going on since the problem is somewhat timing-sensitive, but it's easy enough to sprinkle printfs for finding out what code paths are taken.






^ permalink raw reply	[flat|nested] 26+ messages in thread

* bug#49449: 28: TLS connection never gets to "open" stage
  2021-07-11 14:26                               ` Mattias Engdegård
@ 2021-07-11 15:01                                 ` Eli Zaretskii
  2021-07-12 14:57                                   ` Mattias Engdegård
  0 siblings, 1 reply; 26+ messages in thread
From: Eli Zaretskii @ 2021-07-11 15:01 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: 49449, larsi

> From: Mattias Engdegård <mattiase@acm.org>
> Date: Sun, 11 Jul 2021 16:26:47 +0200
> Cc: larsi@gnus.org, 49449@debbugs.gnu.org
> 
> 
> [1:text/plain Hide]
> 
> 11 juli 2021 kl. 12.14 skrev Eli Zaretskii <eliz@gnu.org>:
> 
> > Did you succeed in understanding what else has to happen before that
> > flag could be safely reset?
> 
> Yes. The TCP connection needs to be established, the socket descriptor removed from write monitoring (because it is now connected) and added to read monitoring (so that we can get incoming traffic).
> 
> This suggests a second solution: remove the condition on line 3277 on the grounds that since the TLS handshake succeeded, we are evidently connected; then remove the write monitoring and add the read monitoring before calling the sentinel:
> 
> 
> [2:application/octet-stream Show Save:alt.diff (649B)]
> 
> 
> [3:text/plain Hide]
> 
> 
> > And anyway, if those conditions are not yet set, I wonder why are we
> > calling finish_after_tls_connection at that place?
> 
> There's no harm in calling `gnutls_handshake`; it will just return E_AGAIN if the connection hasn't been established. On the other hand there's little point in doing so until we have a connection.
> 
> Which suggests a third solution: do the handshake right away after establishing the connection. That would go into the code somewhere before line 5900, which right now is a condition that I don't quite understand. I think Lars wrote it but apparently forgot all about it (happens to everyone once in a while).

Thanks for the explanations.

> I still favour the less intrusive patch posted previously (adding a condition at line 5235) since it avoids duplication; there is already far too much of that in the code (everything seems to be done in at least two places). The code is obviously in the need of restructuring, but we shouldn't conflate that effort with fixing this specific bug.

I tend to agree.





^ permalink raw reply	[flat|nested] 26+ messages in thread

* bug#49449: 28: TLS connection never gets to "open" stage
  2021-07-11 15:01                                 ` Eli Zaretskii
@ 2021-07-12 14:57                                   ` Mattias Engdegård
  2021-07-12 15:02                                     ` Lars Ingebrigtsen
  0 siblings, 1 reply; 26+ messages in thread
From: Mattias Engdegård @ 2021-07-12 14:57 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 49449, larsi

[-- Attachment #1: Type: text/plain, Size: 992 bytes --]

11 juli 2021 kl. 17.01 skrev Eli Zaretskii <eliz@gnu.org>:

>> I still favour the less intrusive patch posted previously (adding a condition at line 5235) since it avoids duplication; there is already far too much of that in the code (everything seems to be done in at least two places). The code is obviously in the need of restructuring, but we shouldn't conflate that effort with fixing this specific bug.
> 
> I tend to agree.

Attached is the patch that I intend to push if there are no objections. The actual change is the same as before and I anticipate no trouble arising from it but tests are usually more fragile.

This issue could very well be the root cause of or at least connected to other bugs: maybe bug#36017 or bug#34341? In any case it's good to see it fixed; it annoyed me (with GNU ELPA in particular) for quite some time and the various unsatisfactory workarounds suggested each time this came up (such as using HTTP instead of HTTPS) are no longer required.


[-- Attachment #2: 0001-Block-TLS-handshake-until-TCP-connection-established.patch --]
[-- Type: application/octet-stream, Size: 3703 bytes --]

From 88b0b7e0dd0daec88eab7dde0c9bd3263d8b52de Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Mattias=20Engdeg=C3=A5rd?= <mattiase@acm.org>
Date: Mon, 12 Jul 2021 13:58:28 +0200
Subject: [PATCH] Block TLS handshake until TCP connection established

If a TLS handshake is attempted before the completion of an
asynchronous TCP connection has been ascertained, our local state will
not be set up correctly for further progress and the sentinel "open"
event will never be sent.  This can occur if sufficient time passes
after the initiation of an async TCP connection so that by the time
`wait_reading_process_output` is called, the connection has already
been established on the TCP level.

This somewhat timing-sensitive bug has plagued HTTPS connections on
some platforms, notably macOS, for a long time (bug#49449).

* src/process.c (wait_reading_process_output): Gate the TLS handshake
by the NON_BLOCKING_CONNECT_FD flag.  The flag will be cleared as soon
as the TCP socket is found to be writable.
* test/src/process-tests.el (process-async-https-with-delay):
New test.
---
 src/process.c             |  5 ++++-
 test/src/process-tests.el | 30 ++++++++++++++++++++++++++++++
 2 files changed, 34 insertions(+), 1 deletion(-)

diff --git a/src/process.c b/src/process.c
index b8c3e4ecfb..c3186eed75 100644
--- a/src/process.c
+++ b/src/process.c
@@ -5232,7 +5232,10 @@ wait_reading_process_output (intmax_t time_limit, int nsecs, int read_kbd,
 #ifdef HAVE_GNUTLS
 		/* Continue TLS negotiation. */
 		if (p->gnutls_initstage == GNUTLS_STAGE_HANDSHAKE_TRIED
-		    && p->is_non_blocking_client)
+		    && p->is_non_blocking_client
+		    /* Don't proceed until we have established a connection. */
+		    && !(fd_callback_info[p->outfd].flags
+			 & NON_BLOCKING_CONNECT_FD))
 		  {
 		    gnutls_try_handshake (p);
 		    p->gnutls_handshakes_tried++;
diff --git a/test/src/process-tests.el b/test/src/process-tests.el
index 1774f2fc74..9bab523708 100644
--- a/test/src/process-tests.el
+++ b/test/src/process-tests.el
@@ -28,6 +28,7 @@
 (require 'puny)
 (require 'subr-x)
 (require 'dns)
+(require 'url-http)
 
 ;; Timeout in seconds; the test fails if the timeout is reached.
 (defvar process-test-sentinel-wait-timeout 2.0)
@@ -916,5 +917,34 @@ process-sentinel-interrupt-event
       ;; ...and the change description should be "interrupt".
       (should (equal '("interrupt\n") events)))))
 
+(ert-deftest process-async-https-with-delay ()
+  "Bug#49449: asynchronous TLS connection with delayed completion."
+  (skip-unless (and internet-is-working (gnutls-available-p)))
+  (let* ((status nil)
+         (buf (url-http
+                 #s(url "https" nil nil "elpa.gnu.org" nil
+                        "/packages/archive-contents" nil nil t silent t t)
+                 (lambda (s) (setq status s))
+                 '(nil) nil 'tls)))
+    (unwind-protect
+        (progn
+          ;; Busy-wait for 1 s to allow for the TCP connection to complete.
+          (let ((delay 1.0)
+                (t0 (float-time)))
+            (while (< (float-time) (+ t0 delay))))
+          ;; Wait for the entire operation to finish.
+          (let ((limit 4.0)
+                (t0 (float-time)))
+            (while (and (null status)
+                        (< (float-time) (+ t0 limit)))
+              (sit-for 0.1)))
+          (should status)
+          (should-not (assq :error status))
+          (should buf)
+          (should (> (buffer-size buf) 0))
+          )
+      (when buf
+        (kill-buffer buf)))))
+
 (provide 'process-tests)
 ;;; process-tests.el ends here
-- 
2.21.1 (Apple Git-122.3)


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* bug#49449: 28: TLS connection never gets to "open" stage
  2021-07-12 14:57                                   ` Mattias Engdegård
@ 2021-07-12 15:02                                     ` Lars Ingebrigtsen
  2021-07-13 17:08                                       ` Mattias Engdegård
  0 siblings, 1 reply; 26+ messages in thread
From: Lars Ingebrigtsen @ 2021-07-12 15:02 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: 49449

Mattias Engdegård <mattiase@acm.org> writes:

> Attached is the patch that I intend to push if there are no
> objections. The actual change is the same as before and I anticipate
> no trouble arising from it but tests are usually more fragile.

Looks good to me.

> This issue could very well be the root cause of or at least connected
> to other bugs: maybe bug#36017 or bug#34341? 

bug#36017 looks similar, at least, so it'd be good to poke that bug and
ask whether this patch fixes that problem, too.

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no





^ permalink raw reply	[flat|nested] 26+ messages in thread

* bug#49449: 28: TLS connection never gets to "open" stage
  2021-07-12 15:02                                     ` Lars Ingebrigtsen
@ 2021-07-13 17:08                                       ` Mattias Engdegård
  0 siblings, 0 replies; 26+ messages in thread
From: Mattias Engdegård @ 2021-07-13 17:08 UTC (permalink / raw)
  To: Lars Ingebrigtsen; +Cc: 49449

12 juli 2021 kl. 17.02 skrev Lars Ingebrigtsen <larsi@gnus.org>:

> Looks good to me.

Thanks, pushed to master.

> bug#36017 looks similar, at least, so it'd be good to poke that bug and
> ask whether this patch fixes that problem, too.

Done.






^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2021-07-13 17:08 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-07-06 19:12 bug#49449: 28: TLS connection never gets to "open" stage Mattias Engdegård
2021-07-07 19:57 ` Lars Ingebrigtsen
2021-07-08  7:59   ` Mattias Engdegård
2021-07-08 12:54     ` Lars Ingebrigtsen
2021-07-08 16:47       ` Mattias Engdegård
2021-07-10 16:27         ` Lars Ingebrigtsen
2021-07-10 16:51           ` Mattias Engdegård
2021-07-10 16:57             ` Lars Ingebrigtsen
2021-07-10 17:07               ` Mattias Engdegård
2021-07-10 18:23               ` Mattias Engdegård
2021-07-10 18:54                 ` Eli Zaretskii
2021-07-10 19:22                   ` Mattias Engdegård
2021-07-10 19:31                     ` Eli Zaretskii
2021-07-10 19:44                       ` Mattias Engdegård
2021-07-11  6:49                         ` Eli Zaretskii
2021-07-11  7:42                           ` Mattias Engdegård
2021-07-11 10:14                             ` Eli Zaretskii
2021-07-11 14:26                               ` Mattias Engdegård
2021-07-11 15:01                                 ` Eli Zaretskii
2021-07-12 14:57                                   ` Mattias Engdegård
2021-07-12 15:02                                     ` Lars Ingebrigtsen
2021-07-13 17:08                                       ` Mattias Engdegård
2021-07-10 20:05                 ` Mattias Engdegård
2021-07-11 11:31                   ` Lars Ingebrigtsen
2021-07-11 11:29                 ` Lars Ingebrigtsen
2021-07-11 14:28                   ` Mattias Engdegård

Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/emacs.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).