It's still not what I would call "fast", though, usually taking a couple of seconds... but a cursory investigation so far points the finger at garbage collection being triggered during frame creation, more often than not. If I raise gc-cons-threshold by a factor of 10, frame creation and display are fairly quick (presumably only 90-95% of the time). The C stack traces from the garbage collection calls are pretty boring (maybe_gc gets called from Ffuncall, which gets called from exec_byte_code, etc.); the Lisp backtraces point the finger mostly at internal-face-x-get-resource as the function being entered when GC gets invoked.
Lisp Backtrace:
"internal-face-x-get-resource" (0x1910fb80)
"set-face-attribute-from-resource" (0x1910fd80)
"set-face-attributes-from-resources" (0x1910ff80)
"make-face-x-resource-internal" (0x19110170)
"face-spec-recalc" (0x19110360)
"face-set-after-frame-default" (0x19110560)
"x-create-frame-with-faces" (0x19110758)
0x121eac8 PVEC_COMPILED
"apply" (0x19110a10)
"frame-creation-function" (0x19110c10)
"make-frame" (0x19110dc0)
"make-frame-on-display" (0x19110f88)
"server-create-window-system-frame" (0x191111b8)
"server-process-filter" (0x19111398)
A memory profile report of frame creation via emacsclient includes this breakdown:
- server-process-filter                           2,898,676  2%
 - server-create-window-system-frame              1,526,354  1%
  - make-frame-on-display                         1,522,279  1%
   - make-frame                                   1,522,279  1%
    - frame-creation-function                     1,509,219  1%
     - apply                                      1,509,219  1%
      - #<compiled 0x487ab3>                      1,509,219  1%
       - x-create-frame-with-faces                1,509,219  1%
        - face-set-after-frame-default              372,148  0%
         - face-spec-recalc                         371,868  0%
          - make-face-x-resource-internal           368,460  0%
             set-face-attributes-from-resources     368,460  0%
          + face-spec-choose                          2,384  0%
          + face-spec-set-2                           1,024  0%
      normal-erase-is-backspace-setup-frame          13,036  0%
    + run-hook-with-args                                 24  0%
    window-system-for-display                         1,024  0%
 - server-execute-continuation                    1,355,452  1%
  - #<compiled 0x175dd37>                         1,355,452  1%
   - server-execute                               1,349,604  1%
      switch-to-buffer                            1,269,569  1%
    + server-delete-client                           36,910  0%
However, I'm skeptical of the numbers, since the report also indicated that read-from-minibuffer itself (not counting the functions it called) used 100M (bytes? cells?) in those few seconds. (Hence the roughly 2.9M shown here being such a tiny percentage.) And the numbers vary a lot from one attempt to another, though the proportions seem fairly consistent.