unofficial mirror of emacs-devel@gnu.org 
 help / color / mirror / code / Atom feed
* commit-msg hook
@ 2015-04-10 10:43 Eli Zaretskii
  2015-04-10 18:23 ` Johan Bockgård
  2015-04-11  2:42 ` Paul Eggert
  0 siblings, 2 replies; 27+ messages in thread
From: Eli Zaretskii @ 2015-04-10 10:43 UTC (permalink / raw)
  To: Paul Eggert; +Cc: emacs-devel

On MS-Windows, this hook fails commits with valid UTF-8 characters in
the log messages, probably because the version of Gawk I have here is
too old (3.0.4; that's what MSYS comes with).

The mawk 1.3.3 replacement regexp for non_print does work, so I
suggest to have a Gawk version check in the script, and use the
replacement for older Gawk's.

Btw, when the hook is run by Git, the "; see 'CONTRIBUTE'" part of the
message is not shown; could it be that the semi-colon gets interpreted
by the shell, or something like that?  (My Git version is 1.9.5.)

I would also suggest to say explicitly in the message that the commit
was aborted, because it is not at all clear, at least to the
uninitiated.  It wasn't until I did a "git show" that the fact the
commit was aborted became apparent to me.

Thanks.



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: commit-msg hook
  2015-04-10 10:43 commit-msg hook Eli Zaretskii
@ 2015-04-10 18:23 ` Johan Bockgård
  2015-04-11  2:42 ` Paul Eggert
  1 sibling, 0 replies; 27+ messages in thread
From: Johan Bockgård @ 2015-04-10 18:23 UTC (permalink / raw)
  To: emacs-devel


Also, the hook should ignore the stuff after

    # ------------------------ >8 ------------------------
    # Do not touch the line above.
    # Everything below will be removed.

when running the checks on the commit message. (Inserted by commit -v.)




^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: commit-msg hook
  2015-04-10 10:43 commit-msg hook Eli Zaretskii
  2015-04-10 18:23 ` Johan Bockgård
@ 2015-04-11  2:42 ` Paul Eggert
  2015-04-11  7:24   ` Eli Zaretskii
  1 sibling, 1 reply; 27+ messages in thread
From: Paul Eggert @ 2015-04-11  2:42 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Johan Bockgård, emacs-devel

Eli Zaretskii wrote:
> On MS-Windows, this hook fails commits with valid UTF-8 characters in
> the log messages, probably because the version of Gawk I have here is
> too old (3.0.4; that's what MSYS comes with).

Ouch.  That's old.  Why is MSYS still using a circa-1999 gawk?  Can we talk the 
MSYS folks into having something more up-to-date?

> The mawk 1.3.3 replacement regexp for non_print does work, so I
> suggest to have a Gawk version check in the script

It's better to check for features rather than versions.  I got a copy of Gawk 
3.0.4 out of mothballs and hacked for a while until I got something that worked 
for me.

> Btw, when the hook is run by Git, the "; see 'CONTRIBUTE'" part of the
> message is not shown; could it be that the semi-colon gets interpreted
> by the shell, or something like that?  (My Git version is 1.9.5.)

Most likely you've got an obsolete hook.  Please try "git pull" followed by 
"./autogen.sh".

> I would also suggest to say explicitly in the message that the commit
> was aborted, because it is not at all clear

Thanks, good suggestion.

Johan Bockgård wrote:
 > Also, the hook should ignore the stuff after
 >
 >      # ------------------------ >8 ------------------------
 >      # Do not touch the line above.
 >      # Everything below will be removed.
 >
 > when running the checks on the commit message. (Inserted by commit -v.)

Thanks, I had forgotten about commit -v.

I installed some patches to try to fix the above problems (assuming my guess is 
right about the obsolete hook).  You can do an "./autogen.sh" after pulling 
these patches.



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: commit-msg hook
  2015-04-11  2:42 ` Paul Eggert
@ 2015-04-11  7:24   ` Eli Zaretskii
  2015-04-11  9:55     ` Eli Zaretskii
  0 siblings, 1 reply; 27+ messages in thread
From: Eli Zaretskii @ 2015-04-11  7:24 UTC (permalink / raw)
  To: Paul Eggert; +Cc: bojohan, emacs-devel

> Date: Fri, 10 Apr 2015 19:42:53 -0700
> From: Paul Eggert <eggert@cs.ucla.edu>
> CC: emacs-devel@gnu.org, Johan Bockgård
>  <bojohan@gnu.org>
> 
> Eli Zaretskii wrote:
> > On MS-Windows, this hook fails commits with valid UTF-8 characters in
> > the log messages, probably because the version of Gawk I have here is
> > too old (3.0.4; that's what MSYS comes with).
> 
> Ouch.  That's old.  Why is MSYS still using a circa-1999 gawk?  Can we talk the 
> MSYS folks into having something more up-to-date?

Actually, I see now that this old Gawk comes with the Windows port of
Git (which includes part of [an older] MSYS).  The official MSYS
distribution, by contrast, comes with Gawk 3.1.7, which is 10 years
older (2009), but it still doesn't support [[:print:]].  When was that
introduced?

Hmm...  I now tried a native Windows Gawk, and there even Gawk 3.1.4,
the oldest one I still have, works with [[:print:]].  So I think the
problem is not with [[:print:]] support per se, but instead with UTF-8
multibyte characters in the MSYS build of Gawk (the Windows port of
Git also uses MSYS Gawk, just from an older MSYS, it seems).  I think
MSYS only supports non-ASCII characters encoded in codepage 1252.

> It's better to check for features rather than versions.  I got a copy of Gawk 
> 3.0.4 out of mothballs and hacked for a while until I got something that worked 
> for me.

Thanks, this would be even better.

However, the new hook hangs on Windows, in this line:

  print_at_sign='{print substr("'$cent_sign'@", 2)}'

It hangs with both MSYS and native Gawk, including the native build of
the latest Gawk 4.1.1c.

I don't think that UTF-8 will work reliabloy in MS-Windows Gawk,
certainly not with Gawk that comes with msysGit.

So it sounds like on MS-Windows we are stuck with the "approximate"
method of detecting non-printable characters.  One way of detecting
MS-Windows from a shell script is to see whether the environment
variable MSYSTEM is defined.

> > Btw, when the hook is run by Git, the "; see 'CONTRIBUTE'" part of the
> > message is not shown; could it be that the semi-colon gets interpreted
> > by the shell, or something like that?  (My Git version is 1.9.5.)
> 
> Most likely you've got an obsolete hook.  Please try "git pull" followed by 
> "./autogen.sh".

Is it possible to install the hooks in GNUMakefile, so that running
autogen.sh manually wouldn't be needed?  I normally almost never run
autogen.sh, I always just run "make" from the top-level directory.

Thanks.




^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: commit-msg hook
  2015-04-11  7:24   ` Eli Zaretskii
@ 2015-04-11  9:55     ` Eli Zaretskii
  2015-04-11  9:59       ` Eli Zaretskii
  2015-04-11 15:40       ` Paul Eggert
  0 siblings, 2 replies; 27+ messages in thread
From: Eli Zaretskii @ 2015-04-11  9:55 UTC (permalink / raw)
  To: eggert; +Cc: emacs-devel

> Date: Sat, 11 Apr 2015 10:24:05 +0300
> From: Eli Zaretskii <eliz@gnu.org>
> Cc: bojohan@gnu.org, emacs-devel@gnu.org
> 
> However, the new hook hangs on Windows, in this line:
> 
>   print_at_sign='{print substr("'$cent_sign'@", 2)}'
> 
> It hangs with both MSYS and native Gawk, including the native build of
> the latest Gawk 4.1.1c.
> 
> I don't think that UTF-8 will work reliabloy in MS-Windows Gawk,
> certainly not with Gawk that comes with msysGit.

Digging a bit deeper, I see that the problem with hanging is not due
to Gawk, but instead due to 'printf': it doesn't produce the UTF-8
encoding of the percent sign, but some invalid sequence instead.

If I add an explicit

    cent_sign = "¢"

to the Awk part of the script (where the cent sign is UTF-8 encoded),
then the hook correctly processes log messages with UTF-8 encoded
non-ASCII letters and doesn't hang.  It doesn't use the [[:print:]]
alternative, though, probably because UTF-8 encoded characters are not
recognized as such, and a UTF-8 locale cannot be used on MS-Windows to
force that.




^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: commit-msg hook
  2015-04-11  9:55     ` Eli Zaretskii
@ 2015-04-11  9:59       ` Eli Zaretskii
  2015-04-11 12:42         ` Dmitry Gutov
  2015-04-11 15:40       ` Paul Eggert
  1 sibling, 1 reply; 27+ messages in thread
From: Eli Zaretskii @ 2015-04-11  9:59 UTC (permalink / raw)
  To: eggert; +Cc: emacs-devel

> Date: Sat, 11 Apr 2015 12:55:50 +0300
> From: Eli Zaretskii <eliz@gnu.org>
> Cc: emacs-devel@gnu.org
> 
> Digging a bit deeper, I see that the problem with hanging is not due
> to Gawk, but instead due to 'printf': it doesn't produce the UTF-8
> encoding of the percent sign, but some invalid sequence instead.
> 
> If I add an explicit
> 
>     cent_sign = "¢"
> 
> to the Awk part of the script (where the cent sign is UTF-8 encoded),
> then the hook correctly processes log messages with UTF-8 encoded
> non-ASCII letters and doesn't hang.  It doesn't use the [[:print:]]
> alternative, though, probably because UTF-8 encoded characters are not
> recognized as such, and a UTF-8 locale cannot be used on MS-Windows to
> force that.

Would it be too expensive to use Emacs for detecting unprintable
characters, instead of the shell and Awk?  All those issues with
handling UTF-8 encoded characters correctly and portability issues
will then simply disappear.




^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: commit-msg hook
  2015-04-11  9:59       ` Eli Zaretskii
@ 2015-04-11 12:42         ` Dmitry Gutov
  2015-04-11 14:29           ` Eli Zaretskii
  0 siblings, 1 reply; 27+ messages in thread
From: Dmitry Gutov @ 2015-04-11 12:42 UTC (permalink / raw)
  To: Eli Zaretskii, eggert; +Cc: emacs-devel

On 04/11/2015 12:59 PM, Eli Zaretskii wrote:

> Would it be too expensive to use Emacs for detecting unprintable
> characters, instead of the shell and Awk?  All those issues with
> handling UTF-8 encoded characters correctly and portability issues
> will then simply disappear.

There's no guarantee that there is an Emacs executable in PATH, nor that 
src/emacs is compiled and working.



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: commit-msg hook
  2015-04-11 12:42         ` Dmitry Gutov
@ 2015-04-11 14:29           ` Eli Zaretskii
  2015-04-11 15:13             ` Dmitry Gutov
  0 siblings, 1 reply; 27+ messages in thread
From: Eli Zaretskii @ 2015-04-11 14:29 UTC (permalink / raw)
  To: Dmitry Gutov; +Cc: eggert, emacs-devel

> Date: Sat, 11 Apr 2015 15:42:49 +0300
> From: Dmitry Gutov <dgutov@yandex.ru>
> CC: emacs-devel@gnu.org
> 
> On 04/11/2015 12:59 PM, Eli Zaretskii wrote:
> 
> > Would it be too expensive to use Emacs for detecting unprintable
> > characters, instead of the shell and Awk?  All those issues with
> > handling UTF-8 encoded characters correctly and portability issues
> > will then simply disappear.
> 
> There's no guarantee that there is an Emacs executable in PATH, nor that 
> src/emacs is compiled and working.

On Emacs developer's machine?  Blasphemy!

(We only need _some_ Emacs, not the latest one.)



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: commit-msg hook
  2015-04-11 14:29           ` Eli Zaretskii
@ 2015-04-11 15:13             ` Dmitry Gutov
  2015-04-11 15:17               ` Eli Zaretskii
  0 siblings, 1 reply; 27+ messages in thread
From: Dmitry Gutov @ 2015-04-11 15:13 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: eggert, emacs-devel

On 04/11/2015 05:29 PM, Eli Zaretskii wrote:

 >> There's no guarantee that there is an Emacs executable in PATH, nor that
>> src/emacs is compiled and working.
>
> On Emacs developer's machine?  Blasphemy!

I do have older Emacs versions built, but they're not installed globally.

> (We only need _some_ Emacs, not the latest one.)

The one that would be in PATH is the master build, but if it breaks, I'd 
still like to be able to commit... I think.

Overall, I think porting the whole commit-msg script from awk (scary!) 
to Elisp would be nice, but it will create this new complication.

At the very least, it will involve a new configuration step, to somehow 
tell Git which Emacs build to use.



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: commit-msg hook
  2015-04-11 15:13             ` Dmitry Gutov
@ 2015-04-11 15:17               ` Eli Zaretskii
  2015-04-12  3:36                 ` Stefan Monnier
  0 siblings, 1 reply; 27+ messages in thread
From: Eli Zaretskii @ 2015-04-11 15:17 UTC (permalink / raw)
  To: Dmitry Gutov; +Cc: eggert, emacs-devel

> Date: Sat, 11 Apr 2015 18:13:31 +0300
> From: Dmitry Gutov <dgutov@yandex.ru>
> CC: eggert@cs.ucla.edu, emacs-devel@gnu.org
> 
> > (We only need _some_ Emacs, not the latest one.)
> 
> The one that would be in PATH is the master build, but if it breaks, I'd 
> still like to be able to commit... I think.

How is this different from a broken Gawk?

> Overall, I think porting the whole commit-msg script from awk (scary!) 
> to Elisp would be nice, but it will create this new complication.

I'm not sure the complication is new.  I think we can rely on Emacs
being available no less than we rely on Gawk and Perl nowadays.

> At the very least, it will involve a new configuration step, to somehow 
> tell Git which Emacs build to use.

I think we could just rely on its being on PATH.  All we need is Emacs
23.1 or later.



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: commit-msg hook
  2015-04-11  9:55     ` Eli Zaretskii
  2015-04-11  9:59       ` Eli Zaretskii
@ 2015-04-11 15:40       ` Paul Eggert
  2015-04-11 16:40         ` Eli Zaretskii
  1 sibling, 1 reply; 27+ messages in thread
From: Paul Eggert @ 2015-04-11 15:40 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

[-- Attachment #1: Type: text/plain, Size: 1602 bytes --]

Eli Zaretskii wrote:
> the problem with hanging is not due
> to Gawk, but instead due to 'printf': it doesn't produce the UTF-8
> encoding of the percent sign

OK, so we're dealing with a broken MS-Windows shell rather than a broken 
MS-Windows gawk.  That 'printf' command is there only so that the script can be 
pure ASCII, and it's simpler to avoid that nicety.  I installed the attached 
patch to try to work around the shell bug.

As regards the idea of using Emacs rather than the shell, I'm with Dmitry: let's 
not assume Emacs is already working when bootstrapping a new Emacs.  Instead, 
let's merely assume that a new developer has a working shell, since that's 
needed anyway for other reasons.

> Is it possible to install the hooks in GNUMakefile, so that running
> autogen.sh manually wouldn't be needed?  I normally almost never run
> autogen.sh, I always just run "make" from the top-level directory.

I'm leery of that.  GNUMakefile is run by everybody, not just by developers. 
And there are good reasons that the Git hooks are not themselves in the Git 
repository (if they were in the repository, and they were sufficiently messed 
up, then one couldn't 'git commit' fixes).  If we routinely updated the Git 
hooks automatically, we'd make it more likely that Git non-experts would put 
their painfully constructed repositories into a "broken" state, and we already 
have too much panic in that neighborhood.

That being said, it might make sense to separate the Git hook (re)initialization 
into an autogen.sh option, since it's cheap whereas autogen.sh by default is 
expensive.

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 0001-Port-commit-msg-to-broken-MS-Windows-shell.patch --]
[-- Type: text/x-patch; name="0001-Port-commit-msg-to-broken-MS-Windows-shell.patch", Size: 1570 bytes --]

From 2cc9a32d361492f28faeced22e3800eb1a59c252 Mon Sep 17 00:00:00 2001
From: Paul Eggert <eggert@cs.ucla.edu>
Date: Sat, 11 Apr 2015 08:19:13 -0700
Subject: [PATCH] Port commit-msg to broken MS-Windows shell

* build-aux/git-hooks/commit-msg (cent_sign):
Just use UTF-8 here rather than ASCII + printf, as the latter fails
on a broken MS-Windows shell.  Reported by Eli Zaretskii in:
http://lists.gnu.org/archive/html/emacs-devel/2015-04/msg00592.html
---
 build-aux/git-hooks/commit-msg | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/build-aux/git-hooks/commit-msg b/build-aux/git-hooks/commit-msg
index ea8d909..3fc6e19 100755
--- a/build-aux/git-hooks/commit-msg
+++ b/build-aux/git-hooks/commit-msg
@@ -29,8 +29,7 @@ fi
 
 # Use a UTF-8 locale if available, so that the UTF-8 check works.
 # Use U+00A2 CENT SIGN to test whether the locale works.
-cent_sign_utf8_format='\302\242\n'
-cent_sign=`printf "$cent_sign_utf8_format"`
+cent_sign='¢'
 print_at_sign='{print substr("'$cent_sign'@", 2)}'
 at_sign=`$awk "$print_at_sign" 2>/dev/null`
 if test "$at_sign" != @; then
@@ -45,7 +44,7 @@ exec $awk -v at_sign="$at_sign" -v cent_sign="$cent_sign" '
   BEGIN {
     # These regular expressions assume traditional Unix unibyte behavior.
     # They are needed for old or broken versions of awk, e.g.,
-    # mawk 1.3.3 (1996), Gawk 3.0.4 (1999).
+    # mawk 1.3.3 (1996), or gawk on MSYS (2015).
     space = "[ \f\n\r\t\v]"
     non_space = "[^ \f\n\r\t\v]"
     non_print = "[\1-\37\177]"
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: commit-msg hook
  2015-04-11 15:40       ` Paul Eggert
@ 2015-04-11 16:40         ` Eli Zaretskii
  2015-04-11 20:09           ` Paul Eggert
  0 siblings, 1 reply; 27+ messages in thread
From: Eli Zaretskii @ 2015-04-11 16:40 UTC (permalink / raw)
  To: Paul Eggert; +Cc: emacs-devel

> Date: Sat, 11 Apr 2015 08:40:29 -0700
> From: Paul Eggert <eggert@cs.ucla.edu>
> CC: emacs-devel@gnu.org
> 
> Eli Zaretskii wrote:
> > the problem with hanging is not due
> > to Gawk, but instead due to 'printf': it doesn't produce the UTF-8
> > encoding of the percent sign
> 
> OK, so we're dealing with a broken MS-Windows shell rather than a broken 
> MS-Windows gawk.  That 'printf' command is there only so that the script can be 
> pure ASCII, and it's simpler to avoid that nicety.  I installed the attached 
> patch to try to work around the shell bug.

That hangs as well, and I found out why: Gawk waits for input.  (I
don't know why this doesn't work as you intended, seems like another
incompatibility in MSYS Bash.)  To fix, replace this line

   print_at_sign='{print substr("'$cent_sign'@", 2)}'

with this:

   print_at_sign='BEGIN {print substr("'$cent_sign'@", 2)}'

and then even your previous version works.

Sorry for my imperfect debugging, there are too many potentially
problematic constructs in this script, and it's not easy to understand
what exactly misfires, especially when Bash shows me corrupted
non-ASCII strings.

Thanks.



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: commit-msg hook
  2015-04-11 16:40         ` Eli Zaretskii
@ 2015-04-11 20:09           ` Paul Eggert
  2015-04-12 16:10             ` Eli Zaretskii
  2015-04-13 15:48             ` Eli Zaretskii
  0 siblings, 2 replies; 27+ messages in thread
From: Paul Eggert @ 2015-04-11 20:09 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

[-- Attachment #1: Type: text/plain, Size: 429 bytes --]

Eli Zaretskii wrote:
> That hangs as well, and I found out why: Gawk waits for input.

Ah, that could be due to an incompatibility with the MSYS shell; but the BEGIN 
can't hurt, and I installed the attached patch to redirect from stdin as well. 
Since this appears to be the real problem, I'd rather go back to the all-ASCII 
script, for portability to pre-POSIX shells (probably doesn't matter nowadays, 
but shouldn't hurt).


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 0001-Port-commit-msg-to-MSYS-Bash-Gawk.patch --]
[-- Type: text/x-patch; name="0001-Port-commit-msg-to-MSYS-Bash-Gawk.patch", Size: 1549 bytes --]

From ad3a0eac324a7163ab0e345e60f741e2d55087b9 Mon Sep 17 00:00:00 2001
From: Paul Eggert <eggert@cs.ucla.edu>
Date: Sat, 11 Apr 2015 13:04:37 -0700
Subject: [PATCH] Port commit-msg to MSYS Bash+Gawk

See Eli Zaretskii in:
http://lists.gnu.org/archive/html/emacs-devel/2015-04/msg00610.html
* build-aux/git-hooks/commit-msg (cent_sign_utf8_format)
(cent_sign, print_at_sign, at_sign): Revert previous change.
(print_at_sign): Prepend "BEGIN".
(at_sign): Redirect from /dev/null to be safer with pre-POSIX awk.
---
 build-aux/git-hooks/commit-msg | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/build-aux/git-hooks/commit-msg b/build-aux/git-hooks/commit-msg
index 3fc6e19..6e31dbc 100755
--- a/build-aux/git-hooks/commit-msg
+++ b/build-aux/git-hooks/commit-msg
@@ -29,11 +29,12 @@ fi
 
 # Use a UTF-8 locale if available, so that the UTF-8 check works.
 # Use U+00A2 CENT SIGN to test whether the locale works.
-cent_sign='¢'
-print_at_sign='{print substr("'$cent_sign'@", 2)}'
-at_sign=`$awk "$print_at_sign" 2>/dev/null`
+cent_sign_utf8_format='\302\242\n'
+cent_sign=`printf "$cent_sign_utf8_format"`
+print_at_sign='BEGIN {print substr("'$cent_sign'@", 2)}'
+at_sign=`$awk "$print_at_sign" </dev/null 2>/dev/null`
 if test "$at_sign" != @; then
-  at_sign=`LC_ALL=en_US.UTF-8 $awk "$print_at_sign" 2>/dev/null`
+  at_sign=`LC_ALL=en_US.UTF-8 $awk "$print_at_sign" </dev/null 2>/dev/null`
   if test "$at_sign" = @; then
     LC_ALL=en_US.UTF-8; export LC_ALL
   fi
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: commit-msg hook
  2015-04-11 15:17               ` Eli Zaretskii
@ 2015-04-12  3:36                 ` Stefan Monnier
  2015-04-12 18:54                   ` chad
  0 siblings, 1 reply; 27+ messages in thread
From: Stefan Monnier @ 2015-04-12  3:36 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: eggert, emacs-devel, Dmitry Gutov

> I think we could just rely on its being on PATH.  All we need is Emacs
> 23.1 or later.

FWIW, I think that Emacs>=23 is rarely in PATH under Mac OS X.


        Stefan



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: commit-msg hook
  2015-04-11 20:09           ` Paul Eggert
@ 2015-04-12 16:10             ` Eli Zaretskii
  2015-04-13 15:48             ` Eli Zaretskii
  1 sibling, 0 replies; 27+ messages in thread
From: Eli Zaretskii @ 2015-04-12 16:10 UTC (permalink / raw)
  To: Paul Eggert; +Cc: emacs-devel

> Date: Sat, 11 Apr 2015 13:09:12 -0700
> From: Paul Eggert <eggert@cs.ucla.edu>
> CC: emacs-devel@gnu.org
> 
> Eli Zaretskii wrote:
> > That hangs as well, and I found out why: Gawk waits for input.
> 
> Ah, that could be due to an incompatibility with the MSYS shell; but the BEGIN 
> can't hurt, and I installed the attached patch to redirect from stdin as well. 
> Since this appears to be the real problem, I'd rather go back to the all-ASCII 
> script, for portability to pre-POSIX shells (probably doesn't matter nowadays, 
> but shouldn't hurt).

Thanks, this version works for me.

Once again, sorry for not detecting the real culprit sooner.



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: commit-msg hook
  2015-04-12  3:36                 ` Stefan Monnier
@ 2015-04-12 18:54                   ` chad
  0 siblings, 0 replies; 27+ messages in thread
From: chad @ 2015-04-12 18:54 UTC (permalink / raw)
  To: emacs-devel


> On 11 Apr 2015, at 20:36, Stefan Monnier <monnier@iro.umontreal.ca> wrote:
> 
>> I think we could just rely on its being on PATH.  All we need is Emacs
>> 23.1 or later.
> 
> FWIW, I think that Emacs>=23 is rarely in PATH under Mac OS X.

Stefan is correct. The shipped-with default emacs on the latest macosx is:

	GNU Emacs 22.1.1 (mac-apple-darwin) of 2015-03-23 on osx202.apple.com

I believe that the majority of users who install a newer version will NOT end up with it in PATH.

~Chad


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: commit-msg hook
  2015-04-11 20:09           ` Paul Eggert
  2015-04-12 16:10             ` Eli Zaretskii
@ 2015-04-13 15:48             ` Eli Zaretskii
  2015-04-13 18:37               ` Paul Eggert
  1 sibling, 1 reply; 27+ messages in thread
From: Eli Zaretskii @ 2015-04-13 15:48 UTC (permalink / raw)
  To: Paul Eggert; +Cc: emacs-devel

> Date: Sat, 11 Apr 2015 13:09:12 -0700
> From: Paul Eggert <eggert@cs.ucla.edu>
> Cc: emacs-devel@gnu.org
> 
> Since this appears to be the real problem, I'd rather go back to the all-ASCII 
> script, for portability to pre-POSIX shells (probably doesn't matter nowadays, 
> but shouldn't hurt).

Actually, I've been thinking: why do we need to rely on system
libraries to implement UTF-8 and [:print:] correctly?  According to
the Unicode Standard, [:print:] should reject only a small number of
special characters:

  . control characters between 0x01 and 0x1f and 0x7f to 0x9f
  . surrogates
  . unassigned codepoints

I think we should not detect unassigned codepoints, since hundreds of
them are assigned every year, so we are likely to trigger false
positives.

As for the other two groups, it should be quite easy to detect their
UTF-8 sequences with relatively simple regular expressions, without
relying on system libraries or up-to-date Gawk/whatever, if we assume
that commit log messages must always be UTF-8 encoded.  WDYT?



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: commit-msg hook
  2015-04-13 15:48             ` Eli Zaretskii
@ 2015-04-13 18:37               ` Paul Eggert
  2015-04-13 20:18                 ` Eli Zaretskii
  0 siblings, 1 reply; 27+ messages in thread
From: Paul Eggert @ 2015-04-13 18:37 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

On 04/13/2015 08:48 AM, Eli Zaretskii wrote:
> why do we need to rely on system
> libraries to implement UTF-8 and [:print:] correctly?
Some POSIXish environments do not support unibyte locales; even the C 
locale is multibyte (it uses UTF-8).  In such environments, I expect 
that a regular expression like /\342\202\254/ won't necessarily match 
the string "€" (U+20AC, equivalent to the byte string "\342\202\254") in 
any locale, just as in Emacs the call (re-search-forward "\342\202\254") 
won't find the string "€" in a UTF-8 file, regardless of language settings.

I suppose the script could detect whether we're in such an environment, 
but I dunno, it sounds like more trouble than it's worth.



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: commit-msg hook
  2015-04-13 18:37               ` Paul Eggert
@ 2015-04-13 20:18                 ` Eli Zaretskii
  2015-04-13 21:19                   ` Paul Eggert
  0 siblings, 1 reply; 27+ messages in thread
From: Eli Zaretskii @ 2015-04-13 20:18 UTC (permalink / raw)
  To: Paul Eggert; +Cc: emacs-devel

> Date: Mon, 13 Apr 2015 11:37:35 -0700
> From: Paul Eggert <eggert@cs.ucla.edu>
> CC: emacs-devel@gnu.org
> 
> On 04/13/2015 08:48 AM, Eli Zaretskii wrote:
> > why do we need to rely on system
> > libraries to implement UTF-8 and [:print:] correctly?
> Some POSIXish environments do not support unibyte locales; even the C 
> locale is multibyte (it uses UTF-8).

Gawk has the --characters-as-bytes option since v4.0.0, which should
countermand that, I think.



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: commit-msg hook
  2015-04-13 20:18                 ` Eli Zaretskii
@ 2015-04-13 21:19                   ` Paul Eggert
  2015-04-14 15:08                     ` Eli Zaretskii
  0 siblings, 1 reply; 27+ messages in thread
From: Paul Eggert @ 2015-04-13 21:19 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

On 04/13/2015 01:18 PM, Eli Zaretskii wrote:
> Gawk has the --characters-as-bytes option since v4.0.0, which should
> countermand that, I think.

Sure, although the code should work even plain POSIX awk, as there 
should be no need to assume such a GNU extension when bootstrapping.  
That is, the script could support either:

1. POSIX awk with multibyte OS support, with proper UTF-8 checking from 
OS libraries; or

2. GNU awk 4 (2012) or later, with nearly-as-good UTF-8 checking 
hand-coded into the script; or

3. Traditional awk without UTF-8 checking.

Currently the script supports (1) and (3) but someone could add support 
for (2).



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: commit-msg hook
  2015-04-13 21:19                   ` Paul Eggert
@ 2015-04-14 15:08                     ` Eli Zaretskii
  2015-04-14 17:01                       ` Paul Eggert
  0 siblings, 1 reply; 27+ messages in thread
From: Eli Zaretskii @ 2015-04-14 15:08 UTC (permalink / raw)
  To: Paul Eggert; +Cc: emacs-devel

> Date: Mon, 13 Apr 2015 14:19:51 -0700
> From: Paul Eggert <eggert@cs.ucla.edu>
> CC: emacs-devel@gnu.org
> 
> On 04/13/2015 01:18 PM, Eli Zaretskii wrote:
> > Gawk has the --characters-as-bytes option since v4.0.0, which should
> > countermand that, I think.
> 
> Sure, although the code should work even plain POSIX awk, as there 
> should be no need to assume such a GNU extension when bootstrapping.  
> That is, the script could support either:
> 
> 1. POSIX awk with multibyte OS support, with proper UTF-8 checking from 
> OS libraries; or
> 
> 2. GNU awk 4 (2012) or later, with nearly-as-good UTF-8 checking 
> hand-coded into the script; or
> 
> 3. Traditional awk without UTF-8 checking.
> 
> Currently the script supports (1) and (3) but someone could add support 
> for (2).

How about the following change?  It improves on (3), and worked for me
both on MS-Windows and on GNU/Linux.

--- ./.git/hooks/commit-msg.~5~	2015-04-12 19:11:27.481125000 +0300
+++ ./.git/hooks/commit-msg	2015-04-14 11:11:02.000000000 +0300
@@ -45,10 +45,13 @@
   BEGIN {
     # These regular expressions assume traditional Unix unibyte behavior.
     # They are needed for old or broken versions of awk, e.g.,
-    # mawk 1.3.3 (1996), or gawk on MSYS (2015).
+    # mawk 1.3.3 (1996), or gawk on MSYS (2015), and/or for systems that
+    # cannot use UTF-8 as the codeset for the locale.
     space = "[ \f\n\r\t\v]"
     non_space = "[^ \f\n\r\t\v]"
-    non_print = "[\1-\37\177]"
+    # The non_print below rejects control characters and surrogates
+    # UTF-8 for: 0x01-0x1f 0x7f   0x80-0x9f    0xd800-0xdbff     0xdc00-0xdfff
+    non_print = "[\1-\37\177]|\302[\200-\237]|\355([\240-\257]|[\260-\277])[\200-\277]"
 
     # Prefer POSIX regular expressions if available, as they do a
     # better job of checking.  Similarly, prefer POSIX negated



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: commit-msg hook
  2015-04-14 15:08                     ` Eli Zaretskii
@ 2015-04-14 17:01                       ` Paul Eggert
  2015-04-14 17:09                         ` Eli Zaretskii
  0 siblings, 1 reply; 27+ messages in thread
From: Paul Eggert @ 2015-04-14 17:01 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

On 04/14/2015 08:08 AM, Eli Zaretskii wrote:
> +    non_print = "[\1-\37\177]|\302[\200-\237]|\355([\240-\257]|[\260-\277])[\200-\277]"

This sort of thing should work in a unibyte environment, but it needs to 
be used only after testing that we actually are in a unibyte environment.



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: commit-msg hook
  2015-04-14 17:01                       ` Paul Eggert
@ 2015-04-14 17:09                         ` Eli Zaretskii
  2015-04-14 17:42                           ` Paul Eggert
  0 siblings, 1 reply; 27+ messages in thread
From: Eli Zaretskii @ 2015-04-14 17:09 UTC (permalink / raw)
  To: Paul Eggert; +Cc: emacs-devel

> Date: Tue, 14 Apr 2015 10:01:12 -0700
> From: Paul Eggert <eggert@cs.ucla.edu>
> CC: emacs-devel@gnu.org
> 
> On 04/14/2015 08:08 AM, Eli Zaretskii wrote:
> > +    non_print = "[\1-\37\177]|\302[\200-\237]|\355([\240-\257]|[\260-\277])[\200-\277]"
> 
> This sort of thing should work in a unibyte environment, but it needs to 
> be used only after testing that we actually are in a unibyte environment.

I thought that's what all the tests with cent_sign and at_sign do,
don't they?

Anyway, what bad things could happen if this regular expression is
used in a multibyte environment?



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: commit-msg hook
  2015-04-14 17:09                         ` Eli Zaretskii
@ 2015-04-14 17:42                           ` Paul Eggert
  2015-04-14 18:01                             ` Eli Zaretskii
  0 siblings, 1 reply; 27+ messages in thread
From: Paul Eggert @ 2015-04-14 17:42 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

On 04/14/2015 10:09 AM, Eli Zaretskii wrote:
>> This sort of thing should work in a unibyte environment, but it needs to
>> >be used only after testing that we actually are in a unibyte environment.
> I thought that's what all the tests with cent_sign and at_sign do,
> don't they?

No, they test for something more specific, namely, whether we're in a 
UTF-8 locale.  Not every multibyte locale uses UTF-8.

> what bad things could happen if this regular expression is
> used in a multibyte environment?
I suppose it could cause the script to print "Unprintable character in 
commit message" even though all the message's characters are actually 
printable.

How about this idea?  Before falling back to the unibyte regular 
expressions in awk, set LC_ALL='C' in the environment.  This should work 
well enough, as in practice all environments where the C locale is 
multibyte have working UTF-8 so they won't need to fall back to unibyte 
anyway.



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: commit-msg hook
  2015-04-14 17:42                           ` Paul Eggert
@ 2015-04-14 18:01                             ` Eli Zaretskii
  2015-04-14 18:32                               ` Paul Eggert
  0 siblings, 1 reply; 27+ messages in thread
From: Eli Zaretskii @ 2015-04-14 18:01 UTC (permalink / raw)
  To: Paul Eggert; +Cc: emacs-devel

> Date: Tue, 14 Apr 2015 10:42:53 -0700
> From: Paul Eggert <eggert@cs.ucla.edu>
> CC: emacs-devel@gnu.org
> 
> How about this idea?  Before falling back to the unibyte regular 
> expressions in awk, set LC_ALL='C' in the environment.  This should work 
> well enough, as in practice all environments where the C locale is 
> multibyte have working UTF-8 so they won't need to fall back to unibyte 
> anyway.

You mean, like below?

--- ./.git/hooks/commit-msg.~5~	2015-04-12 19:11:27.481125000 +0300
+++ ./.git/hooks/commit-msg	2015-04-14 21:01:14.481125000 +0300
@@ -37,6 +37,8 @@
   at_sign=`LC_ALL=en_US.UTF-8 $awk "$print_at_sign" </dev/null 2>/dev/null`
   if test "$at_sign" = @; then
     LC_ALL=en_US.UTF-8; export LC_ALL
+  else
+    LC_ALL=C; export LC_ALL
   fi
 fi
 
@@ -45,10 +47,13 @@
   BEGIN {
     # These regular expressions assume traditional Unix unibyte behavior.
     # They are needed for old or broken versions of awk, e.g.,
-    # mawk 1.3.3 (1996), or gawk on MSYS (2015).
+    # mawk 1.3.3 (1996), or gawk on MSYS (2015), and/or for systems that
+    # cannot use UTF-8 as the codeset for the locale.
     space = "[ \f\n\r\t\v]"
     non_space = "[^ \f\n\r\t\v]"
-    non_print = "[\1-\37\177]"
+    # The non_print below rejects control characters and surrogates
+    # UTF-8 for: 0x01-0x1f 0x7f   0x80-0x9f    0xd800-0xdbff     0xdc00-0xdfff
+    non_print = "[\1-\37\177]|\302[\200-\237]|\355([\240-\257]|[\260-\277])[\200-\277]"
 
     # Prefer POSIX regular expressions if available, as they do a
     # better job of checking.  Similarly, prefer POSIX negated



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: commit-msg hook
  2015-04-14 18:01                             ` Eli Zaretskii
@ 2015-04-14 18:32                               ` Paul Eggert
  2015-04-14 18:59                                 ` Eli Zaretskii
  0 siblings, 1 reply; 27+ messages in thread
From: Paul Eggert @ 2015-04-14 18:32 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

On 04/14/2015 11:01 AM, Eli Zaretskii wrote:
> You mean, like below?

Looks good, yes.  Some nits:

>
> --- ./.git/hooks/commit-msg.~5~	2015-04-12 19:11:27.481125000 +0300
> +++ ./.git/hooks/commit-msg	2015-04-14 21:01:14.481125000 +0300
> @@ -37,6 +37,8 @@
>     at_sign=`LC_ALL=en_US.UTF-8 $awk "$print_at_sign" </dev/null 2>/dev/null`
>     if test "$at_sign" = @; then
>       LC_ALL=en_US.UTF-8; export LC_ALL
> +  else
> +    LC_ALL=C; export LC_ALL
>     fi
>   fi
>   

The duplicate 'export LC_ALL's can be hoisted out of the 'if'.
> +    non_print = "[\1-\37\177]|\302[\200-\237]|\355([\240-\257]|[\260-\277])[\200-\277]"
>   
>   

"([\240-\257]|[\260-\277])" can be simplified to "[\240-\277]".



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: commit-msg hook
  2015-04-14 18:32                               ` Paul Eggert
@ 2015-04-14 18:59                                 ` Eli Zaretskii
  0 siblings, 0 replies; 27+ messages in thread
From: Eli Zaretskii @ 2015-04-14 18:59 UTC (permalink / raw)
  To: Paul Eggert; +Cc: emacs-devel

> Date: Tue, 14 Apr 2015 11:32:51 -0700
> From: Paul Eggert <eggert@cs.ucla.edu>
> CC: emacs-devel@gnu.org
> 
> On 04/14/2015 11:01 AM, Eli Zaretskii wrote:
> > You mean, like below?
> 
> Looks good, yes.  Some nits:
> 
> >
> > --- ./.git/hooks/commit-msg.~5~	2015-04-12 19:11:27.481125000 +0300
> > +++ ./.git/hooks/commit-msg	2015-04-14 21:01:14.481125000 +0300
> > @@ -37,6 +37,8 @@
> >     at_sign=`LC_ALL=en_US.UTF-8 $awk "$print_at_sign" </dev/null 2>/dev/null`
> >     if test "$at_sign" = @; then
> >       LC_ALL=en_US.UTF-8; export LC_ALL
> > +  else
> > +    LC_ALL=C; export LC_ALL
> >     fi
> >   fi
> >   
> 
> The duplicate 'export LC_ALL's can be hoisted out of the 'if'.
> > +    non_print = "[\1-\37\177]|\302[\200-\237]|\355([\240-\257]|[\260-\277])[\200-\277]"
> >   
> >   
> 
> "([\240-\257]|[\260-\277])" can be simplified to "[\240-\277]".

Thanks, I made those changes and pushed.



^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2015-04-14 18:59 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-04-10 10:43 commit-msg hook Eli Zaretskii
2015-04-10 18:23 ` Johan Bockgård
2015-04-11  2:42 ` Paul Eggert
2015-04-11  7:24   ` Eli Zaretskii
2015-04-11  9:55     ` Eli Zaretskii
2015-04-11  9:59       ` Eli Zaretskii
2015-04-11 12:42         ` Dmitry Gutov
2015-04-11 14:29           ` Eli Zaretskii
2015-04-11 15:13             ` Dmitry Gutov
2015-04-11 15:17               ` Eli Zaretskii
2015-04-12  3:36                 ` Stefan Monnier
2015-04-12 18:54                   ` chad
2015-04-11 15:40       ` Paul Eggert
2015-04-11 16:40         ` Eli Zaretskii
2015-04-11 20:09           ` Paul Eggert
2015-04-12 16:10             ` Eli Zaretskii
2015-04-13 15:48             ` Eli Zaretskii
2015-04-13 18:37               ` Paul Eggert
2015-04-13 20:18                 ` Eli Zaretskii
2015-04-13 21:19                   ` Paul Eggert
2015-04-14 15:08                     ` Eli Zaretskii
2015-04-14 17:01                       ` Paul Eggert
2015-04-14 17:09                         ` Eli Zaretskii
2015-04-14 17:42                           ` Paul Eggert
2015-04-14 18:01                             ` Eli Zaretskii
2015-04-14 18:32                               ` Paul Eggert
2015-04-14 18:59                                 ` Eli Zaretskii

Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/emacs.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).