* commit-msg hook @ 2015-04-10 10:43 Eli Zaretskii 2015-04-10 18:23 ` Johan Bockgård 2015-04-11 2:42 ` Paul Eggert 0 siblings, 2 replies; 27+ messages in thread From: Eli Zaretskii @ 2015-04-10 10:43 UTC (permalink / raw) To: Paul Eggert; +Cc: emacs-devel On MS-Windows, this hook fails commits with valid UTF-8 characters in the log messages, probably because the version of Gawk I have here is too old (3.0.4; that's what MSYS comes with). The mawk 1.3.3 replacement regexp for non_print does work, so I suggest to have a Gawk version check in the script, and use the replacement for older Gawk's. Btw, when the hook is run by Git, the "; see 'CONTRIBUTE'" part of the message is not shown; could it be that the semi-colon gets interpreted by the shell, or something like that? (My Git version is 1.9.5.) I would also suggest to say explicitly in the message that the commit was aborted, because it is not at all clear, at least to the uninitiated. It wasn't until I did a "git show" that the fact the commit was aborted became apparent to me. Thanks. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: commit-msg hook 2015-04-10 10:43 commit-msg hook Eli Zaretskii @ 2015-04-10 18:23 ` Johan Bockgård 2015-04-11 2:42 ` Paul Eggert 1 sibling, 0 replies; 27+ messages in thread From: Johan Bockgård @ 2015-04-10 18:23 UTC (permalink / raw) To: emacs-devel Also, the hook should ignore the stuff after # ------------------------ >8 ------------------------ # Do not touch the line above. # Everything below will be removed. when running the checks on the commit message. (Inserted by commit -v.) ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: commit-msg hook 2015-04-10 10:43 commit-msg hook Eli Zaretskii 2015-04-10 18:23 ` Johan Bockgård @ 2015-04-11 2:42 ` Paul Eggert 2015-04-11 7:24 ` Eli Zaretskii 1 sibling, 1 reply; 27+ messages in thread From: Paul Eggert @ 2015-04-11 2:42 UTC (permalink / raw) To: Eli Zaretskii; +Cc: Johan Bockgård, emacs-devel Eli Zaretskii wrote: > On MS-Windows, this hook fails commits with valid UTF-8 characters in > the log messages, probably because the version of Gawk I have here is > too old (3.0.4; that's what MSYS comes with). Ouch. That's old. Why is MSYS still using a circa-1999 gawk? Can we talk the MSYS folks into having something more up-to-date? > The mawk 1.3.3 replacement regexp for non_print does work, so I > suggest to have a Gawk version check in the script It's better to check for features rather than versions. I got a copy of Gawk 3.0.4 out of mothballs and hacked for a while until I got something that worked for me. > Btw, when the hook is run by Git, the "; see 'CONTRIBUTE'" part of the > message is not shown; could it be that the semi-colon gets interpreted > by the shell, or something like that? (My Git version is 1.9.5.) Most likely you've got an obsolete hook. Please try "git pull" followed by "./autogen.sh". > I would also suggest to say explicitly in the message that the commit > was aborted, because it is not at all clear Thanks, good suggestion. Johan Bockgård wrote: > Also, the hook should ignore the stuff after > > # ------------------------ >8 ------------------------ > # Do not touch the line above. > # Everything below will be removed. > > when running the checks on the commit message. (Inserted by commit -v.) Thanks, I had forgotten about commit -v. I installed some patches to try to fix the above problems (assuming my guess is right about the obsolete hook). You can do an "./autogen.sh" after pulling these patches. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: commit-msg hook 2015-04-11 2:42 ` Paul Eggert @ 2015-04-11 7:24 ` Eli Zaretskii 2015-04-11 9:55 ` Eli Zaretskii 0 siblings, 1 reply; 27+ messages in thread From: Eli Zaretskii @ 2015-04-11 7:24 UTC (permalink / raw) To: Paul Eggert; +Cc: bojohan, emacs-devel > Date: Fri, 10 Apr 2015 19:42:53 -0700 > From: Paul Eggert <eggert@cs.ucla.edu> > CC: emacs-devel@gnu.org, Johan Bockgård > <bojohan@gnu.org> > > Eli Zaretskii wrote: > > On MS-Windows, this hook fails commits with valid UTF-8 characters in > > the log messages, probably because the version of Gawk I have here is > > too old (3.0.4; that's what MSYS comes with). > > Ouch. That's old. Why is MSYS still using a circa-1999 gawk? Can we talk the > MSYS folks into having something more up-to-date? Actually, I see now that this old Gawk comes with the Windows port of Git (which includes part of [an older] MSYS). The official MSYS distribution, by contrast, comes with Gawk 3.1.7, which is 10 years older (2009), but it still doesn't support [[:print:]]. When was that introduced? Hmm... I now tried a native Windows Gawk, and there even Gawk 3.1.4, the oldest one I still have, works with [[:print:]]. So I think the problem is not with [[:print:]] support per se, but instead with UTF-8 multibyte characters in the MSYS build of Gawk (the Windows port of Git also uses MSYS Gawk, just from an older MSYS, it seems). I think MSYS only supports non-ASCII characters encoded in codepage 1252. > It's better to check for features rather than versions. I got a copy of Gawk > 3.0.4 out of mothballs and hacked for a while until I got something that worked > for me. Thanks, this would be even better. However, the new hook hangs on Windows, in this line: print_at_sign='{print substr("'$cent_sign'@", 2)}' It hangs with both MSYS and native Gawk, including the native build of the latest Gawk 4.1.1c. I don't think that UTF-8 will work reliabloy in MS-Windows Gawk, certainly not with Gawk that comes with msysGit. So it sounds like on MS-Windows we are stuck with the "approximate" method of detecting non-printable characters. One way of detecting MS-Windows from a shell script is to see whether the environment variable MSYSTEM is defined. > > Btw, when the hook is run by Git, the "; see 'CONTRIBUTE'" part of the > > message is not shown; could it be that the semi-colon gets interpreted > > by the shell, or something like that? (My Git version is 1.9.5.) > > Most likely you've got an obsolete hook. Please try "git pull" followed by > "./autogen.sh". Is it possible to install the hooks in GNUMakefile, so that running autogen.sh manually wouldn't be needed? I normally almost never run autogen.sh, I always just run "make" from the top-level directory. Thanks. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: commit-msg hook 2015-04-11 7:24 ` Eli Zaretskii @ 2015-04-11 9:55 ` Eli Zaretskii 2015-04-11 9:59 ` Eli Zaretskii 2015-04-11 15:40 ` Paul Eggert 0 siblings, 2 replies; 27+ messages in thread From: Eli Zaretskii @ 2015-04-11 9:55 UTC (permalink / raw) To: eggert; +Cc: emacs-devel > Date: Sat, 11 Apr 2015 10:24:05 +0300 > From: Eli Zaretskii <eliz@gnu.org> > Cc: bojohan@gnu.org, emacs-devel@gnu.org > > However, the new hook hangs on Windows, in this line: > > print_at_sign='{print substr("'$cent_sign'@", 2)}' > > It hangs with both MSYS and native Gawk, including the native build of > the latest Gawk 4.1.1c. > > I don't think that UTF-8 will work reliabloy in MS-Windows Gawk, > certainly not with Gawk that comes with msysGit. Digging a bit deeper, I see that the problem with hanging is not due to Gawk, but instead due to 'printf': it doesn't produce the UTF-8 encoding of the percent sign, but some invalid sequence instead. If I add an explicit cent_sign = "¢" to the Awk part of the script (where the cent sign is UTF-8 encoded), then the hook correctly processes log messages with UTF-8 encoded non-ASCII letters and doesn't hang. It doesn't use the [[:print:]] alternative, though, probably because UTF-8 encoded characters are not recognized as such, and a UTF-8 locale cannot be used on MS-Windows to force that. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: commit-msg hook 2015-04-11 9:55 ` Eli Zaretskii @ 2015-04-11 9:59 ` Eli Zaretskii 2015-04-11 12:42 ` Dmitry Gutov 2015-04-11 15:40 ` Paul Eggert 1 sibling, 1 reply; 27+ messages in thread From: Eli Zaretskii @ 2015-04-11 9:59 UTC (permalink / raw) To: eggert; +Cc: emacs-devel > Date: Sat, 11 Apr 2015 12:55:50 +0300 > From: Eli Zaretskii <eliz@gnu.org> > Cc: emacs-devel@gnu.org > > Digging a bit deeper, I see that the problem with hanging is not due > to Gawk, but instead due to 'printf': it doesn't produce the UTF-8 > encoding of the percent sign, but some invalid sequence instead. > > If I add an explicit > > cent_sign = "¢" > > to the Awk part of the script (where the cent sign is UTF-8 encoded), > then the hook correctly processes log messages with UTF-8 encoded > non-ASCII letters and doesn't hang. It doesn't use the [[:print:]] > alternative, though, probably because UTF-8 encoded characters are not > recognized as such, and a UTF-8 locale cannot be used on MS-Windows to > force that. Would it be too expensive to use Emacs for detecting unprintable characters, instead of the shell and Awk? All those issues with handling UTF-8 encoded characters correctly and portability issues will then simply disappear. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: commit-msg hook 2015-04-11 9:59 ` Eli Zaretskii @ 2015-04-11 12:42 ` Dmitry Gutov 2015-04-11 14:29 ` Eli Zaretskii 0 siblings, 1 reply; 27+ messages in thread From: Dmitry Gutov @ 2015-04-11 12:42 UTC (permalink / raw) To: Eli Zaretskii, eggert; +Cc: emacs-devel On 04/11/2015 12:59 PM, Eli Zaretskii wrote: > Would it be too expensive to use Emacs for detecting unprintable > characters, instead of the shell and Awk? All those issues with > handling UTF-8 encoded characters correctly and portability issues > will then simply disappear. There's no guarantee that there is an Emacs executable in PATH, nor that src/emacs is compiled and working. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: commit-msg hook 2015-04-11 12:42 ` Dmitry Gutov @ 2015-04-11 14:29 ` Eli Zaretskii 2015-04-11 15:13 ` Dmitry Gutov 0 siblings, 1 reply; 27+ messages in thread From: Eli Zaretskii @ 2015-04-11 14:29 UTC (permalink / raw) To: Dmitry Gutov; +Cc: eggert, emacs-devel > Date: Sat, 11 Apr 2015 15:42:49 +0300 > From: Dmitry Gutov <dgutov@yandex.ru> > CC: emacs-devel@gnu.org > > On 04/11/2015 12:59 PM, Eli Zaretskii wrote: > > > Would it be too expensive to use Emacs for detecting unprintable > > characters, instead of the shell and Awk? All those issues with > > handling UTF-8 encoded characters correctly and portability issues > > will then simply disappear. > > There's no guarantee that there is an Emacs executable in PATH, nor that > src/emacs is compiled and working. On Emacs developer's machine? Blasphemy! (We only need _some_ Emacs, not the latest one.) ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: commit-msg hook 2015-04-11 14:29 ` Eli Zaretskii @ 2015-04-11 15:13 ` Dmitry Gutov 2015-04-11 15:17 ` Eli Zaretskii 0 siblings, 1 reply; 27+ messages in thread From: Dmitry Gutov @ 2015-04-11 15:13 UTC (permalink / raw) To: Eli Zaretskii; +Cc: eggert, emacs-devel On 04/11/2015 05:29 PM, Eli Zaretskii wrote: >> There's no guarantee that there is an Emacs executable in PATH, nor that >> src/emacs is compiled and working. > > On Emacs developer's machine? Blasphemy! I do have older Emacs versions built, but they're not installed globally. > (We only need _some_ Emacs, not the latest one.) The one that would be in PATH is the master build, but if it breaks, I'd still like to be able to commit... I think. Overall, I think porting the whole commit-msg script from awk (scary!) to Elisp would be nice, but it will create this new complication. At the very least, it will involve a new configuration step, to somehow tell Git which Emacs build to use. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: commit-msg hook 2015-04-11 15:13 ` Dmitry Gutov @ 2015-04-11 15:17 ` Eli Zaretskii 2015-04-12 3:36 ` Stefan Monnier 0 siblings, 1 reply; 27+ messages in thread From: Eli Zaretskii @ 2015-04-11 15:17 UTC (permalink / raw) To: Dmitry Gutov; +Cc: eggert, emacs-devel > Date: Sat, 11 Apr 2015 18:13:31 +0300 > From: Dmitry Gutov <dgutov@yandex.ru> > CC: eggert@cs.ucla.edu, emacs-devel@gnu.org > > > (We only need _some_ Emacs, not the latest one.) > > The one that would be in PATH is the master build, but if it breaks, I'd > still like to be able to commit... I think. How is this different from a broken Gawk? > Overall, I think porting the whole commit-msg script from awk (scary!) > to Elisp would be nice, but it will create this new complication. I'm not sure the complication is new. I think we can rely on Emacs being available no less than we rely on Gawk and Perl nowadays. > At the very least, it will involve a new configuration step, to somehow > tell Git which Emacs build to use. I think we could just rely on its being on PATH. All we need is Emacs 23.1 or later. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: commit-msg hook 2015-04-11 15:17 ` Eli Zaretskii @ 2015-04-12 3:36 ` Stefan Monnier 2015-04-12 18:54 ` chad 0 siblings, 1 reply; 27+ messages in thread From: Stefan Monnier @ 2015-04-12 3:36 UTC (permalink / raw) To: Eli Zaretskii; +Cc: eggert, emacs-devel, Dmitry Gutov > I think we could just rely on its being on PATH. All we need is Emacs > 23.1 or later. FWIW, I think that Emacs>=23 is rarely in PATH under Mac OS X. Stefan ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: commit-msg hook 2015-04-12 3:36 ` Stefan Monnier @ 2015-04-12 18:54 ` chad 0 siblings, 0 replies; 27+ messages in thread From: chad @ 2015-04-12 18:54 UTC (permalink / raw) To: emacs-devel > On 11 Apr 2015, at 20:36, Stefan Monnier <monnier@iro.umontreal.ca> wrote: > >> I think we could just rely on its being on PATH. All we need is Emacs >> 23.1 or later. > > FWIW, I think that Emacs>=23 is rarely in PATH under Mac OS X. Stefan is correct. The shipped-with default emacs on the latest macosx is: GNU Emacs 22.1.1 (mac-apple-darwin) of 2015-03-23 on osx202.apple.com I believe that the majority of users who install a newer version will NOT end up with it in PATH. ~Chad ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: commit-msg hook 2015-04-11 9:55 ` Eli Zaretskii 2015-04-11 9:59 ` Eli Zaretskii @ 2015-04-11 15:40 ` Paul Eggert 2015-04-11 16:40 ` Eli Zaretskii 1 sibling, 1 reply; 27+ messages in thread From: Paul Eggert @ 2015-04-11 15:40 UTC (permalink / raw) To: Eli Zaretskii; +Cc: emacs-devel [-- Attachment #1: Type: text/plain, Size: 1602 bytes --] Eli Zaretskii wrote: > the problem with hanging is not due > to Gawk, but instead due to 'printf': it doesn't produce the UTF-8 > encoding of the percent sign OK, so we're dealing with a broken MS-Windows shell rather than a broken MS-Windows gawk. That 'printf' command is there only so that the script can be pure ASCII, and it's simpler to avoid that nicety. I installed the attached patch to try to work around the shell bug. As regards the idea of using Emacs rather than the shell, I'm with Dmitry: let's not assume Emacs is already working when bootstrapping a new Emacs. Instead, let's merely assume that a new developer has a working shell, since that's needed anyway for other reasons. > Is it possible to install the hooks in GNUMakefile, so that running > autogen.sh manually wouldn't be needed? I normally almost never run > autogen.sh, I always just run "make" from the top-level directory. I'm leery of that. GNUMakefile is run by everybody, not just by developers. And there are good reasons that the Git hooks are not themselves in the Git repository (if they were in the repository, and they were sufficiently messed up, then one couldn't 'git commit' fixes). If we routinely updated the Git hooks automatically, we'd make it more likely that Git non-experts would put their painfully constructed repositories into a "broken" state, and we already have too much panic in that neighborhood. That being said, it might make sense to separate the Git hook (re)initialization into an autogen.sh option, since it's cheap whereas autogen.sh by default is expensive. [-- Warning: decoded text below may be mangled, UTF-8 assumed --] [-- Attachment #2: 0001-Port-commit-msg-to-broken-MS-Windows-shell.patch --] [-- Type: text/x-patch; name="0001-Port-commit-msg-to-broken-MS-Windows-shell.patch", Size: 1570 bytes --] From 2cc9a32d361492f28faeced22e3800eb1a59c252 Mon Sep 17 00:00:00 2001 From: Paul Eggert <eggert@cs.ucla.edu> Date: Sat, 11 Apr 2015 08:19:13 -0700 Subject: [PATCH] Port commit-msg to broken MS-Windows shell * build-aux/git-hooks/commit-msg (cent_sign): Just use UTF-8 here rather than ASCII + printf, as the latter fails on a broken MS-Windows shell. Reported by Eli Zaretskii in: http://lists.gnu.org/archive/html/emacs-devel/2015-04/msg00592.html --- build-aux/git-hooks/commit-msg | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/build-aux/git-hooks/commit-msg b/build-aux/git-hooks/commit-msg index ea8d909..3fc6e19 100755 --- a/build-aux/git-hooks/commit-msg +++ b/build-aux/git-hooks/commit-msg @@ -29,8 +29,7 @@ fi # Use a UTF-8 locale if available, so that the UTF-8 check works. # Use U+00A2 CENT SIGN to test whether the locale works. -cent_sign_utf8_format='\302\242\n' -cent_sign=`printf "$cent_sign_utf8_format"` +cent_sign='¢' print_at_sign='{print substr("'$cent_sign'@", 2)}' at_sign=`$awk "$print_at_sign" 2>/dev/null` if test "$at_sign" != @; then @@ -45,7 +44,7 @@ exec $awk -v at_sign="$at_sign" -v cent_sign="$cent_sign" ' BEGIN { # These regular expressions assume traditional Unix unibyte behavior. # They are needed for old or broken versions of awk, e.g., - # mawk 1.3.3 (1996), Gawk 3.0.4 (1999). + # mawk 1.3.3 (1996), or gawk on MSYS (2015). space = "[ \f\n\r\t\v]" non_space = "[^ \f\n\r\t\v]" non_print = "[\1-\37\177]" -- 2.1.0 ^ permalink raw reply related [flat|nested] 27+ messages in thread
* Re: commit-msg hook 2015-04-11 15:40 ` Paul Eggert @ 2015-04-11 16:40 ` Eli Zaretskii 2015-04-11 20:09 ` Paul Eggert 0 siblings, 1 reply; 27+ messages in thread From: Eli Zaretskii @ 2015-04-11 16:40 UTC (permalink / raw) To: Paul Eggert; +Cc: emacs-devel > Date: Sat, 11 Apr 2015 08:40:29 -0700 > From: Paul Eggert <eggert@cs.ucla.edu> > CC: emacs-devel@gnu.org > > Eli Zaretskii wrote: > > the problem with hanging is not due > > to Gawk, but instead due to 'printf': it doesn't produce the UTF-8 > > encoding of the percent sign > > OK, so we're dealing with a broken MS-Windows shell rather than a broken > MS-Windows gawk. That 'printf' command is there only so that the script can be > pure ASCII, and it's simpler to avoid that nicety. I installed the attached > patch to try to work around the shell bug. That hangs as well, and I found out why: Gawk waits for input. (I don't know why this doesn't work as you intended, seems like another incompatibility in MSYS Bash.) To fix, replace this line print_at_sign='{print substr("'$cent_sign'@", 2)}' with this: print_at_sign='BEGIN {print substr("'$cent_sign'@", 2)}' and then even your previous version works. Sorry for my imperfect debugging, there are too many potentially problematic constructs in this script, and it's not easy to understand what exactly misfires, especially when Bash shows me corrupted non-ASCII strings. Thanks. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: commit-msg hook 2015-04-11 16:40 ` Eli Zaretskii @ 2015-04-11 20:09 ` Paul Eggert 2015-04-12 16:10 ` Eli Zaretskii 2015-04-13 15:48 ` Eli Zaretskii 0 siblings, 2 replies; 27+ messages in thread From: Paul Eggert @ 2015-04-11 20:09 UTC (permalink / raw) To: Eli Zaretskii; +Cc: emacs-devel [-- Attachment #1: Type: text/plain, Size: 429 bytes --] Eli Zaretskii wrote: > That hangs as well, and I found out why: Gawk waits for input. Ah, that could be due to an incompatibility with the MSYS shell; but the BEGIN can't hurt, and I installed the attached patch to redirect from stdin as well. Since this appears to be the real problem, I'd rather go back to the all-ASCII script, for portability to pre-POSIX shells (probably doesn't matter nowadays, but shouldn't hurt). [-- Warning: decoded text below may be mangled, UTF-8 assumed --] [-- Attachment #2: 0001-Port-commit-msg-to-MSYS-Bash-Gawk.patch --] [-- Type: text/x-patch; name="0001-Port-commit-msg-to-MSYS-Bash-Gawk.patch", Size: 1549 bytes --] From ad3a0eac324a7163ab0e345e60f741e2d55087b9 Mon Sep 17 00:00:00 2001 From: Paul Eggert <eggert@cs.ucla.edu> Date: Sat, 11 Apr 2015 13:04:37 -0700 Subject: [PATCH] Port commit-msg to MSYS Bash+Gawk See Eli Zaretskii in: http://lists.gnu.org/archive/html/emacs-devel/2015-04/msg00610.html * build-aux/git-hooks/commit-msg (cent_sign_utf8_format) (cent_sign, print_at_sign, at_sign): Revert previous change. (print_at_sign): Prepend "BEGIN". (at_sign): Redirect from /dev/null to be safer with pre-POSIX awk. --- build-aux/git-hooks/commit-msg | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/build-aux/git-hooks/commit-msg b/build-aux/git-hooks/commit-msg index 3fc6e19..6e31dbc 100755 --- a/build-aux/git-hooks/commit-msg +++ b/build-aux/git-hooks/commit-msg @@ -29,11 +29,12 @@ fi # Use a UTF-8 locale if available, so that the UTF-8 check works. # Use U+00A2 CENT SIGN to test whether the locale works. -cent_sign='¢' -print_at_sign='{print substr("'$cent_sign'@", 2)}' -at_sign=`$awk "$print_at_sign" 2>/dev/null` +cent_sign_utf8_format='\302\242\n' +cent_sign=`printf "$cent_sign_utf8_format"` +print_at_sign='BEGIN {print substr("'$cent_sign'@", 2)}' +at_sign=`$awk "$print_at_sign" </dev/null 2>/dev/null` if test "$at_sign" != @; then - at_sign=`LC_ALL=en_US.UTF-8 $awk "$print_at_sign" 2>/dev/null` + at_sign=`LC_ALL=en_US.UTF-8 $awk "$print_at_sign" </dev/null 2>/dev/null` if test "$at_sign" = @; then LC_ALL=en_US.UTF-8; export LC_ALL fi -- 2.1.0 ^ permalink raw reply related [flat|nested] 27+ messages in thread
* Re: commit-msg hook 2015-04-11 20:09 ` Paul Eggert @ 2015-04-12 16:10 ` Eli Zaretskii 2015-04-13 15:48 ` Eli Zaretskii 1 sibling, 0 replies; 27+ messages in thread From: Eli Zaretskii @ 2015-04-12 16:10 UTC (permalink / raw) To: Paul Eggert; +Cc: emacs-devel > Date: Sat, 11 Apr 2015 13:09:12 -0700 > From: Paul Eggert <eggert@cs.ucla.edu> > CC: emacs-devel@gnu.org > > Eli Zaretskii wrote: > > That hangs as well, and I found out why: Gawk waits for input. > > Ah, that could be due to an incompatibility with the MSYS shell; but the BEGIN > can't hurt, and I installed the attached patch to redirect from stdin as well. > Since this appears to be the real problem, I'd rather go back to the all-ASCII > script, for portability to pre-POSIX shells (probably doesn't matter nowadays, > but shouldn't hurt). Thanks, this version works for me. Once again, sorry for not detecting the real culprit sooner. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: commit-msg hook 2015-04-11 20:09 ` Paul Eggert 2015-04-12 16:10 ` Eli Zaretskii @ 2015-04-13 15:48 ` Eli Zaretskii 2015-04-13 18:37 ` Paul Eggert 1 sibling, 1 reply; 27+ messages in thread From: Eli Zaretskii @ 2015-04-13 15:48 UTC (permalink / raw) To: Paul Eggert; +Cc: emacs-devel > Date: Sat, 11 Apr 2015 13:09:12 -0700 > From: Paul Eggert <eggert@cs.ucla.edu> > Cc: emacs-devel@gnu.org > > Since this appears to be the real problem, I'd rather go back to the all-ASCII > script, for portability to pre-POSIX shells (probably doesn't matter nowadays, > but shouldn't hurt). Actually, I've been thinking: why do we need to rely on system libraries to implement UTF-8 and [:print:] correctly? According to the Unicode Standard, [:print:] should reject only a small number of special characters: . control characters between 0x01 and 0x1f and 0x7f to 0x9f . surrogates . unassigned codepoints I think we should not detect unassigned codepoints, since hundreds of them are assigned every year, so we are likely to trigger false positives. As for the other two groups, it should be quite easy to detect their UTF-8 sequences with relatively simple regular expressions, without relying on system libraries or up-to-date Gawk/whatever, if we assume that commit log messages must always be UTF-8 encoded. WDYT? ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: commit-msg hook 2015-04-13 15:48 ` Eli Zaretskii @ 2015-04-13 18:37 ` Paul Eggert 2015-04-13 20:18 ` Eli Zaretskii 0 siblings, 1 reply; 27+ messages in thread From: Paul Eggert @ 2015-04-13 18:37 UTC (permalink / raw) To: Eli Zaretskii; +Cc: emacs-devel On 04/13/2015 08:48 AM, Eli Zaretskii wrote: > why do we need to rely on system > libraries to implement UTF-8 and [:print:] correctly? Some POSIXish environments do not support unibyte locales; even the C locale is multibyte (it uses UTF-8). In such environments, I expect that a regular expression like /\342\202\254/ won't necessarily match the string "€" (U+20AC, equivalent to the byte string "\342\202\254") in any locale, just as in Emacs the call (re-search-forward "\342\202\254") won't find the string "€" in a UTF-8 file, regardless of language settings. I suppose the script could detect whether we're in such an environment, but I dunno, it sounds like more trouble than it's worth. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: commit-msg hook 2015-04-13 18:37 ` Paul Eggert @ 2015-04-13 20:18 ` Eli Zaretskii 2015-04-13 21:19 ` Paul Eggert 0 siblings, 1 reply; 27+ messages in thread From: Eli Zaretskii @ 2015-04-13 20:18 UTC (permalink / raw) To: Paul Eggert; +Cc: emacs-devel > Date: Mon, 13 Apr 2015 11:37:35 -0700 > From: Paul Eggert <eggert@cs.ucla.edu> > CC: emacs-devel@gnu.org > > On 04/13/2015 08:48 AM, Eli Zaretskii wrote: > > why do we need to rely on system > > libraries to implement UTF-8 and [:print:] correctly? > Some POSIXish environments do not support unibyte locales; even the C > locale is multibyte (it uses UTF-8). Gawk has the --characters-as-bytes option since v4.0.0, which should countermand that, I think. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: commit-msg hook 2015-04-13 20:18 ` Eli Zaretskii @ 2015-04-13 21:19 ` Paul Eggert 2015-04-14 15:08 ` Eli Zaretskii 0 siblings, 1 reply; 27+ messages in thread From: Paul Eggert @ 2015-04-13 21:19 UTC (permalink / raw) To: Eli Zaretskii; +Cc: emacs-devel On 04/13/2015 01:18 PM, Eli Zaretskii wrote: > Gawk has the --characters-as-bytes option since v4.0.0, which should > countermand that, I think. Sure, although the code should work even plain POSIX awk, as there should be no need to assume such a GNU extension when bootstrapping. That is, the script could support either: 1. POSIX awk with multibyte OS support, with proper UTF-8 checking from OS libraries; or 2. GNU awk 4 (2012) or later, with nearly-as-good UTF-8 checking hand-coded into the script; or 3. Traditional awk without UTF-8 checking. Currently the script supports (1) and (3) but someone could add support for (2). ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: commit-msg hook 2015-04-13 21:19 ` Paul Eggert @ 2015-04-14 15:08 ` Eli Zaretskii 2015-04-14 17:01 ` Paul Eggert 0 siblings, 1 reply; 27+ messages in thread From: Eli Zaretskii @ 2015-04-14 15:08 UTC (permalink / raw) To: Paul Eggert; +Cc: emacs-devel > Date: Mon, 13 Apr 2015 14:19:51 -0700 > From: Paul Eggert <eggert@cs.ucla.edu> > CC: emacs-devel@gnu.org > > On 04/13/2015 01:18 PM, Eli Zaretskii wrote: > > Gawk has the --characters-as-bytes option since v4.0.0, which should > > countermand that, I think. > > Sure, although the code should work even plain POSIX awk, as there > should be no need to assume such a GNU extension when bootstrapping. > That is, the script could support either: > > 1. POSIX awk with multibyte OS support, with proper UTF-8 checking from > OS libraries; or > > 2. GNU awk 4 (2012) or later, with nearly-as-good UTF-8 checking > hand-coded into the script; or > > 3. Traditional awk without UTF-8 checking. > > Currently the script supports (1) and (3) but someone could add support > for (2). How about the following change? It improves on (3), and worked for me both on MS-Windows and on GNU/Linux. --- ./.git/hooks/commit-msg.~5~ 2015-04-12 19:11:27.481125000 +0300 +++ ./.git/hooks/commit-msg 2015-04-14 11:11:02.000000000 +0300 @@ -45,10 +45,13 @@ BEGIN { # These regular expressions assume traditional Unix unibyte behavior. # They are needed for old or broken versions of awk, e.g., - # mawk 1.3.3 (1996), or gawk on MSYS (2015). + # mawk 1.3.3 (1996), or gawk on MSYS (2015), and/or for systems that + # cannot use UTF-8 as the codeset for the locale. space = "[ \f\n\r\t\v]" non_space = "[^ \f\n\r\t\v]" - non_print = "[\1-\37\177]" + # The non_print below rejects control characters and surrogates + # UTF-8 for: 0x01-0x1f 0x7f 0x80-0x9f 0xd800-0xdbff 0xdc00-0xdfff + non_print = "[\1-\37\177]|\302[\200-\237]|\355([\240-\257]|[\260-\277])[\200-\277]" # Prefer POSIX regular expressions if available, as they do a # better job of checking. Similarly, prefer POSIX negated ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: commit-msg hook 2015-04-14 15:08 ` Eli Zaretskii @ 2015-04-14 17:01 ` Paul Eggert 2015-04-14 17:09 ` Eli Zaretskii 0 siblings, 1 reply; 27+ messages in thread From: Paul Eggert @ 2015-04-14 17:01 UTC (permalink / raw) To: Eli Zaretskii; +Cc: emacs-devel On 04/14/2015 08:08 AM, Eli Zaretskii wrote: > + non_print = "[\1-\37\177]|\302[\200-\237]|\355([\240-\257]|[\260-\277])[\200-\277]" This sort of thing should work in a unibyte environment, but it needs to be used only after testing that we actually are in a unibyte environment. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: commit-msg hook 2015-04-14 17:01 ` Paul Eggert @ 2015-04-14 17:09 ` Eli Zaretskii 2015-04-14 17:42 ` Paul Eggert 0 siblings, 1 reply; 27+ messages in thread From: Eli Zaretskii @ 2015-04-14 17:09 UTC (permalink / raw) To: Paul Eggert; +Cc: emacs-devel > Date: Tue, 14 Apr 2015 10:01:12 -0700 > From: Paul Eggert <eggert@cs.ucla.edu> > CC: emacs-devel@gnu.org > > On 04/14/2015 08:08 AM, Eli Zaretskii wrote: > > + non_print = "[\1-\37\177]|\302[\200-\237]|\355([\240-\257]|[\260-\277])[\200-\277]" > > This sort of thing should work in a unibyte environment, but it needs to > be used only after testing that we actually are in a unibyte environment. I thought that's what all the tests with cent_sign and at_sign do, don't they? Anyway, what bad things could happen if this regular expression is used in a multibyte environment? ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: commit-msg hook 2015-04-14 17:09 ` Eli Zaretskii @ 2015-04-14 17:42 ` Paul Eggert 2015-04-14 18:01 ` Eli Zaretskii 0 siblings, 1 reply; 27+ messages in thread From: Paul Eggert @ 2015-04-14 17:42 UTC (permalink / raw) To: Eli Zaretskii; +Cc: emacs-devel On 04/14/2015 10:09 AM, Eli Zaretskii wrote: >> This sort of thing should work in a unibyte environment, but it needs to >> >be used only after testing that we actually are in a unibyte environment. > I thought that's what all the tests with cent_sign and at_sign do, > don't they? No, they test for something more specific, namely, whether we're in a UTF-8 locale. Not every multibyte locale uses UTF-8. > what bad things could happen if this regular expression is > used in a multibyte environment? I suppose it could cause the script to print "Unprintable character in commit message" even though all the message's characters are actually printable. How about this idea? Before falling back to the unibyte regular expressions in awk, set LC_ALL='C' in the environment. This should work well enough, as in practice all environments where the C locale is multibyte have working UTF-8 so they won't need to fall back to unibyte anyway. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: commit-msg hook 2015-04-14 17:42 ` Paul Eggert @ 2015-04-14 18:01 ` Eli Zaretskii 2015-04-14 18:32 ` Paul Eggert 0 siblings, 1 reply; 27+ messages in thread From: Eli Zaretskii @ 2015-04-14 18:01 UTC (permalink / raw) To: Paul Eggert; +Cc: emacs-devel > Date: Tue, 14 Apr 2015 10:42:53 -0700 > From: Paul Eggert <eggert@cs.ucla.edu> > CC: emacs-devel@gnu.org > > How about this idea? Before falling back to the unibyte regular > expressions in awk, set LC_ALL='C' in the environment. This should work > well enough, as in practice all environments where the C locale is > multibyte have working UTF-8 so they won't need to fall back to unibyte > anyway. You mean, like below? --- ./.git/hooks/commit-msg.~5~ 2015-04-12 19:11:27.481125000 +0300 +++ ./.git/hooks/commit-msg 2015-04-14 21:01:14.481125000 +0300 @@ -37,6 +37,8 @@ at_sign=`LC_ALL=en_US.UTF-8 $awk "$print_at_sign" </dev/null 2>/dev/null` if test "$at_sign" = @; then LC_ALL=en_US.UTF-8; export LC_ALL + else + LC_ALL=C; export LC_ALL fi fi @@ -45,10 +47,13 @@ BEGIN { # These regular expressions assume traditional Unix unibyte behavior. # They are needed for old or broken versions of awk, e.g., - # mawk 1.3.3 (1996), or gawk on MSYS (2015). + # mawk 1.3.3 (1996), or gawk on MSYS (2015), and/or for systems that + # cannot use UTF-8 as the codeset for the locale. space = "[ \f\n\r\t\v]" non_space = "[^ \f\n\r\t\v]" - non_print = "[\1-\37\177]" + # The non_print below rejects control characters and surrogates + # UTF-8 for: 0x01-0x1f 0x7f 0x80-0x9f 0xd800-0xdbff 0xdc00-0xdfff + non_print = "[\1-\37\177]|\302[\200-\237]|\355([\240-\257]|[\260-\277])[\200-\277]" # Prefer POSIX regular expressions if available, as they do a # better job of checking. Similarly, prefer POSIX negated ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: commit-msg hook 2015-04-14 18:01 ` Eli Zaretskii @ 2015-04-14 18:32 ` Paul Eggert 2015-04-14 18:59 ` Eli Zaretskii 0 siblings, 1 reply; 27+ messages in thread From: Paul Eggert @ 2015-04-14 18:32 UTC (permalink / raw) To: Eli Zaretskii; +Cc: emacs-devel On 04/14/2015 11:01 AM, Eli Zaretskii wrote: > You mean, like below? Looks good, yes. Some nits: > > --- ./.git/hooks/commit-msg.~5~ 2015-04-12 19:11:27.481125000 +0300 > +++ ./.git/hooks/commit-msg 2015-04-14 21:01:14.481125000 +0300 > @@ -37,6 +37,8 @@ > at_sign=`LC_ALL=en_US.UTF-8 $awk "$print_at_sign" </dev/null 2>/dev/null` > if test "$at_sign" = @; then > LC_ALL=en_US.UTF-8; export LC_ALL > + else > + LC_ALL=C; export LC_ALL > fi > fi > The duplicate 'export LC_ALL's can be hoisted out of the 'if'. > + non_print = "[\1-\37\177]|\302[\200-\237]|\355([\240-\257]|[\260-\277])[\200-\277]" > > "([\240-\257]|[\260-\277])" can be simplified to "[\240-\277]". ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: commit-msg hook 2015-04-14 18:32 ` Paul Eggert @ 2015-04-14 18:59 ` Eli Zaretskii 0 siblings, 0 replies; 27+ messages in thread From: Eli Zaretskii @ 2015-04-14 18:59 UTC (permalink / raw) To: Paul Eggert; +Cc: emacs-devel > Date: Tue, 14 Apr 2015 11:32:51 -0700 > From: Paul Eggert <eggert@cs.ucla.edu> > CC: emacs-devel@gnu.org > > On 04/14/2015 11:01 AM, Eli Zaretskii wrote: > > You mean, like below? > > Looks good, yes. Some nits: > > > > > --- ./.git/hooks/commit-msg.~5~ 2015-04-12 19:11:27.481125000 +0300 > > +++ ./.git/hooks/commit-msg 2015-04-14 21:01:14.481125000 +0300 > > @@ -37,6 +37,8 @@ > > at_sign=`LC_ALL=en_US.UTF-8 $awk "$print_at_sign" </dev/null 2>/dev/null` > > if test "$at_sign" = @; then > > LC_ALL=en_US.UTF-8; export LC_ALL > > + else > > + LC_ALL=C; export LC_ALL > > fi > > fi > > > > The duplicate 'export LC_ALL's can be hoisted out of the 'if'. > > + non_print = "[\1-\37\177]|\302[\200-\237]|\355([\240-\257]|[\260-\277])[\200-\277]" > > > > > > "([\240-\257]|[\260-\277])" can be simplified to "[\240-\277]". Thanks, I made those changes and pushed. ^ permalink raw reply [flat|nested] 27+ messages in thread
end of thread, other threads:[~2015-04-14 18:59 UTC | newest] Thread overview: 27+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2015-04-10 10:43 commit-msg hook Eli Zaretskii 2015-04-10 18:23 ` Johan Bockgård 2015-04-11 2:42 ` Paul Eggert 2015-04-11 7:24 ` Eli Zaretskii 2015-04-11 9:55 ` Eli Zaretskii 2015-04-11 9:59 ` Eli Zaretskii 2015-04-11 12:42 ` Dmitry Gutov 2015-04-11 14:29 ` Eli Zaretskii 2015-04-11 15:13 ` Dmitry Gutov 2015-04-11 15:17 ` Eli Zaretskii 2015-04-12 3:36 ` Stefan Monnier 2015-04-12 18:54 ` chad 2015-04-11 15:40 ` Paul Eggert 2015-04-11 16:40 ` Eli Zaretskii 2015-04-11 20:09 ` Paul Eggert 2015-04-12 16:10 ` Eli Zaretskii 2015-04-13 15:48 ` Eli Zaretskii 2015-04-13 18:37 ` Paul Eggert 2015-04-13 20:18 ` Eli Zaretskii 2015-04-13 21:19 ` Paul Eggert 2015-04-14 15:08 ` Eli Zaretskii 2015-04-14 17:01 ` Paul Eggert 2015-04-14 17:09 ` Eli Zaretskii 2015-04-14 17:42 ` Paul Eggert 2015-04-14 18:01 ` Eli Zaretskii 2015-04-14 18:32 ` Paul Eggert 2015-04-14 18:59 ` Eli Zaretskii
Code repositories for project(s) associated with this external index https://git.savannah.gnu.org/cgit/emacs.git https://git.savannah.gnu.org/cgit/emacs/org-mode.git This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.