unofficial mirror of bug-gnu-emacs@gnu.org 
* bug#27978: Detection of section name in man.el
@ 2017-08-05 23:44 Grégory Mounié
  2017-08-18  8:49 ` Eli Zaretskii
  0 siblings, 1 reply; 3+ messages in thread
From: Grégory Mounié @ 2017-08-05 23:44 UTC (permalink / raw)
  To: 27978

[-- Attachment #1: Type: text/plain, Size: 604 bytes --]


  When parsing manual pages in languages with non-ASCII letters, section
names that use non-ASCII letters are not added to the table of contents.

  I noticed the bug while reading the French bash manual: the quite useful
"COMMANDES INTERNES DE l'INTERPRÉTEUR" section (SHELL BUILTIN COMMANDS)
does not appear, because of the letter É.

  I propose to use character classes instead of ASCII ranges in the
appropriate regexp defvars. This should not change anything for English
manuals, and it should work for many other languages.

  It works great for the bash manual in French.
  Grégory Mounié
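
A minimal sketch of the difference described above, assuming the two
regexps are tried with `case-fold-search' bound to nil. The names
`demo-ascii-heading-re', `demo-class-heading-re' and `demo-heading-p'
are hypothetical, and "RÉSUMÉ" is only an example heading string:

```elisp
;; Illustrative only: these strings mirror the old and the proposed
;; value of `Man-heading-regexp'.
(defconst demo-ascii-heading-re "^\\([A-Z][A-Z0-9 /-]+\\)$")
(defconst demo-class-heading-re "^\\([[:upper:]][[:upper:][:digit:] /-]+\\)$")

(defun demo-heading-p (regexp line)
  "Return t if LINE matches REGEXP as a manpage heading."
  ;; With `case-fold-search' non-nil, both A-Z and [:upper:] would
  ;; also match lower-case letters, so bind it to nil here.
  (let ((case-fold-search nil))
    (and (string-match-p regexp line) t)))

(demo-heading-p demo-ascii-heading-re "SEE ALSO")  ; ASCII-only heading
(demo-heading-p demo-ascii-heading-re "RÉSUMÉ")    ; É is outside A-Z
(demo-heading-p demo-class-heading-re "RÉSUMÉ")    ; É belongs to [:upper:]
```

The ASCII range rejects the accented heading while the character class
should accept it; plain ASCII headings match both forms.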

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 0001-Unicode-support-for-man-section-name-detection.patch --]
[-- Type: text/x-patch; name="0001-Unicode-support-for-man-section-name-detection.patch", Size: 1814 bytes --]

From f9f8b027bcec6fe7aec2c0009eecdcd7e8880292 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Gr=C3=A9gory=20Mouni=C3=A9?= <Gregory.Mounie@imag.fr>
Date: Sun, 6 Aug 2017 01:22:58 +0200
Subject: [PATCH] Unicode support for man section name detection

* lisp/man.el: Replace ascii interval by character class in
order to detect correctly the section names in the table of
content (eg. in the french version of the  bash manual).

Copyright-paperwork-exempt: yes
---
 lisp/man.el | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/lisp/man.el b/lisp/man.el
index 0e1c92956b..97a4758e7e 100644
--- a/lisp/man.el
+++ b/lisp/man.el
@@ -278,21 +278,21 @@ Man-cooked-hook
   :type 'hook
   :group 'man)
 
-(defvar Man-name-regexp "[-a-zA-Z0-9_­+][-a-zA-Z0-9_.:­+]*"
+(defvar Man-name-regexp "[-[:alnum:]_­+][-[:alnum:]_.:­+]*"
   "Regular expression describing the name of a manpage (without section).")
 
-(defvar Man-section-regexp "[0-9][a-zA-Z0-9+]*\\|[LNln]"
+(defvar Man-section-regexp "[[:digit:]][[:alnum:]+]*\\|[LNln]"
   "Regular expression describing a manpage section within parentheses.")
 
 (defvar Man-page-header-regexp
   (if (string-match "-solaris2\\." system-configuration)
-      (concat "^[-A-Za-z0-9_].*[ \t]\\(" Man-name-regexp
+      (concat "^[-[:alnum:]_].*[ \t]\\(" Man-name-regexp
 	      "(\\(" Man-section-regexp "\\))\\)$")
     (concat "^[ \t]*\\(" Man-name-regexp
 	    "(\\(" Man-section-regexp "\\))\\).*\\1"))
   "Regular expression describing the heading of a page.")
 
-(defvar Man-heading-regexp "^\\([A-Z][A-Z0-9 /-]+\\)$"
+(defvar Man-heading-regexp "^\\([[:upper:]][[:upper:][:digit:] /-]+\\)$"
   "Regular expression describing a manpage heading entry.")
 
 (defvar Man-see-also-regexp "SEE ALSO"
-- 
2.13.3



* bug#27978: Detection of section name in man.el
  2017-08-05 23:44 bug#27978: Detection of section name in man.el Grégory Mounié
@ 2017-08-18  8:49 ` Eli Zaretskii
       [not found]   ` <4f29a934-24db-6d10-db27-fd3a3a0c1269@imag.fr>
  0 siblings, 1 reply; 3+ messages in thread
From: Eli Zaretskii @ 2017-08-18  8:49 UTC (permalink / raw)
  To: Grégory Mounié; +Cc: 27978-done

> From: Grégory Mounié
> 	<Gregory.Mounie@imag.fr>
> Date: Sun, 6 Aug 2017 01:44:19 +0200
> 
>   When parsing manual pages in languages with non-ASCII letters, section
> names that use non-ASCII letters are not added to the table of contents.
> 
>   I noticed the bug while reading the French bash manual: the quite useful
> "COMMANDES INTERNES DE l'INTERPRÉTEUR" section (SHELL BUILTIN COMMANDS)
> does not appear, because of the letter É.
> 
>   I propose to use character classes instead of ASCII ranges in the
> appropriate regexp defvars. This should not change anything for English
> manuals, and it should work for many other languages.

Thanks, I pushed these changes with some minor adjustments.
Specifically:

> -(defvar Man-section-regexp "[0-9][a-zA-Z0-9+]*\\|[LNln]"
> +(defvar Man-section-regexp "[[:digit:]][[:alnum:]+]*\\|[LNln]"
>    "Regular expression describing a manpage section within parentheses.")

I didn't change this one, because I think a section always uses only
ASCII letters and numbers, as in ".1n".  If you disagree, can you show
an example where this is not so?
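
For illustration, a small sketch of what the unchanged ASCII-only
section regexp accepts.  `demo-section-re' copies the existing value of
`Man-section-regexp'; `demo-section-p' and the \` ... \' anchoring are
assumptions of this sketch, added so that the whole string must be a
section:

```elisp
(defconst demo-section-re "[0-9][a-zA-Z0-9+]*\\|[LNln]")

(defun demo-section-p (string)
  "Return t if STRING as a whole looks like a manpage section."
  (let ((case-fold-search nil))
    (and (string-match-p (concat "\\`\\(?:" demo-section-re "\\)\\'")
                         string)
         t)))

;; "1", "1n", "3pm" and "n" are ASCII sections; "1é" is a made-up
;; non-ASCII suffix that the ASCII ranges reject.
(mapcar #'demo-section-p '("1" "1n" "3pm" "n" "1é"))
```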

> -(defvar Man-heading-regexp "^\\([A-Z][A-Z0-9 /-]+\\)$"
> +(defvar Man-heading-regexp "^\\([[:upper:]][[:upper:][:digit:] /-]+\\)$"
>    "Regular expression describing a manpage heading entry.")

I see no reason to replace 0-9 with [:digit:] here, since I think
non-ASCII digits will never be used in this context.  Do you agree?
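
As far as the Emacs Lisp manual documents it, [:digit:] in Emacs
regexps matches only 0 through 9, so the two spellings should behave
the same either way.  A quick check under that assumption, using a
hypothetical helper `demo-digit-match-p' and an Arabic-Indic digit
written as an escape:

```elisp
(defun demo-digit-match-p (regexp string)
  "Return t if REGEXP matches somewhere in STRING."
  (and (string-match-p regexp string) t))

(demo-digit-match-p "[[:digit:]]" "7")       ; ASCII digit
(demo-digit-match-p "[[:digit:]]" "\u0667")  ; ARABIC-INDIC DIGIT SEVEN
(demo-digit-match-p "[0-9]" "\u0667")        ; same question for 0-9
```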

Incidentally, I see quite a few similar regexps elsewhere in man.el,
did you audit all of them and establish that they don't need similar
changes?  If not, would you like to propose similar changes there?






* bug#27978: Detection of section name in man.el
       [not found]   ` <4f29a934-24db-6d10-db27-fd3a3a0c1269@imag.fr>
@ 2017-08-18 19:23     ` Eli Zaretskii
  0 siblings, 0 replies; 3+ messages in thread
From: Eli Zaretskii @ 2017-08-18 19:23 UTC (permalink / raw)
  To: Grégory Mounié; +Cc: 27978

[Please keep the bug address on the CC list.]

> From: Grégory Mounié <Gregory.Mounie@imag.fr>
> Date: Fri, 18 Aug 2017 19:53:44 +0200
> 
>   In brief, I would not change the other a-zA-Z regexps (details below).
> 
>   But I would change the SEE ALSO regexp (around line 298) to add other
> languages. Should I file another bug report with another patch?
> 
> (defvar Man-see-also-regexp "SEE ALSO"
>    "Regular expression for SEE ALSO heading (or your equivalent).
> This regexp should not start with a `^' character.")
> 
>   Using the Debian manpage translations as a reference, and running
>   "zgrep -h SH man*/*  | sort | uniq -c | sort -n" inside the appropriate
> /usr/share/man subdirectories to infer the values, I propose:
> 
>   "SEE ALSO\|VOIR AUSSI\|SIEHE AUCH\|VÉASE TAMBIÉN\|VEJA TAMBÉM\|VEDERE 
> ANCHE\|ZOBACZ TAKŻE\|İLGİLİ BELGELER\|参照\|参见 SEE ALSO\|參見 SEE ALSO"
> 
>   (French, German, Spanish, Portuguese, Italian, Polish, Turkish,
> Japanese, Chinese CN, Chinese TW)

OK.  If no one objects, I will make this change soon.  Thanks.
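
One way such an alternation could be assembled, as a sketch only and
not necessarily the form that will be committed.  The names
`demo-see-also-headings', `demo-see-also-re' and `demo-see-also-line-p'
are hypothetical, and the ^ ... $ anchoring is added by the helper
because the regexp itself should not start with `^':

```elisp
(defconst demo-see-also-headings
  '("SEE ALSO" "VOIR AUSSI" "SIEHE AUCH" "VÉASE TAMBIÉN" "VEJA TAMBÉM"
    "VEDERE ANCHE" "ZOBACZ TAKŻE" "İLGİLİ BELGELER"
    "参照" "参见 SEE ALSO" "參見 SEE ALSO")
  "Heading strings taken from the proposal above.")

(defconst demo-see-also-re
  ;; Quote each heading and join the alternatives with \|.
  (mapconcat #'regexp-quote demo-see-also-headings "\\|"))

(defun demo-see-also-line-p (line)
  "Return t if LINE consists of exactly one of the listed headings."
  (let ((case-fold-search nil))
    (and (string-match-p (concat "^\\(?:" demo-see-also-re "\\)$") line)
         t)))

(demo-see-also-line-p "VOIR AUSSI")  ; listed heading
(demo-see-also-line-p "AUTHOR")      ; not a SEE ALSO heading
```

Keeping the headings in a list makes it easier to add a language later
than editing one long string.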

> Details below about the a-zA-Z regexps:
> 
> Le 18/08/2017 à 10:49, Eli Zaretskii a écrit :
> > 
> > Thanks, I pushed these changes with some minor adjustments.
> > Specifically:
> > 
> >> -(defvar Man-section-regexp "[0-9][a-zA-Z0-9+]*\\|[LNln]"
> >> +(defvar Man-section-regexp "[[:digit:]][[:alnum:]+]*\\|[LNln]"
> >>     "Regular expression describing a manpage section within parentheses.")
> > 
> > I didn't change this one, because I think a section always uses only
> > ASCII letters and numbers, as in ".1n".  If you disagree, can you show
> > an example where this is not so?
> > 
> 
>   I have installed the various multilingual standard manpages on my Debian
> system and grep found no counterexample, so I guess it is fine as is.
> 
> >> -(defvar Man-heading-regexp "^\\([A-Z][A-Z0-9 /-]+\\)$"
> >> +(defvar Man-heading-regexp "^\\([[:upper:]][[:upper:][:digit:] /-]+\\)$"
> >>     "Regular expression describing a manpage heading entry.")
> > 
> > I see no reason to replace 0-9 with [:digit:] here, since I think
> > non-ASCII digits will never be used in this context.  Do you agree?
> > 
> > Incidentally, I see quite a few similar regexps elsewhere in man.el,
> > did you audit all of them and establish that they don't need similar
> > changes?  If not, would you like to propose similar changes there?
> > 
> 
>   There are 18 occurrences of a-zA-Z. They look like detection logic
> carefully shaped over time, so I would not change them without a
> counterexample either.
> 
>   The first four a-zA-Z seem related to the parsing of external
> commands, with particularities in the Windows port, so I would not
> recommend changing them.
>   Occurrences 5-18 of a-zA-Z try to guess the manpage around POS. The
>   main pattern is "-a-zA-Z0-9._+:"
> 
>   With the same set of multilingual manpages, I have found only one
> character used in a manpage name and not in the set: "[" (man [ leads you
> to test). I suspect that adding "[" would add more regressions than
> solutions.
> 
>   Note that at line 720 the pattern is slightly different (it lacks
> "-._:"). I do not really understand why.
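
To make the point about the character set concrete, here is a sketch
that assumes the set is handed to `skip-chars-backward' and
`skip-chars-forward' to find a name around a position, which the
message above does not spell out.  `demo-name-around' is a hypothetical
helper:

```elisp
(defun demo-name-around (text pos)
  "Return the run of manpage-name characters surrounding POS in TEXT."
  (with-temp-buffer
    (insert text)
    (goto-char pos)
    (let ((chars "-a-zA-Z0-9._+:")
          start)
      (skip-chars-backward chars)   ; move to the start of the name
      (setq start (point))
      (skip-chars-forward chars)    ; move to the end of the name
      (buffer-substring-no-properties start (point)))))

(demo-name-around "see printf(3) for details" 7)  ; point inside "printf"
(demo-name-around "see [(1) for details" 5)       ; point on "["
```

With point inside "printf" the helper returns the whole name, while
with point on "[" it returns an empty string, since "[" is not in the
set.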







Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/emacs.git
