* bug#27978: Detection of section name in man.el
@ 2017-08-05 23:44 Grégory Mounié
From: Grégory Mounié @ 2017-08-05 23:44 UTC
To: 27978

When parsing manuals in languages with non-ASCII letters, section
names that use non-ASCII letters are not added to the table of contents.

I noticed the bug while reading the French bash manual: the quite useful
"COMMANDES INTERNES DE l'INTERPRÉTEUR" section (SHELL BUILTIN COMMANDS)
does not appear, because of the letter É.

I propose using character classes instead of ASCII ranges in the
relevant regexp defvars. This should not change anything for English
manuals, and it should work for many other languages.

It works great for the bash manual in French.

Grégory Mounié

[-- Attachment #2: 0001-Unicode-support-for-man-section-name-detection.patch --]

From f9f8b027bcec6fe7aec2c0009eecdcd7e8880292 Mon Sep 17 00:00:00 2001
From: Grégory Mounié <Gregory.Mounie@imag.fr>
Date: Sun, 6 Aug 2017 01:22:58 +0200
Subject: [PATCH] Unicode support for man section name detection

* lisp/man.el: Replace ascii interval by character class in order to
detect correctly the section names in the table of content (eg. in the
french version of the bash manual).

Copyright-paperwork-exempt: yes
---
 lisp/man.el | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/lisp/man.el b/lisp/man.el
index 0e1c92956b..97a4758e7e 100644
--- a/lisp/man.el
+++ b/lisp/man.el
@@ -278,21 +278,21 @@ Man-cooked-hook
   :type 'hook
   :group 'man)
 
-(defvar Man-name-regexp "[-a-zA-Z0-9_+][-a-zA-Z0-9_.:+]*"
+(defvar Man-name-regexp "[-[:alnum:]_+][-[:alnum:]_.:+]*"
   "Regular expression describing the name of a manpage (without section).")
 
-(defvar Man-section-regexp "[0-9][a-zA-Z0-9+]*\\|[LNln]"
+(defvar Man-section-regexp "[[:digit:]][[:alnum:]+]*\\|[LNln]"
   "Regular expression describing a manpage section within parentheses.")
 
 (defvar Man-page-header-regexp
   (if (string-match "-solaris2\\." system-configuration)
-      (concat "^[-A-Za-z0-9_].*[ \t]\\(" Man-name-regexp
+      (concat "^[-[:alnum:]_].*[ \t]\\(" Man-name-regexp
               "(\\(" Man-section-regexp "\\))\\)$")
     (concat "^[ \t]*\\(" Man-name-regexp
             "(\\(" Man-section-regexp "\\))\\).*\\1"))
   "Regular expression describing the heading of a page.")
 
-(defvar Man-heading-regexp "^\\([A-Z][A-Z0-9 /-]+\\)$"
+(defvar Man-heading-regexp "^\\([[:upper:]][[:upper:][:digit:] /-]+\\)$"
   "Regular expression describing a manpage heading entry.")
 
 (defvar Man-see-also-regexp "SEE ALSO"
-- 
2.13.3
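The effect of the heading change in the patch above can be checked with a small sketch. The two regexps are copied from the old and new values of Man-heading-regexp; the names prefixed with my- are hypothetical and not part of man.el, and binding case-fold-search to nil is an assumption made here so that the result does not depend on user settings. The test heading is "VEJA TAMBÉM", written with a \u escape to avoid file-encoding issues.

```elisp
;; Hypothetical names (my-...): none of these exist in man.el.
;; The regexps are the old and new values of `Man-heading-regexp'.
(defconst my-man-heading-ascii-re "^\\([A-Z][A-Z0-9 /-]+\\)$"
  "Heading regexp before the patch (ASCII ranges).")

(defconst my-man-heading-class-re
  "^\\([[:upper:]][[:upper:][:digit:] /-]+\\)$"
  "Heading regexp proposed in the patch (character classes).")

(defun my-man-heading-p (regexp line)
  "Return t if LINE matches REGEXP, nil otherwise.
Assumption: matching is case-sensitive, so `case-fold-search' is
bound to nil here."
  (let ((case-fold-search nil))
    (and (string-match-p regexp line) t)))

;; "VEJA TAMB\u00c9M" is "VEJA TAMBÉM", one of the headings mentioned
;; later in the thread.
(list (my-man-heading-p my-man-heading-ascii-re "VEJA TAMB\u00c9M")
      (my-man-heading-p my-man-heading-class-re "VEJA TAMB\u00c9M"))
```

With the ASCII ranges the É stops the match before the end of the line, so the heading is rejected; with [:upper:] the whole line matches. A plain ASCII heading such as "SEE ALSO" matches both forms.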
* bug#27978: Detection of section name in man.el
  2017-08-05 23:44 bug#27978: Detection of section name in man.el Grégory Mounié
@ 2017-08-18  8:49 ` Eli Zaretskii
From: Eli Zaretskii @ 2017-08-18 8:49 UTC
To: Grégory Mounié; +Cc: 27978-done

> From: Grégory Mounié <Gregory.Mounie@imag.fr>
> Date: Sun, 6 Aug 2017 01:44:19 +0200
>
> When parsing manuals in languages with non-ASCII letters, section
> names that use non-ASCII letters are not added to the table of contents.
>
> I noticed the bug while reading the French bash manual: the quite useful
> "COMMANDES INTERNES DE l'INTERPRÉTEUR" section (SHELL BUILTIN COMMANDS)
> does not appear, because of the letter É.
>
> I propose using character classes instead of ASCII ranges in the
> relevant regexp defvars. This should not change anything for English
> manuals, and it should work for many other languages.

Thanks, I pushed these changes with some minor adjustments.
Specifically:

> -(defvar Man-section-regexp "[0-9][a-zA-Z0-9+]*\\|[LNln]"
> +(defvar Man-section-regexp "[[:digit:]][[:alnum:]+]*\\|[LNln]"
>    "Regular expression describing a manpage section within parentheses.")

I didn't change this one, because I think a section always uses only
ASCII letters and numbers, as in ".1n". If you disagree, can you show
an example where this is not so?

> -(defvar Man-heading-regexp "^\\([A-Z][A-Z0-9 /-]+\\)$"
> +(defvar Man-heading-regexp "^\\([[:upper:]][[:upper:][:digit:] /-]+\\)$"
>    "Regular expression describing a manpage heading entry.")

I see no reason to replace 0-9 with [:digit:] here, since I think
non-ASCII digits will never be used in this context. Do you agree?

Incidentally, I see quite a few similar regexps elsewhere in man.el.
Did you audit all of them and establish that they don't need similar
changes? If not, would you like to propose similar changes there?
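As a rough illustration of why the ASCII form of Man-section-regexp was kept, the sketch below checks a few section strings against it. The regexp is the unchanged value quoted above; the my- names and the whole-string anchoring with \` and \' are assumptions made for this example, not how man.el itself applies the regexp.

```elisp
;; Hypothetical names (my-...): not part of man.el.
;; The regexp is the ASCII value of `Man-section-regexp' that was kept.
(defconst my-man-section-re "[0-9][a-zA-Z0-9+]*\\|[LNln]"
  "Section regexp with ASCII ranges, as left unchanged.")

(defun my-man-section-p (string)
  "Return t if STRING as a whole looks like a manpage section.
Assumption: the match is anchored to the whole string and is
case-sensitive."
  (let ((case-fold-search nil))
    (and (string-match-p
          (concat "\\`\\(?:" my-man-section-re "\\)\\'")
          string)
         t)))

;; Typical ASCII sections are accepted; a section containing a
;; non-ASCII letter ("1\u00e9" is "1é") is not.
(mapcar #'my-man-section-p '("1n" "3ssl" "L" "1\u00e9"))
```

Sections such as "1n" or "3ssl" are plain ASCII and match; a counterexample of the kind asked for in the message would have to look like the last entry.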
[parent not found: <4f29a934-24db-6d10-db27-fd3a3a0c1269@imag.fr>]
* bug#27978: Detection of section name in man.el
  [not found] ` <4f29a934-24db-6d10-db27-fd3a3a0c1269@imag.fr>
@ 2017-08-18 19:23 ` Eli Zaretskii
From: Eli Zaretskii @ 2017-08-18 19:23 UTC
To: Grégory Mounié; +Cc: 27978

[Please keep the bug address on the CC list.]

> From: Grégory Mounié <Gregory.Mounie@imag.fr>
> Date: Fri, 18 Aug 2017 19:53:44 +0200
>
> In brief, I would not change the other a-zA-Z regexps (details below).
>
> But I would change the SEE ALSO regexp (around line 298) to add other
> languages. Should I file another bug report with another patch?
>
> (defvar Man-see-also-regexp "SEE ALSO"
>   "Regular expression for SEE ALSO heading (or your equivalent).
> This regexp should not start with a `^' character.")
>
> Using the Debian manpage translations as a reference, and running
> "zgrep -h SH man*/* | sort | uniq -c | sort -n" inside the appropriate
> /usr/share/man subdirectories to infer the values, I propose:
>
> "SEE ALSO\|VOIR AUSSI\|SIEHE AUCH\|VÉASE TAMBIÉN\|VEJA TAMBÉM\|VEDERE
> ANCHE\|ZOBACZ TAKŻE\|İLGİLİ BELGELER\|参照\|参见 SEE ALSO\|參見 SEE ALSO"
>
> (French, German, Spanish, Portuguese, Italian, Polish, Turkish,
> Japanese, Chinese CN, Chinese TW)

OK. If no one objects, I will make this change soon.

Thanks.

> Details below about the a-zA-Z regexps:
>
> On 18/08/2017 at 10:49, Eli Zaretskii wrote:
> >
> > Thanks, I pushed these changes with some minor adjustments.
> > Specifically:
> >
> >> -(defvar Man-section-regexp "[0-9][a-zA-Z0-9+]*\\|[LNln]"
> >> +(defvar Man-section-regexp "[[:digit:]][[:alnum:]+]*\\|[LNln]"
> >>    "Regular expression describing a manpage section within parentheses.")
> >
> > I didn't change this one, because I think a section always uses only
> > ASCII letters and numbers, as in ".1n". If you disagree, can you show
> > an example where this is not so?
> >
>
> I have installed the various multilingual standard manpages available
> on my Debian system and grep did not turn up a counterexample, so I
> guess it is fine as it is.
>
> >> -(defvar Man-heading-regexp "^\\([A-Z][A-Z0-9 /-]+\\)$"
> >> +(defvar Man-heading-regexp "^\\([[:upper:]][[:upper:][:digit:] /-]+\\)$"
> >>    "Regular expression describing a manpage heading entry.")
> >
> > I see no reason to replace 0-9 with [:digit:] here, since I think
> > non-ASCII digits will never be used in this context. Do you agree?
> >
> > Incidentally, I see quite a few similar regexps elsewhere in man.el,
> > did you audit all of them and established that they don't need similar
> > changes? If not, would you like to propose similar changes there?
> >
>
> There are 18 occurrences of a-zA-Z. They look like detection logic
> carefully crafted over time, so I would not change them without a
> counterexample either.
>
> The first four occurrences seem related to parsing external commands,
> with particularities in the Windows port, so I would not recommend
> changing them.
> Occurrences 5 to 18 try to guess the manpage name around POS. The main
> pattern is "-a-zA-Z0-9._+:"
>
> With the same set of multilingual manpages, I found only one character
> that is used in a manpage name and is not in the set: "[" (man [ leads
> you to test). I suspect that adding "[" would cause more regressions
> than it would fix.
>
> Note that at line 720 the pattern is slightly different (it lacks
> "-._:"). I do not really understand why.
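The multilingual SEE ALSO proposal quoted above amounts to an alternation of literal headings. One possible way to build such a regexp is sketched below, using only a few of the headings listed in the message. The my- names are hypothetical, and assembling the pattern with mapconcat and regexp-quote is a choice made for this example; the proposal in the thread writes the alternation out by hand.

```elisp
;; Hypothetical names (my-...): not part of man.el.
;; A subset of the headings proposed in the thread.
;; "VEJA TAMB\u00c9M" is "VEJA TAMBÉM".
(defconst my-man-see-also-headings
  '("SEE ALSO" "VOIR AUSSI" "SIEHE AUCH" "VEJA TAMB\u00c9M")
  "Localized SEE ALSO headings used for this sketch.")

(defconst my-man-see-also-re
  ;; Quote each heading and join with \| so that every entry is
  ;; matched literally.  No leading ^, as the docstring of
  ;; `Man-see-also-regexp' requires.
  (mapconcat #'regexp-quote my-man-see-also-headings "\\|")
  "Alternation matching any of `my-man-see-also-headings'.")

(defun my-man-see-also-p (line)
  "Return t if LINE is exactly one of the SEE ALSO headings.
Assumption: case-sensitive, whole-line match."
  (let ((case-fold-search nil))
    (and (string-match-p
          (concat "^\\(?:" my-man-see-also-re "\\)$")
          line)
         t)))

(mapcar #'my-man-see-also-p '("VOIR AUSSI" "VEJA TAMB\u00c9M" "NAME"))
```

Keeping the headings in a list makes it easy to add a language later; inside a Lisp string the alternation operator has to be written as "\\|", which the helper takes care of.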