confusion over undocumented syntax-table features, font-lock and syntax-tables

unofficial mirror of bug-gnu-emacs@gnu.org 

* confusion over undocumented syntax-table features, font-lock and syntax-tables
@ 2003-02-11  5:08 Matthew Swift
  0 siblings, 0 replies; 7+ messages in thread
From: Matthew Swift @ 2003-02-11  5:08 UTC (permalink / raw)


This bug report will be sent to the Free Software Foundation,
not to your local site managers!
Please write in English, because the Emacs maintainers do not have
translators to read other languages for them.

Your bug report will be posted to the bug-gnu-emacs@gnu.org mailing list,
and to the gnu.emacs.bug news group.

In GNU Emacs 21.2.1 (i386-debian-linux-gnu, X toolkit, Xaw3d scroll bars)
 of 2002-11-06 on beth, modified by Debian
configured using `configure  i386-debian-linux-gnu --prefix=/usr/local --sharedstatedir=/var/lib --libexecdir=/usr/local/lib --localstatedir=/var/lib --infodir=/usr/local/share/info --mandir=/usr/local/share/man --with-pop=yes --with-x=yes --with-x-toolkit=athena --without-gif'
Important settings:
  value of $LC_ALL: nil
  value of $LC_COLLATE: nil
  value of $LC_CTYPE: nil
  value of $LC_MESSAGES: nil
  value of $LC_MONETARY: nil
  value of $LC_NUMERIC: nil
  value of $LC_TIME: nil
  value of $LANG: nil
  locale-coding-system: nil
  default-enable-multibyte-characters: t

Please describe exactly what actions triggered the bug
and the precise symptoms of the bug:

I was observing strange behavior in `sh-mode' (defined in sh-script.el):
(re-search-forward "\\s<\\s<") was failing even though it was passing over a
buffer substring of two characters whose syntax classes, as reported by
`(char-syntax (char-after N))' for positions N and N+1, were both "<".

I have not figured out why that happens, and it may not be a bug, but in my
experiments, I have come across a barrel full of puzzles and questions.  I am
reporting as much as I have been able to distinguish.

The results of the following code completely baffle me.  Is
global-font-lock-mode changing the syntax classes?

-----cut here
    (setq test "
    hello () { echo world.; }
    ## boln is at buffer position 40
    ")
    (defun test ()
      (sh-mode)
      (message "result is %S"
               (if (and
                    (equal "<" (char-to-string (char-syntax ?#)))
                    (equal (char-after 40) ?#)
                    (equal (char-after 41) ?#)
                    (equal "<" (char-to-string (char-syntax (char-after 40))))
                    (equal "<" (char-to-string (char-syntax (char-after 41)))))
                   (save-excursion
                     (goto-char (point-min))
                     (re-search-forward "\\s<\\s<"))
                 "whoops!")))
    (progn
      (global-font-lock-mode 0)
      ;; succeeds
      (test))
    (progn
      (global-font-lock-mode 1)
      ;; `re-search-forward' fails the SECOND time, if not the first (no
      ;; pattern found)
      (test))

    ;;(sh-mode)
    ;;(emacs-lisp-mode)
    ;;(global-font-lock-mode)
    ;;(test)
---- end of test file

The facility for matching chars in syntax descriptors is either not fully
documented or has some other problems.  Looking into it further would take more
time than I have at the moment.

sh-script.el says:

    (defvar sh-mode-syntax-table
      '((sh eval sh-mode-syntax-table ()
            ?\# "<"
            ?\n ">#"
            ?\" "\"\""
            ?\' "\"'"
            ?\` "\"`"
            ?! "_"
            ?% "_"
            ?: "_"
            ?. "_"
            ?^ "_"
            ?~ "_"
            ?< "."
            ?> ".")
        (csh eval identity sh)
        (rc eval identity sh))

      "Syntax-table used in Shell-Script mode.  See `sh-feature'.")

Consider the second entry in the table, which is the equivalent of

         (modify-syntax-entry ?\n ">#")

The documentation for syntax descriptors says (both in Texinfo and in
functions' docstrings) that the second character, the matching character, is
"used" only when the syntax class is "(" or ")" (open or close parentheses).

The declaration above assigns a matching character to a character with the
endcomment syntax class.  The documentation does not say doing this is an
error.  But from here, all possibilities imply one or more problems.  (I
should add that several major modes also assign matching characters to chars
in the string-delimiter (") class, usually the character itself, e.g., " with
" and ' with '; this usage is likewise problematic.)

If the declaration of ">#" is equivalent to ">", with respect to all Emacs
primitives and distributed Lisp code, then

   + sh-script.el should use simply ">" for clarity.

   It may be desirable to leave in a facility for assigning matching chars to
   non-paren classes, so that programmers can do something with it.  If so,
   brief mention should be made in the Texinfo documentation, if not the
   docstrings.  If not, then

       + it should be documented that matching chars are ignored except
         for the "(" and ")" classes;

       + `modify-syntax-entry' should decline to install ignored matching chars
         by either signalling an error or by silently deleting the matching
         char;

       + `describe-syntax' should decline to report matching chars that do not
         have any significance, because reporting them is confusing
         (`describe-syntax' will report that ?\n matches ?#, and likewise if
         you assign matching chars to chars in other syntax classes for which
         matching seems irrelevant).

If the declaration of ">#" is not equivalent to ">", then either the behavior
is undefined or it is well-defined but not documented.  If it is undefined,
then sh-script.el should not be using it.  If it is undocumented, then it
should be documented.
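
As a rough sketch of how one might probe the question above, the following
Emacs Lisp checks whether the matching character in ">#" is stored at all,
and whether one paren-oriented primitive consults it.  The function name
`demo-matching-char' is hypothetical.  `string-to-syntax' and
`matching-paren' exist in current Emacs; whether both behave identically in
21.2 is an assumption that was not checked.

```elisp
;; Hypothetical helper, for illustration only.
(defun demo-matching-char ()
  "Return a list showing how \">\" and \">#\" are stored and used."
  (with-temp-buffer
    (let ((table (make-syntax-table)))
      (modify-syntax-entry ?\n ">#" table)   ; same descriptor as sh-script.el
      (modify-syntax-entry ?\( "()" table)   ; an ordinary paren pair
      (set-syntax-table table)
      (list
       ;; Raw descriptors have the form (CLASS-CODE . MATCHING-CHAR).
       (string-to-syntax ">")                ; no matching char
       (string-to-syntax ">#")               ; matching char ?# is kept
       ;; `matching-paren' only answers for the two paren classes.
       (matching-paren ?\()
       (matching-paren ?\n)))))
```

Evaluating (demo-matching-char) should give ((12) (12 . 35) 41 nil): the
"#" is recorded in the descriptor, but `matching-paren' ignores it for an
endcomment character.  This does not settle whether some other primitive
reads the stored value.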

Recent input:
M-x r e p o r t - e m a c s - b u g <return>

Recent messages:
1 <- require: gnus-group
1 -> require: gnus-start
1 <- require: gnus-start
1 -> require: gnus-util
1 <- require: gnus-util
Loading gnus-topic...done
Loading emacsbug...
1 -> require: sendmail
1 <- require: sendmail
Loading emacsbug...done

* confusion over undocumented syntax-table features, font-lock and syntax-tables
@ 2003-02-13  3:43 Luc Teirlinck
  0 siblings, 0 replies; 7+ messages in thread
From: Luc Teirlinck @ 2003-02-13  3:43 UTC (permalink / raw)
  Cc: bug-gnu-emacs

Matthew Swift wrote:

    The results of the following code completely baffles me.  Is
    global-font-lock-mode changing the syntax classes?

Yes, using syntax text properties, and it is impossible to get the
correct shell syntax without such text properties.  (However does
anybody understand the behavior of char-syntax in the ielm run
below???)

Did you specify which shell you are using?  If it is bash, then #
starts a comment at the beginning of a word, elsewhere it has symbol
syntax.

    I was observing a strange behavior in `sh-mode' defined in
    sh-script.el where (re-search-forward "\\s<\\s<") was failing even
    though it was passing over a buffer substring of two characters
    whose syntax classes, as reported by `(char-syntax (char-after
    N))' and N+1 was "<".

I do not know how you possibly can get two consecutive characters with
comment-start syntax in bash. (I do not know about other shells.)

I cut out the "test file" you included (see below) and put point at
the beginning of the line:

## boln is at buffer position 40

Then I ran ielm (for convenience; if you prefer, you can use M-:).

The result shows the (correcting) influence of font-lock-mode:
(Ran using:
 emacs-21.3.50 -q --no-site-file --eval '(blink-cursor-mode 0)' &
This is today's CVS.)

Remember that, in the syntax-after return values, 3 stands for symbol,
11 for comment-start.

===File ~/shellsyntax=======================================
*** Welcome to IELM ***  Type (describe-mode) for help.
ELISP> (set-buffer "testfile.sh")  ;; ielm specific code
#<buffer testfile.sh>
ELISP> (current-buffer)
#<buffer testfile.sh>
ELISP> (point)
40
ELISP> parse-sexp-ignore-comments
t
ELISP> (string (char-syntax (point)))
"("  ;; goes completely over my head
ELISP> (string (char-syntax (1+ (point))))
")"  ;; this too
ELISP> (get-char-property (point) 'syntax-table)
nil
ELISP> (get-char-property (1+ (point)) 'syntax-table)
nil
ELISP> (syntax-after (point))
(11)  ;; I understand this.

ELISP> (syntax-after (1+ (point)))
(11)  ;; This too, even though it is wrong.

ELISP> (global-font-lock-mode 1)
t
ELISP> (string (char-syntax (point)))
"("  ;; ???
ELISP> (string (char-syntax (1+ (point))))
")"  ;; ???
ELISP> (get-char-property (point) 'syntax-table)
nil
ELISP> (get-char-property (1+ (point)) 'syntax-table)
(3)  ;; font-lock-mode to the rescue

ELISP> (syntax-after (point))
(11)  

ELISP> (syntax-after (1+ (point)))
(3)  ;; correct: the second # has symbol syntax,
     ;; it does not start a comment.

ELISP> sh-shell
bash
ELISP> sh-shell-file
"/usr/local/bin/bash"
ELISP> (string (char-syntax ?#))
"<"  ;; I understand this too, but how does this rhyme with the above???
ELISP> ============================================================

Do not ask me to explain the char-syntax behavior.  I have no clue.

Sincerely,

Luc.

Appendix:

Test file used:

-----cut here
(setq test "
hello () { echo world.; }
## boln is at buffer position 40
")
(defun test ()
  (sh-mode)
  (message "result is %S"
           (if (and
                (equal "<" (char-to-string (char-syntax ?#)))
                (equal (char-after 40) ?#)
                (equal (char-after 41) ?#)
                (equal "<" (char-to-string (char-syntax (char-after 40))))
                (equal "<" (char-to-string (char-syntax (char-after 41)))))
               (save-excursion
                 (goto-char (point-min))
                 (re-search-forward "\\s<\\s<"))
             "whoops!")))
(progn
  (global-font-lock-mode 0)
  ;; succeeds
  (test))
(progn
  (global-font-lock-mode 1)
  ;; `re-search-forward' fails the SECOND time, if not the first (no
  ;; pattern found)
  (test))

;;(sh-mode)
;;(emacs-lisp-mode)
;;(global-font-lock-mode)
;;(test)
---- end of test file

* Re: confusion over undocumented syntax-table features, font-lock and syntax-tables
@ 2003-02-13  4:17 Luc Teirlinck
  0 siblings, 0 replies; 7+ messages in thread
From: Luc Teirlinck @ 2003-02-13  4:17 UTC (permalink / raw)
  Cc: bug-gnu-emacs

From my previous message:

   Did you specify which shell you are using?  If it is bash, then #
   starts a comment at the beginning of a word, elsewhere it has symbol
   syntax.

To avoid confusion:

"word" is here, of course, meant in the bash sense, not in the Elisp
sense.

Sincerely,

Luc.

* Re: confusion over undocumented syntax-table features, font-lock and syntax-tables
@ 2003-02-13 15:09 Luc Teirlinck
  0 siblings, 0 replies; 7+ messages in thread
From: Luc Teirlinck @ 2003-02-13 15:09 UTC (permalink / raw)
  Cc: bug-gnu-emacs

From my previous message:

   (However does anybody understand the behavior of char-syntax in the
   ielm run below???)

Forget about this and previous remarks about char-syntax.  I just was
thinking about too many things at the same time and forgot about the
exact meaning of char-syntax for a moment.  Sorry about that.  I do
not believe that there are any bugs related to the things discussed in
this thread.  Matthew just forgot about text properties and I just
totally got confused for a while.
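
For readers who hit the same confusion, here is a small sketch of what
seems to have happened in the ielm run: `char-syntax' takes a character,
not a buffer position, so passing it the value of (point) looks up the
character whose code equals that number.  The helper name
`demo-char-syntax-argument' is hypothetical, and the two-character buffer
only stands in for the test file.

```elisp
;; Hypothetical helper, for illustration only.
(defun demo-char-syntax-argument ()
  "Contrast `char-syntax' on a number with `char-syntax' on a buffer char."
  (with-temp-buffer
    (let ((table (make-syntax-table)))
      (modify-syntax-entry ?# "<" table)     ; # starts a comment, as in sh-mode
      (set-syntax-table table))
    (insert "##")
    (list (= 40 ?\()                          ; the number 40 is the char "("
          (string (char-syntax 40))           ; so this is the syntax of "("
          (string (char-syntax 41))           ; and this the syntax of ")"
          ;; To ask about the buffer, fetch the character first.
          (string (char-syntax (char-after 1))))))
```

Evaluating (demo-char-syntax-argument) should give (t "(" ")" "<"), which
matches the "(" and ")" results seen at positions 40 and 41.  Note that
`char-syntax' also cannot see `syntax-table' text properties; `syntax-after'
is the function that takes a position.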

Sincerely,

Luc.

* Re: confusion over undocumented syntax-table features, font-lock and syntax-tables
       [not found] <mailman.1933.1045148974.21513.bug-gnu-emacs@gnu.org>
@ 2003-02-15 20:11 ` Matt Swift
  0 siblings, 0 replies; 7+ messages in thread
From: Matt Swift @ 2003-02-15 20:11 UTC (permalink / raw)


>> "L" == Luc wrote:

    L> I just was thinking about too many things at the same time and
    L> forgot about the exact meaning of char-syntax for a moment. 
[...]
    L> I do not believe that there are any bugs related to the things
    L> discussed in this thread.  Matthew just forgot about text
    L> properties and I just totally got confused for a while.

Yes, I did not even realize that a text property could override the
syntax table.  
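
A minimal sketch of that override, assuming a tiny stand-in syntax table
rather than the real sh-mode one: with `parse-sexp-lookup-properties'
non-nil, a `syntax-table' text property on the second "#" makes the regexp
search fail even though `char-syntax' still reports "<".  The helper name
`demo-syntax-property-override' is hypothetical, and the property is put on
by hand here instead of by font-lock.

```elisp
;; Hypothetical helper, for illustration only.
(defun demo-syntax-property-override ()
  "Search for two comment starters before and after adding a text property."
  (with-temp-buffer
    (let ((table (make-syntax-table)))
      (modify-syntax-entry ?# "<" table)
      (modify-syntax-entry ?\n ">" table)
      (set-syntax-table table))
    (insert "## comment\n")
    ;; Make the scanner and regexp matcher honor `syntax-table' properties.
    (set (make-local-variable 'parse-sexp-lookup-properties) t)
    (let (before after)
      (goto-char (point-min))
      (setq before (re-search-forward "\\s<\\s<" nil t))
      ;; Give the second # symbol syntax (raw descriptor (3)).
      (put-text-property 2 3 'syntax-table '(3))
      (goto-char (point-min))
      (setq after (re-search-forward "\\s<\\s<" nil t))
      (list before                            ; position after the match
            after                             ; nil: no match any more
            (string (char-syntax (char-after 2)))  ; table only: still "<"
            (syntax-after 2)))))              ; sees the property
```

The first search succeeds and the second returns nil, while
`char-syntax' keeps answering from the syntax table alone.  That is the
same pattern as in the original test: every `char-syntax' check passes, yet
`re-search-forward' finds nothing.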

One question remains, which may yet be a bug of some kind:

What is the meaning of assigning a matching char to chars in syntax
class comment-starter (`<') or string-quote (`"')?  Several packages,
including sh-script.el, make such an assignment, but I do not see that
the meaning of matching chars is documented except for syntax classes
open-parenthesis and close-parenthesis.

* confusion over undocumented syntax-table features, font-lock and syntax-tables
@ 2003-02-15 23:37 Luc Teirlinck
  2003-02-16  5:46 ` Matt Swift
  0 siblings, 1 reply; 7+ messages in thread
From: Luc Teirlinck @ 2003-02-15 23:37 UTC (permalink / raw)
  Cc: bug-gnu-emacs

Matthew Swift wrote:

    One question remains, which may yet be a bug of some kind:

    What is the meaning of assigning a matching char to chars in syntax
    class comment-starter (`<') or string-quote (`"')?  Several packages,
    including sh-script.el, make such an assignment, but I do not see that
    the meaning of matching chars is documented except for syntax classes
    open-parenthesis and close-parenthesis.

The Elisp manual indeed says:

   A syntax descriptor is a Lisp string that specifies a syntax class,
   a matching character (used only for the parenthesis classes) and flags.

Hence there indeed is either an inaccuracy in the Elisp manual and the
various documentation strings you alluded to in your prior posting or
the usage in the packages you mentioned is inappropriate.

I do not know which of the two alternatives applies.  However, use of
a matching character for comment starter and ender seems to make at
least some sense to me.  For string quote, it seems really strange.
The Elisp manual says:

 - Syntax class: string quote
     "String quote characters" (designated by `"') are used in many
     languages, including Lisp and C, to delimit string constants.
     The same string quote character appears at the beginning and the
     end of a string.  Such quoted strings do not nest.

Thus, if a package mentions the identical character as a matcher, then
this seems totally redundant.  If it mentions another character, it
seems dangerous (at least to me), since plenty of Lisp code might
expect the identical character to match, relying on the above quote.

Sincerely,

Luc.

* Re: confusion over undocumented syntax-table features, font-lock and syntax-tables
  2003-02-15 23:37 Luc Teirlinck
@ 2003-02-16  5:46 ` Matt Swift
  0 siblings, 0 replies; 7+ messages in thread
From: Matt Swift @ 2003-02-16  5:46 UTC (permalink / raw)
  Cc: Luc Teirlinck

>> "L" == Luc wrote:

    L> Hence there indeed is either an inaccuracy in the Elisp manual and the
    L> various documentation strings you alluded to in your prior posting or
    L> the usage in the packages you mentioned is inappropriate.

    L> I do not know which of the two alternatives applies. 

    L> For string quote, it seems really strange. 

    L>  - Syntax class: string quote
    L>      "String quote characters" (designated by `"') are used in many
    L>      languages, including Lisp and C, to delimit string constants.
    L>      The same string quote character appears at the beginning and the
    L>      end of a string.  Such quoted strings do not nest.

What seems to be implied here by "the same" is that if distinct chars
C and D are both declared as string-quote class, then for example the
buffer substring

  C----D++++D====C

contains the delimited string constants "----D++++D====" and "++++".
If the initial C matched the next string-quote char rather than the
next C, then the two strings would be "----" and "====".
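
One way to see which reading the scanner follows is to try it.  The sketch
below takes C to be " and D to be ', gives both string-quote syntax with no
matching character, and asks `scan-sexps' where the expression starting at
the first C ends.  The helper name `demo-string-quote-matching' is
hypothetical, and the result was reasoned out for current Emacs, not
checked against 21.2.

```elisp
;; Hypothetical helper, for illustration only.
(defun demo-string-quote-matching ()
  "Return where the string opened by the first quote ends, and point-max."
  (with-temp-buffer
    (let ((table (make-syntax-table)))
      (modify-syntax-entry ?\" "\"" table)   ; C: string quote, no matcher
      (modify-syntax-entry ?\' "\"" table)   ; D: string quote, no matcher
      (set-syntax-table table))
    (insert "\"----'++++'====\"")            ; C----D++++D====C
    ;; Scan forward over one balanced expression from the first C.
    (list (scan-sexps (point-min) 1) (point-max))))
```

If the two numbers are equal, the initial C was closed by the next C and
not by the first D, so no explicit matching character is needed to get the
"same character" behavior the manual describes.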

Perhaps the authors of these packages thought they would get the
second result if they did not specify a matching character.  Perhaps
it was indeed necessary in an earlier version of Emacs.  Perhaps it
means something significant that no one has mentioned yet.  (I checked
the source, and it would take a long time for me to answer the question
that way; I am not really interested in the answer, since it is just an
issue that came up while I was trying to solve another problem.)

    L> Thus, if a package mentions the identical character as a matcher, then
    L> this seems totally redundant.  If it mentions another character, it
    L> seems dangerous (at least to me), since plenty of Lisp code might
    L> expect the identical character to match, relying on the above quote.

Following is a crude (i.e. incomplete) listing of packages in the
latest CVS Emacs that specify a matching char for a char of class
other than ( or ).  You can't get a complete list with a single regexp
match (the quoting and quote marks will drive you crazy, and many
packages do not set the syntax table with constants), and this is
trimmed of some false hits:

    ./progmodes/m4-mode.el:(modify-syntax-entry ?# "<\n" m4-mode-syntax-table)
    ./progmodes/m4-mode.el:(modify-syntax-entry ?\n ">#" m4-mode-syntax-table)
    ./textmodes/bibtex.el:    (modify-syntax-entry ?$ "$$  " st)
    ./textmodes/sgml-mode.el:       (modify-syntax-entry ?\" "\"\"" table))
    ./textmodes/sgml-mode.el:       (modify-syntax-entry ?\' "\"'" table))

Again with latest CVS Emacs, sh-script.el doesn't set its variables simply,
but starting a fresh Emacs, entering sh-mode[bash], and then typing C-h s
gives:

C-j		>#	which means: endcomment, matches #
"		""	which means: string, matches "
'		"'	which means: string, matches '
`		"`	which means: string, matches `
