Using syntax tables to parse buffer content

* Using syntax tables to parse buffer content
From: Eric Abrahamsen @ 2021-05-24 20:21 UTC
  To: emacs-devel

[I sent this to emacs.help a few days ago, but am hoping I'll have
better luck over here, sorry...]

Hi!

I often find myself parsing buffer or file contents using regular
expressions, and would much rather be using lower-level character syntax
to do it, both for reasons of speed and correctness. I've been looking
into using syntax tables to assign certain classes to characters, and
using either basic stuff like `skip-syntax-forward', or maybe
`parse-partial-sexp', to pull substrings out of a buffer.
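A minimal sketch of the idea described above, assuming the delimiters are
simply given punctuation syntax in a private table. The names
`my-vcard-syntax-table' and `my-vcard-first-token' are hypothetical and
exist only for illustration; escaping is not handled here.

```elisp
(defvar my-vcard-syntax-table
  (let ((table (make-syntax-table)))
    ;; Mark the structural delimiters as punctuation (class ".").
    (dolist (ch '(?\; ?: ?= ?,))
      (modify-syntax-entry ch "." table))
    table)
  "Hypothetical syntax table for splitting vCard-like lines.")

(defun my-vcard-first-token (line)
  "Return the text of LINE up to its first punctuation character."
  (with-temp-buffer
    (insert line)
    (goto-char (point-min))
    (with-syntax-table my-vcard-syntax-table
      ;; Skip every character whose syntax class is not punctuation.
      (skip-syntax-forward "^.")
      (buffer-substring-no-properties (point-min) (point)))))

(my-vcard-first-token "URL;TYPE=homepage:https\\://mygreatpage.com/")
;; => "URL"
```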

My main problem now is escaping: I don't know how to treat escaped
special characters as non-special. The simplest example is in vCard
parsing. A property line might look like this:

URL;TYPE=homepage:https\://mygreatpage.com/
   ^    ^        ^

I've marked the significant characters in this line; in general they are
semicolon, colon, equals, and comma. The colon in the URL is escaped
with a backslash, and shouldn't be treated specially. These characters
don't seem to fit the existing syntax classes, so I've considered
defining my own categories for them.

The manual mentions escape syntax characters (the "\" class), but
doesn't quite make it clear *what* it escapes: I'm guessing only
open/close parentheses, and string delimiters? Then there's character
quote (the "/" class), which says the following character will "lose its
normal syntactic meaning", but I can't get that to *do* anything.

For example, in a text-mode test buffer, I add the "/" syntax class to
?*, then put that character before a space character, thinking it might
negate the space's whitespace class. That doesn't happen, though, as
(skip-syntax-forward "^ ") still stops at the space.
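The experiment above can be reproduced roughly as follows. This sketch
uses a temporary buffer instead of a text-mode buffer (an assumption made
for brevity), and `my-charquote-skip-demo' is a hypothetical name.

```elisp
(defun my-charquote-skip-demo ()
  "Return point after `skip-syntax-forward' in a buffer holding \"foo* bar\"."
  (with-temp-buffer
    (insert "foo* bar")
    (let ((table (make-syntax-table)))
      ;; Give `*' character-quote syntax (class "/"), as in the experiment.
      (modify-syntax-entry ?* "/" table)
      (with-syntax-table table
        (goto-char (point-min))
        ;; Skip characters whose own syntax class is not whitespace.
        (skip-syntax-forward "^ ")
        (point)))))

(my-charquote-skip-demo)
;; => 5, i.e. point stops just before the space
```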

What am I missing, and is this kind of custom escaping possible? I can
peek back at the previous character, but at that point it's not too
different from regexp parsing.
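For comparison, the "peek back" approach might look like the sketch
below. It assumes backslash is the only escape character and that an odd
number of preceding backslashes means "escaped"; both function names are
hypothetical.

```elisp
(defun my-escaped-p (pos)
  "Return non-nil if an odd number of backslashes precedes POS."
  (save-excursion
    (goto-char pos)
    ;; `skip-chars-backward' returns zero or a negative distance.
    (not (zerop (% (skip-chars-backward "\\\\") 2)))))

(defun my-next-unescaped-delimiter ()
  "Move point to the next unescaped delimiter or to end of buffer.
Return the new value of point."
  (skip-chars-forward "^;:=,")
  (while (and (not (eobp)) (my-escaped-p (point)))
    ;; The delimiter at point is escaped: step over it and keep looking.
    (forward-char 1)
    (skip-chars-forward "^;:=,"))
  (point))

(with-temp-buffer
  (insert "https\\://x:y")
  (goto-char (point-min))
  (my-next-unescaped-delimiter))
;; => 11, the position of the second, unescaped colon
```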

Thanks in advance!
Eric

* Re: Using syntax tables to parse buffer content
From: Stefan Monnier @ 2021-05-24 21:07 UTC
  To: Eric Abrahamsen; +Cc: emacs-devel

> For example, in a text-mode test buffer, I add the "/" syntax class to
> ?*, then put that character before a space character, thinking it might
> negate the space's whitespace class. That doesn't happen, though, as
> (skip-syntax-forward "^ ") still stops at the space.

skip-syntax-forward only looks at the actual syntax, so it doesn't pay
attention to anything before/after.  The "/" class is effective when you
consider operations like `forward-sexp`, which might consider
`foo\ bar` as a single "symbol" rather than two.
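A small sketch of the difference described above, assuming backslash is
given escape syntax (the "\" class) in the active table. The helper
`my-sexp-text' is a hypothetical name used only for illustration.

```elisp
(defun my-sexp-text (text)
  "Return the part of TEXT that a single `forward-sexp' moves over."
  (with-temp-buffer
    (insert text)
    (goto-char (point-min))
    (let ((table (make-syntax-table)))
      ;; Make the escape syntax of backslash explicit.
      (modify-syntax-entry ?\\ "\\" table)
      (with-syntax-table table
        (forward-sexp 1)
        (buffer-substring-no-properties (point-min) (point))))))

(my-sexp-text "foo\\ bar baz")  ; => "foo\\ bar" (escaped space is kept)
(my-sexp-text "foo bar baz")    ; => "foo"
```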


        Stefan

* Re: Using syntax tables to parse buffer content
From: Eric Abrahamsen @ 2021-05-26 16:43 UTC
  To: Stefan Monnier; +Cc: emacs-devel

Stefan Monnier <monnier@iro.umontreal.ca> writes:

>> For example, in a text-mode test buffer, I add the "/" syntax class to
>> ?*, then put that character before a space character, thinking it might
>> negate the space's whitespace class. That doesn't happen, though, as
>> (skip-syntax-forward "^ ") still stops at the space.
>
> skip-syntax-forward only looks at the actual syntax, so it doesn't pay
> attention to anything before/after.

Aha! Maybe I can suggest some documentation patches here.

> The "/" class is effective when you consider operations like
> `forward-sexp`, which might consider `foo\ bar` as a single "symbol"
> rather than two.

So would you suggest that I slightly abuse the concept of sexps here?
Maybe start with a completely blank syntax table, and give "=:;,"
punctuation class, and then `forward-sexp' to eat up chunks of text...
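One way the idea above could be sketched, under stated assumptions:
every character is first made a word constituent (standing in for a
"blank" table), the four delimiters become punctuation, and backslash
gets escape syntax. The names `my-vcard-chunk-syntax-table' and
`my-vcard-chunks' are hypothetical. The sketch drops the delimiters
themselves and does not handle empty fields.

```elisp
(defvar my-vcard-chunk-syntax-table
  (let ((table (make-syntax-table)))
    ;; Start "blank": treat every character as a word constituent.
    (modify-syntax-entry (cons 0 (max-char)) "w" table)
    ;; The structural delimiters become punctuation.
    (dolist (ch '(?\; ?: ?= ?,))
      (modify-syntax-entry ch "." table))
    ;; Backslash escapes the character that follows it.
    (modify-syntax-entry ?\\ "\\" table)
    table)
  "Hypothetical syntax table for chunking vCard-like lines.")

(defun my-vcard-chunks (line)
  "Split LINE into chunks at unescaped delimiters."
  (with-temp-buffer
    (insert line)
    (goto-char (point-min))
    (with-syntax-table my-vcard-chunk-syntax-table
      (let (chunks)
        ;; Step over delimiters, then let `forward-sexp' eat one chunk.
        (while (progn (skip-syntax-forward ".") (not (eobp)))
          (let ((start (point)))
            (forward-sexp 1)
            (push (buffer-substring-no-properties start (point)) chunks)))
        (nreverse chunks)))))

(my-vcard-chunks "URL;TYPE=homepage:https\\://mygreatpage.com/")
;; => ("URL" "TYPE" "homepage" "https\\://mygreatpage.com/")
```

Because `forward-sexp' skips punctuation on its own, the explicit
`skip-syntax-forward' call is what records where each chunk starts.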




* Using syntax tables to parse buffer content
From: Eric Abrahamsen @ 2021-05-18 21:02 UTC
  To: help-gnu-emacs


* Re: Using syntax tables to parse buffer content
From: Jean Louis @ 2021-05-18 22:35 UTC
  To: Eric Abrahamsen; +Cc: help-gnu-emacs

* Eric Abrahamsen <eric@ericabrahamsen.net> [2021-05-19 00:04]:
> My main problem now is escaping: I don't know how to treat escaped
> special characters as non-special. The simplest example is in vCard
> parsing. A property line might look like this:
> 
> URL;TYPE=homepage:https\://mygreatpage.com/

That is when important "standards" like vCard are written by people
that lack global knowledge of data structures. Would they write it in
LISP data or at least XML, we could all easily parse it, including by
using other programming languages. But no...

-- 
Jean


* Re: Using syntax tables to parse buffer content
From: Eric Abrahamsen @ 2021-05-18 22:53 UTC
  To: help-gnu-emacs

Jean Louis <bugs@gnu.support> writes:

> That is when important "standards" like vCard are written by people
> that lack global knowledge of data structures. Would they write it in
> LISP data or at least XML, we could all easily parse it, including by
> using other programming languages. But no...

There have been further efforts based on XML and JSON, but nothing has
quite gained the currency of vCard, so here we are...





* Re: Using syntax tables to parse buffer content
From: Jean Louis @ 2021-05-18 23:21 UTC
  To: Eric Abrahamsen; +Cc: help-gnu-emacs

* Eric Abrahamsen <eric@ericabrahamsen.net> [2021-05-19 01:54]:
> There have been further efforts based on XML and JSON, but nothing has
> quite gained the currency of vCard, so here we are...

Is it vCard that you wish to parse?

I have made some vCard exporting functions. But I would need
importing. Some packages already exist.


-- 
Jean


* Re: Using syntax tables to parse buffer content
From: Eric Abrahamsen @ 2021-05-19  0:26 UTC
  To: help-gnu-emacs

Jean Louis <bugs@gnu.support> writes:

> Is it vCard that you wish to parse?
>
> I have made some vCard exporting functions. But I would need
> importing. Some packages already exist.

I've actually already written the package!

https://elpa.gnu.org/packages/vcard.html

There were some existing implementations, but all seemingly part of
other packages. I needed something for EBDB, and wanted to write a
library that was pure vCard->Lisp, so it would be useful to other people
as well. The library parses to Elisp structures that can be consumed by
anyone. I'm doing it with (very ugly) regular expressions now, and
want to move to syntax parsing, which I suspect will be much faster.

I'd also just like to learn this technique, as I suspect it would also
be very useful in several places in Gnus.
