From: Eric Abrahamsen
Newsgroups: gmane.emacs.help
Subject: Using syntax tables to parse buffer content
Date: Tue, 18 May 2021 14:02:27 -0700
Message-ID: <875yzfwyak.fsf@ericabrahamsen.net>
To: help-gnu-emacs@gnu.org

Hi!

I often find myself parsing buffer or file contents using regular
expressions, and would much rather use lower-level character syntax to
do it, for reasons of both speed and correctness. I've been looking
into using syntax tables to assign certain classes to characters, and
then using either basic functions like `skip-syntax-forward', or maybe
`parse-partial-sexp', to pull substrings out of a buffer.

My main problem now is escaping: I don't know how to treat escaped
special characters as non-special. The simplest example is in vCard
parsing. A property line might look like this:

    URL;TYPE=homepage:https\://mygreatpage.com/
       ^    ^        ^

I've indicated the significant characters above: in general they
include semicolon, colon, equals, and comma. The colon in the URL is
escaped, and shouldn't be treated specially. These characters don't
seem to fit the existing syntax classes, so I've considered defining my
own categories for them.

The manual mentions escape syntax characters (the "\" class), but
doesn't quite make it clear *what* it escapes: I'm guessing only
open/close parentheses and string delimiters? Then there's character
quote (the "/" class), which says the following character will "lose
its normal syntactic meaning", but I can't get that to *do* anything.
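The vCard line above can be used to show the problem concretely. The
sketch below is only an illustration, not a recommended parser:
`my-vcard-syntax-table' and `my-vcard-naive-tokens' are hypothetical
names, and the table assumes that every character other than the four
delimiters can be lumped into the word class. It splits at every
punctuation-class character with `skip-syntax-forward', so the escaped
colon is wrongly treated as a delimiter too.

```elisp
;; Hypothetical table: every character is a word constituent except
;; the four vCard delimiters, which get punctuation syntax.
(defvar my-vcard-syntax-table
  (let ((table (make-syntax-table)))
    (modify-syntax-entry (cons 0 (max-char)) "w" table)
    (dolist (ch '(?\; ?: ?= ?,))
      (modify-syntax-entry ch "." table))
    table)
  "Syntax table marking the vCard delimiters as punctuation.")

(defun my-vcard-naive-tokens (line)
  "Split LINE at every punctuation character, ignoring backslash escapes."
  (with-temp-buffer
    (set-syntax-table my-vcard-syntax-table)
    (insert line)
    (goto-char (point-min))
    (let (tokens)
      (while (not (eobp))
        (let ((start (point)))
          ;; Move up to, but not over, the next punctuation character.
          (skip-syntax-forward "^.")
          (push (buffer-substring-no-properties start (point)) tokens))
        ;; Step over the delimiter itself.
        (unless (eobp)
          (forward-char 1)))
      (nreverse tokens))))
```

With the sample line this returns
("URL" "TYPE" "homepage" "https\\" "//mygreatpage.com/"): the value is
cut in two at the escaped colon, which is exactly the behaviour that
needs to be avoided.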
As a test, in a text-mode buffer I gave ?* the "/" syntax class, then
put that character before a space, thinking it might negate the
space's whitespace class. That doesn't happen, though:
(skip-syntax-forward "^ ") still stops at the space.

What am I missing, and is this kind of custom escaping possible? I can
peek back at the previous character, but at that point it's not too
different from regexp parsing.

Thanks in advance!
Eric
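The experiment and the peek-back fallback described above can be
sketched as follows. This is an illustration under stated assumptions,
not a claim about how escaping is meant to be done: all function names
are hypothetical, the experiment uses a temporary buffer with a copy of
the standard syntax table rather than text-mode, and the helper counts
preceding backslashes by hand instead of relying on any syntax class.

```elisp
;; Reproduce the experiment: `*' gets character-quote syntax ("/"),
;; and the scan is then run over "foo* bar".
(defun my-quote-experiment ()
  "Return the character at which `skip-syntax-forward' stops."
  (with-temp-buffer
    (let ((table (make-syntax-table)))
      (modify-syntax-entry ?* "/" table)
      (set-syntax-table table))
    (insert "foo* bar")
    (goto-char (point-min))
    (skip-syntax-forward "^ ")
    (char-after)))

(defun my-escaped-p (&optional escape-char)
  "Return non-nil if an odd number of ESCAPE-CHAR precede point.
ESCAPE-CHAR defaults to a backslash."
  (let ((esc (or escape-char ?\\))
        (count 0)
        (pos (point)))
    (while (eq (char-before pos) esc)
      (setq count (1+ count)
            pos (1- pos)))
    (= (% count 2) 1)))

(defun my-skip-to-unescaped (syntax &optional escape-char)
  "Move to the next unescaped character whose class is in SYNTAX.
SYNTAX is a string of syntax class characters, such as \".\".
Return the new value of point."
  (let ((done nil))
    (while (not done)
      (skip-syntax-forward (concat "^" syntax))
      (if (or (eobp) (not (my-escaped-p escape-char)))
          (setq done t)
        ;; Escaped delimiter: step over it and keep scanning.
        (forward-char 1)))
    (point)))

(defun my-unescaped-prefix (text)
  "Return TEXT up to its first unescaped `:' or `;'."
  (with-temp-buffer
    (let ((table (make-syntax-table)))
      ;; Assumption: everything except the delimiters is a word char.
      (modify-syntax-entry (cons 0 (max-char)) "w" table)
      (modify-syntax-entry ?: "." table)
      (modify-syntax-entry ?\; "." table)
      (set-syntax-table table))
    (insert text)
    (goto-char (point-min))
    (my-skip-to-unescaped ".")
    (buffer-substring-no-properties (point-min) (point))))
```

Here (my-quote-experiment) returns the space character, matching the
result reported above, while
(my-unescaped-prefix "https\\://mygreatpage.com/;next") returns the
whole URL including its escaped colon. The backslash check is still a
peek at the preceding characters, so this does not remove the
objection that it resembles regexp parsing.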