From: Derick Eddington
Newsgroups: gmane.emacs.help
Subject: Regular expressions for Unicode general categories
Date: Sun, 07 Dec 2008 12:47:13 -0800
Message-ID: <1228682833.4393.35.camel@eep>
To: help-gnu-emacs@gnu.org

Hello,

I am making an Emacs regular expression for matching R6RS Scheme "identifiers" (part of the syntax highlighting of a major mode I'm making), and it needs to match characters based on their Unicode general categories.
It seems Emacs regular expressions do not provide a way to do that directly. (I'm using Emacs 23.0.60.1, and I couldn't find anything about this in the Info docs, emacswiki.org, or this list's archives.) So I computed regular expression character sets for the needed general categories (using `get-char-code-property') and placed them at their positions in the larger regular expression. My problem is that I can't use the result, because I get this error:

  Error during redisplay: (invalid-regexp Regular expression too big)

That is understandable, because the general category character sets are giant and several of them are used. I suspect they would have been too inefficient anyway.

So, what can I do? If the backslash construct `\cC' in Emacs regular expressions supported Unicode general categories, or if there were some other construct that did, I think that would do it nicely. Is that planned, or should I resort to doing more manual parsing, or something else?

For context, identifiers need to be recognized using their complete lexical specification because I'm also highlighting numbers. Numbers have a lexical syntax that overlaps with identifiers, so identifiers need to be fontified first to keep them from being partially fontified as numbers.

Thank you for any help,

-- 
: Derick
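
For reference, the "more manual parsing" route mentioned above could be sketched roughly as follows: instead of expanding each general category into a character set inside the regexp, test characters one at a time with `get-char-code-property'. This is only a sketch under assumptions. Every `my-r6rs-' name is hypothetical, and the category list is one reading of the R6RS rule for non-ASCII identifier characters that should be checked against the report.

```elisp
;; Sketch only: all `my-r6rs-' names are hypothetical, not part of Emacs.
;; ASSUMPTION: this category list reflects the R6RS rule for non-ASCII
;; identifier constituents; verify it against the report before use.
(defconst my-r6rs-constituent-categories
  '(Lu Ll Lt Lm Lo Mn Nl No Pd Pc Po Sc Sm Sk So Co)
  "General categories assumed valid for non-ASCII identifier characters.")

(defun my-r6rs-constituent-p (char)
  "Return non-nil if CHAR is non-ASCII and its category is in the list."
  (and (> char 127)
       (memq (get-char-code-property char 'general-category)
             my-r6rs-constituent-categories)))

(defun my-r6rs-skip-constituents (limit)
  "Move point forward over constituent characters, stopping at LIMIT.
LIMIT must not exceed `point-max'.  Return the number of characters skipped."
  (let ((start (point)))
    (while (and (< (point) limit)
                (my-r6rs-constituent-p (char-after)))
      (forward-char 1))
    (- (point) start)))
```

The ASCII part of an identifier can stay in a small ordinary regexp. A font-lock matcher function could alternate between that regexp and `my-r6rs-skip-constituents', then record the span it covered with `set-match-data', so the regexp itself never has to contain the large category sets.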