From: Eli Zaretskii
Newsgroups: gmane.emacs.help
To: help-gnu-emacs@gnu.org
Subject: Re: Regexp capturing unicode characters
Date: Thu, 01 Aug 2024 15:10:57 +0300
Message-ID: <86le1gwii6.fsf@gnu.org>

> Date: Thu, 01 Aug 2024 11:26:40 +0000
> From: Heime
> Cc: help-gnu-emacs@gnu.org
>
> On Thursday, August 1st, 2024 at 5:15 PM, Eli Zaretskii wrote:
>
> > > Date: Wed, 31 Jul 2024 21:24:46 +0000
> > > From: Heime heimeborgia@protonmail.com
> > >
> > > I am using unicode characters in my elisp code (e.g. foreign language symbols in icelandic
> > > and spanish).
> > >
> > > Is the regexp [[:word:]] appropriate to capture them ?
> >
> > No. [[:word:]] matches characters that have the word syntax, so which
> > characters match depends on the major mode. My suggestion is to use
> > either [[:alnum:]] or [[:alpha:]] instead, depending on whether you
> > want or don't want to match digit characters.
> >
> > The meaning of each character class is documented in the "Char
> > Classes" node of the ELisp Reference manual, I suggest to read it and
> > choose the most appropriate one for your needs.
>
> It is difficult to determine from a character class, the actual character.

Why do you need that?  Don't you know which characters you'd like to
match?

> Is there a way to show the characters that are members of each class ?

No, but you can check whether each character matches a class.

> Thought that [:multibyte:] captured the unicode characters. But even when
> I applied (set-buffer-multibyte t) to the buffer, I did not get matches.

Don't use [:multibyte:], it is hardly ever the right thing nowadays.

> Does [:word:] mean word in the english language only ?

No, it means characters that have the word _syntax_.  IOW, which
characters match depends on the major mode's syntax table.  If you are
classifying characters from human-readable text, [:word:] is not the
right thing to use.
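
For illustration, here is a minimal Emacs Lisp sketch of the two points
above: checking characters one by one against a class, and how [:word:]
follows the syntax table while [:alpha:] does not.  The helper names
`my-char-in-class-p', `my-chars-in-class' and `my-word-vs-alpha-demo'
are made up for this example and are not part of Emacs; the syntax
table that treats ð as punctuation is likewise an assumption, standing
in for whatever a major mode might define.

```elisp
;;; -*- lexical-binding: t; coding: utf-8 -*-
;; Sketch only: all `my-...' names below are hypothetical helpers.

(defun my-char-in-class-p (char class)
  "Return t if CHAR matches the regexp character class CLASS.
CLASS is the class name without brackets, e.g. \"alpha\" or \"alnum\"."
  (let ((case-fold-search nil))         ; keep [:upper:]/[:lower:] strict
    (and (string-match-p (format "\\`[[:%s:]]\\'" class)
                         (char-to-string char))
         t)))

(defun my-chars-in-class (class from to)
  "Return a string of the characters in FROM..TO (inclusive) matching CLASS."
  (let ((c from)
        (found '()))
    (while (<= c to)
      (when (my-char-in-class-p c class)
        (push c found))
      (setq c (1+ c)))
    (concat (nreverse found))))

;; Icelandic and Spanish letters are alphabetic; a digit is only alphanumeric.
(my-char-in-class-p ?ð "alpha")   ; t
(my-char-in-class-p ?ñ "alpha")   ; t
(my-char-in-class-p ?5 "alpha")   ; nil
(my-char-in-class-p ?5 "alnum")   ; t

;; Collect the [:alpha:] members of the range U+00C0..U+00FF.
(my-chars-in-class "alpha" #xC0 #xFF)

(defun my-word-vs-alpha-demo ()
  "Return (WORD-MATCH ALPHA-MATCH) for ð when its syntax is punctuation."
  (let ((table (make-syntax-table)))
    ;; Assumption for the demo: some mode gives ð punctuation syntax.
    (modify-syntax-entry ?ð "." table)
    (with-temp-buffer
      (insert "ð")
      (goto-char (point-min))
      (with-syntax-table table
        (list (and (looking-at-p "[[:word:]]") t)
              (and (looking-at-p "[[:alpha:]]") t))))))

(my-word-vs-alpha-demo)   ; (nil t)
```

Scanning a range this way is only practical for small blocks of
characters; for matching human-readable text, using [[:alpha:]] or
[[:alnum:]] directly in the regexp is enough.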