* Regexp capturing unicode characters
From: Heime @ 2024-07-31 21:24 UTC
To: Heime via Users list for the GNU Emacs text editor

I am using Unicode characters in my elisp code (e.g. foreign language
symbols in Icelandic and Spanish).

Is the regexp [[:word:]] appropriate to capture them?
* Re: Regexp capturing unicode characters
From: Heime @ 2024-07-31 21:50 UTC
To: Heime; +Cc: Heime via Users list for the GNU Emacs text editor

On Thursday, August 1st, 2024 at 9:24 AM, Heime <heimeborgia@protonmail.com> wrote:

> I am using Unicode characters in my elisp code (e.g. foreign language
> symbols in Icelandic and Spanish).
>
> Is the regexp [[:word:]] appropriate to capture them?

Although I have tried "[:multibyte:]", it did not get the matches that I
get with "[:word:]". I am using the regexp for constructing imenu
expressions.

To match

  ;; DN [główny] Sgn(Major), Lexik(Polish).

  ("Denotes"
   ,(concat "^;;\\s-+"
            "\\([[:word:]]+\\)\\s-+"
            "\\[\\([[:multibyte:]]+\\)\\]\\s-+")
   2)
* Re: Regexp capturing unicode characters
From: Eli Zaretskii @ 2024-08-01 5:15 UTC
To: help-gnu-emacs

> Date: Wed, 31 Jul 2024 21:24:46 +0000
> From: Heime <heimeborgia@protonmail.com>
>
> I am using Unicode characters in my elisp code (e.g. foreign language
> symbols in Icelandic and Spanish).
>
> Is the regexp [[:word:]] appropriate to capture them?

No. [[:word:]] matches characters that have the word syntax, so which
characters match depends on the major mode. My suggestion is to use
either [[:alnum:]] or [[:alpha:]] instead, depending on whether you
want or don't want to match digit characters.

The meaning of each character class is documented in the "Char
Classes" node of the ELisp Reference manual. I suggest reading it and
choosing the most appropriate one for your needs.
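The difference between the two suggested classes can be checked directly. What follows is a minimal sketch and not part of the original reply; the helper name my/all-chars-match-p and the sample strings are made up for illustration.

```elisp
(defun my/all-chars-match-p (class string)
  "Return t if every character of STRING matches the bracket class CLASS.
CLASS is a class name without brackets, e.g. \"alpha\" or \"alnum\".
Hypothetical helper, for illustration only."
  (and (string-match-p (format "\\`[[:%s:]]+\\'" class) string) t))

(my/all-chars-match-p "alpha" "główny")    ; => t    (non-ASCII letters count)
(my/all-chars-match-p "alpha" "año2024")   ; => nil  (digits are not letters)
(my/all-chars-match-p "alnum" "año2024")   ; => t    (letters or digits)
```

The pattern is anchored with \` and \' so that a single stray character makes the whole test fail, which is usually what one wants when validating a token.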
* Re: Regexp capturing unicode characters
From: Heime @ 2024-08-01 11:26 UTC
To: Eli Zaretskii; +Cc: help-gnu-emacs

On Thursday, August 1st, 2024 at 5:15 PM, Eli Zaretskii <eliz@gnu.org> wrote:

> > Date: Wed, 31 Jul 2024 21:24:46 +0000
> > From: Heime <heimeborgia@protonmail.com>
> >
> > I am using Unicode characters in my elisp code (e.g. foreign language
> > symbols in Icelandic and Spanish).
> >
> > Is the regexp [[:word:]] appropriate to capture them?
>
> No. [[:word:]] matches characters that have the word syntax, so which
> characters match depends on the major mode. My suggestion is to use
> either [[:alnum:]] or [[:alpha:]] instead, depending on whether you
> want or don't want to match digit characters.
>
> The meaning of each character class is documented in the "Char
> Classes" node of the ELisp Reference manual. I suggest reading it and
> choosing the most appropriate one for your needs.

It is difficult to determine from a character class, the actual character.
Is there a way to show the characters that are members of each class?

Thought that [:multibyte:] captured the Unicode characters. But even when
I applied (set-buffer-multibyte t) to the buffer, I did not get matches.

Does [:word:] mean word in the English language only?
* Re: Regexp capturing unicode characters
From: Eli Zaretskii @ 2024-08-01 12:10 UTC
To: help-gnu-emacs

> Date: Thu, 01 Aug 2024 11:26:40 +0000
> From: Heime <heimeborgia@protonmail.com>
> Cc: help-gnu-emacs@gnu.org
>
> On Thursday, August 1st, 2024 at 5:15 PM, Eli Zaretskii <eliz@gnu.org> wrote:
>
> > > Date: Wed, 31 Jul 2024 21:24:46 +0000
> > > From: Heime <heimeborgia@protonmail.com>
> > >
> > > I am using Unicode characters in my elisp code (e.g. foreign language
> > > symbols in Icelandic and Spanish).
> > >
> > > Is the regexp [[:word:]] appropriate to capture them?
> >
> > No. [[:word:]] matches characters that have the word syntax, so which
> > characters match depends on the major mode. My suggestion is to use
> > either [[:alnum:]] or [[:alpha:]] instead, depending on whether you
> > want or don't want to match digit characters.
> >
> > The meaning of each character class is documented in the "Char
> > Classes" node of the ELisp Reference manual. I suggest reading it and
> > choosing the most appropriate one for your needs.
>
> It is difficult to determine from a character class, the actual character.

Why do you need that? Don't you know which characters you'd like to
match?

> Is there a way to show the characters that are members of each class?

No, but you can check each character whether it matches a class.

> Thought that [:multibyte:] captured the Unicode characters. But even when
> I applied (set-buffer-multibyte t) to the buffer, I did not get matches.

Don't use [:multibyte:], it is hardly ever the right thing nowadays.

> Does [:word:] mean word in the English language only?

No, it means characters that have the word _syntax_. IOW, which
characters match depends on the major mode's syntax table. If you are
classifying characters from human-readable text, [:word:] is not the
right thing to use.
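The dependence of [:word:] on the syntax table can be seen by testing one character under two different tables. This is a sketch under stated assumptions: the helper my/word-char-p and the two tables are invented for the example, and the underscore merely stands in for any character whose syntax differs between major modes.

```elisp
(defun my/word-char-p (char table)
  "Return t if CHAR matches [[:word:]] while syntax TABLE is in effect.
Hypothetical helper, for illustration only."
  (with-syntax-table table
    (and (string-match-p "\\`[[:word:]]\\'" (char-to-string char)) t)))

(let ((symbol-table (make-syntax-table))  ; both inherit from the standard table
      (word-table (make-syntax-table)))
  ;; Give the underscore symbol syntax in one table, word syntax in the other.
  (modify-syntax-entry ?_ "_" symbol-table)
  (modify-syntax-entry ?_ "w" word-table)
  (list (my/word-char-p ?_ symbol-table)    ; => nil
        (my/word-char-p ?_ word-table)))    ; => t
```

The same regexp gives different answers for the same character, which is why a class tied to syntax is a poor fit for classifying human-readable text.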
* Re: Regexp capturing unicode characters
From: Heime @ 2024-08-01 13:43 UTC
To: Eli Zaretskii; +Cc: help-gnu-emacs

On Friday, August 2nd, 2024 at 12:10 AM, Eli Zaretskii <eliz@gnu.org> wrote:

> > Date: Thu, 01 Aug 2024 11:26:40 +0000
> > From: Heime <heimeborgia@protonmail.com>
> > Cc: help-gnu-emacs@gnu.org
> >
> > On Thursday, August 1st, 2024 at 5:15 PM, Eli Zaretskii <eliz@gnu.org> wrote:
> >
> > > > Date: Wed, 31 Jul 2024 21:24:46 +0000
> > > > From: Heime <heimeborgia@protonmail.com>
> > > >
> > > > I am using Unicode characters in my elisp code (e.g. foreign language
> > > > symbols in Icelandic and Spanish).
> > > >
> > > > Is the regexp [[:word:]] appropriate to capture them?
> > >
> > > No. [[:word:]] matches characters that have the word syntax, so which
> > > characters match depends on the major mode. My suggestion is to use
> > > either [[:alnum:]] or [[:alpha:]] instead, depending on whether you
> > > want or don't want to match digit characters.
> > >
> > > The meaning of each character class is documented in the "Char
> > > Classes" node of the ELisp Reference manual. I suggest reading it and
> > > choosing the most appropriate one for your needs.
> >
> > It is difficult to determine from a character class, the actual character.
>
> Why do you need that? Don't you know which characters you'd like to
> match?

No, because language insertion in emacs depends upon the user. But I want
to match foreign language characters mostly.

> > Is there a way to show the characters that are members of each class?
>
> No, but you can check each character whether it matches a class.

What is the function name for doing that? Can one scan the buffer and list
the matched character classes?

> > Thought that [:multibyte:] captured the Unicode characters. But even when
> > I applied (set-buffer-multibyte t) to the buffer, I did not get matches.
>
> Don't use [:multibyte:], it is hardly ever the right thing nowadays.

Can we update the manual with useful information such as this for
[:multibyte:], please?

> > Does [:word:] mean word in the English language only?
>
> No, it means characters that have the word syntax. IOW, which
> characters match depends on the major mode's syntax table. If you are
> classifying characters from human-readable text, [:word:] is not the
> right thing to use.

Can one show the syntax table? For me, just saying "word syntax table"
does not give enough information. Perhaps give more explanation in the
manual.
* Re: Regexp capturing unicode characters
From: Michael Heerdegen via Users list for the GNU Emacs text editor @ 2024-08-01 14:30 UTC
To: help-gnu-emacs

Heime <heimeborgia@protonmail.com> writes:

> Can one show the syntax table ?

Try C-h s.

Michael.
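C-h s runs the command describe-syntax, which displays the current buffer's syntax table in a help buffer. For a single character, the built-in char-syntax gives the same information programmatically. A small sketch, in which the helper name my/syntax-class-of is hypothetical:

```elisp
;; Interactive route: M-x describe-syntax (the command behind C-h s).

(defun my/syntax-class-of (char)
  "Return the syntax class designator of CHAR as a one-character string.
The current buffer's syntax table is used.  Hypothetical helper."
  (string (char-syntax char)))

;; Evaluate under the standard syntax table so the result does not
;; depend on whichever major mode happens to be active.
(with-syntax-table (standard-syntax-table)
  (list (my/syntax-class-of ?a)     ; "w", word constituent
        (my/syntax-class-of ?_)     ; "_", symbol constituent
        (my/syntax-class-of ?\s)))  ; " ", whitespace
```

The designator letters are the ones listed in the "Syntax Class Table" node of the ELisp manual.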
* Re: Regexp capturing unicode characters
From: Eli Zaretskii @ 2024-08-01 15:34 UTC
To: help-gnu-emacs

> Date: Thu, 01 Aug 2024 13:43:20 +0000
> From: Heime <heimeborgia@protonmail.com>
> Cc: help-gnu-emacs@gnu.org
>
> > Why do you need that? Don't you know which characters you'd like to
> > match?
>
> No, because language insertion in emacs depends upon the user. But I want
> to match foreign language characters mostly.

If by "foreign language characters" you mean letters and digits, then
[:alnum:] is what you want, as I already suggested. This covers all
the characters that are either letters or digits, in all the
languages.

> > > Is there a way to show the characters that are members of each class?
> >
> > No, but you can check each character whether it matches a class.
>
> What is the function name for doing that?

string-match-p if you have a string or looking-at-p if you have it in
the buffer.

> Can one scan the buffer and list the matched character classes?

Character classes overlap, so I'm not sure what kind of function you
want, and I don't think we have it anyway. It's usually the other way
around: the author of a Lisp program knows in advance what kinds of
characters the program needs to match, and uses a regexp which will do
the job.

> > > Thought that [:multibyte:] captured the Unicode characters. But even when
> > > I applied (set-buffer-multibyte t) to the buffer, I did not get matches.
> >
> > Don't use [:multibyte:], it is hardly ever the right thing nowadays.
>
> Can we update the manual with useful information such as this for
> [:multibyte:], please?

The useful information is already there (including a cross-reference
to a detailed description of what "multibyte" means). I just
translated it into simpler terms, based on what you told about the job
you want to do, to save you from the need to read that if you don't
want to.

> > > Does [:word:] mean word in the English language only?
> >
> > No, it means characters that have the word syntax. IOW, which
> > characters match depends on the major mode's syntax table. If you are
> > classifying characters from human-readable text, [:word:] is not the
> > right thing to use.
>
> Can one show the syntax table? For me, just saying "word syntax table"
> does not give enough information. Perhaps give more explanation in the
> manual.

The manual already does that: there's a cross-reference in the
description of [:word:] which leads to the node "Syntax Class Table",
which explains syntax tables in detail.
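Emacs has no built-in function that lists the classes a character belongs to, but one can be approximated with string-match-p by trying each class in turn. The following is only a sketch: my/char-class-names and my/classes-of-char are hypothetical names, and the list covers just some of the classes documented in the "Char Classes" node.

```elisp
(require 'seq)

(defconst my/char-class-names
  '("ascii" "nonascii" "multibyte" "alpha" "alnum" "digit"
    "punct" "space" "word" "graph" "print")
  "A subset of the bracket character classes in the ELisp manual.")

(defun my/classes-of-char (char)
  "Return the names in `my/char-class-names' whose class matches CHAR.
Classes such as [:word:] depend on the current syntax table."
  (let ((s (char-to-string char)))
    (seq-filter (lambda (name)
                  (string-match-p (format "[[:%s:]]" name) s))
                my/char-class-names)))

(my/classes-of-char ?ł)   ; includes "alpha", "alnum" and "multibyte"
(my/classes-of-char ?7)   ; includes "ascii", "alnum" and "digit"

;; The in-buffer equivalent uses looking-at-p at point.
(with-temp-buffer
  (insert "ñ")
  (goto-char (point-min))
  (looking-at-p "[[:alpha:]]"))   ; => t
```

Because the classes overlap, each character normally reports several names, which matches the remark above that there is no single class per character.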
* Re: Regexp capturing unicode characters
From: Heime @ 2024-08-01 17:06 UTC
To: Eli Zaretskii; +Cc: help-gnu-emacs

On Friday, August 2nd, 2024 at 3:34 AM, Eli Zaretskii <eliz@gnu.org> wrote:

> > Date: Thu, 01 Aug 2024 13:43:20 +0000
> > From: Heime <heimeborgia@protonmail.com>
> > Cc: help-gnu-emacs@gnu.org
> >
> > > Why do you need that? Don't you know which characters you'd like to
> > > match?
> >
> > No, because language insertion in emacs depends upon the user. But I want
> > to match foreign language characters mostly.
>
> If by "foreign language characters" you mean letters and digits, then
> [:alnum:] is what you want, as I already suggested. This covers all
> the characters that are either letters or digits, in all the
> languages.
>
> > > > Is there a way to show the characters that are members of each class?
> > >
> > > No, but you can check each character whether it matches a class.
> >
> > What is the function name for doing that?
>
> string-match-p if you have a string or looking-at-p if you have it in
> the buffer.
>
> > Can one scan the buffer and list the matched character classes?
>
> Character classes overlap, so I'm not sure what kind of function you
> want, and I don't think we have it anyway. It's usually the other way
> around: the author of a Lisp program knows in advance what kinds of
> characters the program needs to match, and uses a regexp which will do
> the job.

I want to include in the regexp the possibility that the user wrote some
comment in a foreign language other than English. Otherwise the regexp
would simply skip them. And your suggestion has been [:alpha:] and
[:alnum:].

> > > > Thought that [:multibyte:] captured the Unicode characters. But even when
> > > > I applied (set-buffer-multibyte t) to the buffer, I did not get matches.
> > >
> > > Don't use [:multibyte:], it is hardly ever the right thing nowadays.
> >
> > Can we update the manual with useful information such as this for
> > [:multibyte:], please?
>
> The useful information is already there (including a cross-reference
> to a detailed description of what "multibyte" means). I just
> translated it into simpler terms, based on what you told about the job
> you want to do, to save you from the need to read that if you don't
> want to.

A mention that [:multibyte:] is not used much nowadays.

> > > > Does [:word:] mean word in the English language only?
> > >
> > > No, it means characters that have the word syntax. IOW, which
> > > characters match depends on the major mode's syntax table. If you are
> > > classifying characters from human-readable text, [:word:] is not the
> > > right thing to use.
> >
> > Can one show the syntax table? For me, just saying "word syntax table"
> > does not give enough information. Perhaps give more explanation in the
> > manual.
>
> The manual already does that: there's a cross-reference in the
> description of [:word:] which leads to the node "Syntax Class Table",
> which explains syntax tables in detail.
* Re: Regexp capturing unicode characters
From: Eli Zaretskii @ 2024-08-01 17:46 UTC
To: help-gnu-emacs

> Date: Thu, 01 Aug 2024 17:06:26 +0000
> From: Heime <heimeborgia@protonmail.com>
> Cc: help-gnu-emacs@gnu.org
>
> On Friday, August 2nd, 2024 at 3:34 AM, Eli Zaretskii <eliz@gnu.org> wrote:
>
> I want to include in the regexp the possibility that the user wrote some
> comment in a foreign language other than English. Otherwise the regexp
> would simply skip them. And your suggestion has been [:alpha:] and
> [:alnum:].

Once again, [:alpha:] and [:alnum:] will match letters and digits in
any language, not just in English.

> > The useful information is already there (including a cross-reference
> > to a detailed description of what "multibyte" means). I just
> > translated it into simpler terms, based on what you told about the job
> > you want to do, to save you from the need to read that if you don't
> > want to.
>
> A mention that [:multibyte:] is not used much nowadays.

That's not what I said. I said it is almost never the right thing
nowadays, especially in your case.

I'm trying to help you by saying simplified things. The manual
doesn't simplify, because it's a reference.
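Applied to the imenu expression from earlier in the thread, the advice amounts to swapping the two classes. The sketch below is one possible reading of that advice and not code posted by anyone in the thread; the constant names my/denotes-regexp and my/denotes-imenu-entry are hypothetical.

```elisp
(defconst my/denotes-regexp
  (concat "^;;\\s-+"
          "\\([[:alnum:]]+\\)\\s-+"          ; group 1: the tag, e.g. DN
          "\\[\\([[:alpha:]]+\\)\\]\\s-+")   ; group 2: the bracketed word
  "Regexp for comment lines such as \";; DN [główny] ...\".")

;; An element in the (TITLE REGEXP INDEX) form that
;; `imenu-generic-expression' expects.
(defconst my/denotes-imenu-entry
  (list "Denotes" my/denotes-regexp 2))

(let ((line ";; DN [główny] Sgn(Major), Lexik(Polish)."))
  (when (string-match my/denotes-regexp line)
    (list (match-string 1 line)      ; => "DN"
          (match-string 2 line))))   ; => "główny"
```

Using [[:alpha:]] inside the brackets accepts both the ASCII and the non-ASCII letters of the Polish word, so the index entry is captured whole.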
* Regexp capturing unicode characters
From: Heime @ 2024-08-01 19:44 UTC
To: Eli Zaretskii; +Cc: help-gnu-emacs

On Friday, August 2nd, 2024 at 5:46 AM, Eli Zaretskii <eliz@gnu.org> wrote:

> > Date: Thu, 01 Aug 2024 17:06:26 +0000
> > From: Heime <heimeborgia@protonmail.com>
> > Cc: help-gnu-emacs@gnu.org
> >
> > On Friday, August 2nd, 2024 at 3:34 AM, Eli Zaretskii <eliz@gnu.org> wrote:
> >
> > I want to include in the regexp the possibility that the user wrote some
> > comment in a foreign language other than English. Otherwise the regexp
> > would simply skip them. And your suggestion has been [:alpha:] and
> > [:alnum:].
>
> Once again, [:alpha:] and [:alnum:] will match letters and digits in
> any language, not just in English.
>
> > > The useful information is already there (including a cross-reference
> > > to a detailed description of what "multibyte" means). I just
> > > translated it into simpler terms, based on what you told about the job
> > > you want to do, to save you from the need to read that if you don't
> > > want to.
> >
> > A mention that [:multibyte:] is not used much nowadays.
>
> That's not what I said. I said it is almost never the right thing
> nowadays, especially in your case.
>
> I'm trying to help you by saying simplified things. The manual
> doesn't simplify, because it's a reference.

Would [:graph:] be the most powerful?

In "34.2 Disabling Multibyte Characters", it is stated

  "Multibyte mode allows you to use all the supported languages
  and scripts without limitations."

Yet you say that it is never the right thing especially in my case.
Where in my case I want to support languages without limitations.

I did not find the reference is enough to decide what is appropriate
to use for languages without limitations, or for specific languages.
Mainly because I would not know what the classes include exactly.

Have read

  34.1 Text Representations
  34.7 Character Sets
  36.2.1 Table of Syntax Classes

and

  35.3.1.1 Special Characters in Regular Expressions
  35.3.1.2 Character Classes
  35.3.1.3 Backslash Constructs in Regular Expressions

Would I have missed other things important to the discussion?
* Re: Regexp capturing unicode characters
From: Eli Zaretskii @ 2024-08-02 5:44 UTC
To: help-gnu-emacs

> Date: Thu, 01 Aug 2024 19:44:18 +0000
> From: Heime <heimeborgia@protonmail.com>
> Cc: help-gnu-emacs@gnu.org
>
> On Friday, August 2nd, 2024 at 5:46 AM, Eli Zaretskii <eliz@gnu.org> wrote:
>
> > Once again, [:alpha:] and [:alnum:] will match letters and digits in
> > any language, not just in English.
> >
> > > > The useful information is already there (including a cross-reference
> > > > to a detailed description of what "multibyte" means). I just
> > > > translated it into simpler terms, based on what you told about the job
> > > > you want to do, to save you from the need to read that if you don't
> > > > want to.
> > >
> > > A mention that [:multibyte:] is not used much nowadays.
> >
> > That's not what I said. I said it is almost never the right thing
> > nowadays, especially in your case.
> >
> > I'm trying to help you by saying simplified things. The manual
> > doesn't simplify, because it's a reference.
>
> Would [:graph:] be the most powerful?

[:graph:] includes punctuation and other symbols, which AFAIU you
don't want to match (since you thought [:word:] is what you need).

> In "34.2 Disabling Multibyte Characters", it is stated
>
>   "Multibyte mode allows you to use all the supported languages
>   and scripts without limitations."

That's not really relevant to the issue at hand. Yes, multibyte
characters are needed to support all the languages. No, that doesn't
mean you need to use [:multibyte:], because that will match
punctuation, symbols, non-ASCII control and whitespace characters,
etc., and you don't want that. OTOH, [:multibyte:] doesn't match
ASCII letters and digits, and you certainly do want to match them.

> Yet you say that it is never the right thing especially in my case.
> Where in my case I want to support languages without limitations.

Yes, and [:alpha:] and [:alnum:] support languages without
limitations. As I already told you several times.

> I did not find the reference is enough to decide what is appropriate
> to use for languages without limitations, or for specific languages.
> Mainly because I would not know what the classes include exactly.

I tried to help you with specific advice, but you insist on not
listening. So this will be my last message in this thread.
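The points about [:multibyte:] and [:graph:] can be checked with a few one-line tests. They are an illustrative sketch with made-up sample strings, and they also suggest why the regexp posted at the start of the thread found no match for "główny": four of its six letters are plain ASCII.

```elisp
(list
 ;; [:multibyte:] rejects ASCII letters and accepts any non-ASCII
 ;; character, punctuation included.
 (string-match-p "[[:multibyte:]]" "g")               ; => nil
 (string-match-p "[[:multibyte:]]" "ł")               ; => 0
 (string-match-p "[[:multibyte:]]" "¿")               ; => 0
 ;; A word mixing ASCII and non-ASCII letters is therefore not matched
 ;; as a whole by [:multibyte:], while [:alpha:] matches all of it.
 (string-match-p "\\`[[:multibyte:]]+\\'" "główny")   ; => nil
 (string-match-p "\\`[[:alpha:]]+\\'" "główny")       ; => 0
 ;; [:graph:] is broader than wanted: it also accepts punctuation.
 (string-match-p "[[:graph:]]" "!")                   ; => 0
 (string-match-p "[[:alpha:]]" "!"))                  ; => nil
```

A return value of 0 is the position of the match and counts as success; nil means no match.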
* Re: Regexp capturing unicode characters
From: uzibalqa @ 2024-08-02 8:03 UTC
To: Eli Zaretskii; +Cc: help-gnu-emacs

On Friday, August 2nd, 2024 at 5:44 PM, Eli Zaretskii <eliz@gnu.org> wrote:

> > Date: Thu, 01 Aug 2024 19:44:18 +0000
> > From: Heime <heimeborgia@protonmail.com>
> > Cc: help-gnu-emacs@gnu.org
> >
> > On Friday, August 2nd, 2024 at 5:46 AM, Eli Zaretskii <eliz@gnu.org> wrote:
> >
> > > Once again, [:alpha:] and [:alnum:] will match letters and digits in
> > > any language, not just in English.
> > >
> > > > > The useful information is already there (including a cross-reference
> > > > > to a detailed description of what "multibyte" means). I just
> > > > > translated it into simpler terms, based on what you told about the job
> > > > > you want to do, to save you from the need to read that if you don't
> > > > > want to.
> > > >
> > > > A mention that [:multibyte:] is not used much nowadays.
> > >
> > > That's not what I said. I said it is almost never the right thing
> > > nowadays, especially in your case.
> > >
> > > I'm trying to help you by saying simplified things. The manual
> > > doesn't simplify, because it's a reference.
> >
> > Would [:graph:] be the most powerful?
>
> [:graph:] includes punctuation and other symbols, which AFAIU you
> don't want to match (since you thought [:word:] is what you need).
>
> > In "34.2 Disabling Multibyte Characters", it is stated
> >
> >   "Multibyte mode allows you to use all the supported languages
> >   and scripts without limitations."
>
> That's not really relevant to the issue at hand. Yes, multibyte
> characters are needed to support all the languages. No, that doesn't
> mean you need to use [:multibyte:], because that will match
> punctuation, symbols, non-ASCII control and whitespace characters,
> etc., and you don't want that. OTOH, [:multibyte:] doesn't match
> ASCII letters and digits, and you certainly do want to match them.
>
> > Yet you say that it is never the right thing especially in my case.
> > Where in my case I want to support languages without limitations.
>
> Yes, and [:alpha:] and [:alnum:] support languages without
> limitations. As I already told you several times.
>
> > I did not find the reference is enough to decide what is appropriate
> > to use for languages without limitations, or for specific languages.
> > Mainly because I would not know what the classes include exactly.
>
> I tried to help you with specific advice, but you insist on not
> listening. So this will be my last message in this thread.

I listen, but I also wanted the reasoning, so that I can reach the same
conclusion. I accept the elaboration, which I could not conclude by
myself based only on the manual descriptions. That was all it was about.

Then there is ".*", which is very broad and flexible. It can match spaces,
punctuation, and special characters. Would this constitute the broadest
thing? I am using it to pick up anything.
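On the closing question: in Emacs regexps "." matches any single character except a newline, so ".*" is indeed the broadest single-line pattern, but it is greedy and will run past delimiters. The sketch below is not from the thread; my/first-match is a hypothetical helper, and the negated bracket expression is offered only as one possible alternative for text between square brackets.

```elisp
(defun my/first-match (regexp string &optional group)
  "Return the text of GROUP (default 0) for the first match of REGEXP.
Return nil if REGEXP does not match STRING.  Hypothetical helper."
  (when (string-match regexp string)
    (match-string (or group 0) string)))

;; ".*" stops at the end of the line, since "." never matches a newline.
(my/first-match ".*" ";; DN [główny] Sgn(Major).\nnext line")
;; => ";; DN [główny] Sgn(Major)."

;; Between brackets, greedy ".*" extends to the last closing bracket ...
(my/first-match "\\[\\(.*\\)\\]" "[a] and [b]" 1)         ; => "a] and [b"

;; ... whereas "anything except ] or newline" stops at the first one.
(my/first-match "\\[\\([^]\n]+\\)\\]" "[a] and [b]" 1)    ; => "a"
(my/first-match "\\[\\([^]\n]+\\)\\]" ";; DN [główny] x" 1) ; => "główny"
```

The negated form accepts any script and any punctuation inside the brackets while still respecting the closing delimiter.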
end of thread (newest message: 2024-08-02 8:03 UTC)

Thread overview: 13+ messages
2024-07-31 21:24 Regexp capturing unicode characters Heime
2024-07-31 21:50 ` Heime
2024-08-01  5:15 ` Eli Zaretskii
2024-08-01 11:26   ` Heime
2024-08-01 12:10     ` Eli Zaretskii
2024-08-01 13:43       ` Heime
2024-08-01 14:30         ` Michael Heerdegen via Users list for the GNU Emacs text editor
2024-08-01 15:34         ` Eli Zaretskii
2024-08-01 17:06           ` Heime
2024-08-01 17:46             ` Eli Zaretskii
2024-08-01 19:44               ` Heime
2024-08-02  5:44                 ` Eli Zaretskii
2024-08-02  8:03                   ` uzibalqa