* Strange whitespace remains after emoji regexp replace
From: Jean Louis @ 2024-12-25 11:38 UTC
To: Help GNU Emacs
There is this function:
(defun wrs-search-clean-entry (entry)
  "Clean and normalize a ENTRY string.
Prepare it for easier searching"
  (let* ((entry (replace-regexp-in-string
                 (rx (one-or-more (or (not alnum) "\n" blank))) " " entry))
         (entry (replace-regexp-in-string (rx (one-or-more " ")) " " entry))
         (string-trim entry))
    entry))
The emoji below apparently leaves behind a strange, wide whitespace. I
do not know whether anybody can see it; it is invisible, though it
comes right after the first quote in the result:
(wrs-search-clean-entry "☺️ )(**(&&^%^$##@!))") ➜ " ️ "
It is at the position marked with X below:
(wrs-search-clean-entry "☺️ )(**(&&^%^$##@!))") ➜ "X "
M-x describe-char
gives me:
position: 800 of 923 (87%), column: 50
character: SPC (displayed as SPC) (codepoint 32, #o40, #x20)
charset: ascii (ASCII (ISO646 IRV))
code point in charset: 0x20
script: latin
syntax: which means: whitespace
category: .:Base, a:ASCII, l:Latin
to input: type "C-x 8 RET 20" or "C-x 8 RET SPACE"
buffer code: #x20
file code: not encodable by coding system nil
display: composed to form " ️" (see below)
Composed with the following character(s) "️" using this font:
ftcrhb:-GOOG-Noto Color Emoji-regular-normal-normal-*-23-*-*-*-m-0-iso10646-1
by these glyphs:
[0 1 32 3 29 0 0 0 0 nil]
[0 1 65039 3 29 0 0 0 0 [0 0 0]]
with these character(s):
️ (#xfe0f) VARIATION SELECTOR-16
Character code properties: customize what to show
name: SPACE
general-category: Zs (Separator, Space)
decomposition: (32) (' ')
There are text properties here:
fontified t
The difference from a normal space is that it is composed with a
(#xfe0f) VARIATION SELECTOR-16 character.
But I don't want it. I want to clean EVERYTHING that is not
alphanumeric from the string.
How do I make sure of it?
Jean Louis
* Re: Strange whitespace remains after emoji regexp replace
From: Eli Zaretskii @ 2024-12-25 12:51 UTC
To: help-gnu-emacs
> Date: Wed, 25 Dec 2024 14:38:14 +0300
> From: Jean Louis <bugs@gnu.support>
>
> The difference from a normal space is that it is composed with a
> (#xfe0f) VARIATION SELECTOR-16 character.
>
> But I don't want it. I want to clean EVERYTHING that is not
> alphanumeric from the string.
>
> How do I make sure of it?
Remove the VS-16 character as well, how else?
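For illustration, a minimal sketch of how to see what is left in the
string and then drop it. The likely cause, judging from the result
shown in this thread, is that the `alnum' class also accepts combining
marks (general category Mn), which is what VS-16 is, so `(not alnum)'
never matches it. `my-string-char-categories' is a hypothetical helper
name, not an Emacs function.

```elisp
;; Hypothetical helper (not part of Emacs): pair each character of STRING
;; with its Unicode general category, to reveal invisible characters.
(defun my-string-char-categories (string)
  "Return a list of (CODEPOINT GENERAL-CATEGORY) for each char in STRING."
  (mapcar (lambda (ch)
            (list ch (get-char-code-property ch 'general-category)))
          (string-to-list string)))

;; U+263A is the smiley, U+FE0F is VARIATION SELECTOR-16.
(my-string-char-categories "\u263A\uFE0F")
;; ➜ ((9786 So) (65039 Mn))

;; Deleting VS-16 explicitly:
(replace-regexp-in-string "\uFE0F" "" "a\uFE0Fb")
;; ➜ "ab"
```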
* SOLVED - Re: Strange whitespace remains after emoji regexp replace
From: Jean Louis @ 2024-12-25 13:44 UTC
To: Eli Zaretskii; +Cc: help-gnu-emacs
* Eli Zaretskii <eliz@gnu.org> [2024-12-25 15:53]:
> > The difference from a normal space is that it is composed with a
> > (#xfe0f) VARIATION SELECTOR-16 character.
> >
> > But I don't want it. I want to clean EVERYTHING that is not
> > alphanumeric from the string.
> >
> > How do I make sure of it?
>
> Remove the VS-16 character as well, how else?
Natural for you, a mystery for me, but thanks 🙏👍💬, I got it now
after all.
I found the answer online: it is the special character 📝 \uFE0F.
Without online assistance I would never have found it. 💻
(defun wrs-search-clean-entry (entry)
  "Clean and normalize an ENTRY string.
Prepare it for easier searching."
  (let* (;; Replace runs of characters not matched by `alnum' with a space.
         (entry (replace-regexp-in-string
                 (rx (one-or-more (or (not alnum) "\n" blank))) " " entry))
         ;; Replace zero-width characters and VARIATION SELECTOR-16.
         (entry (replace-regexp-in-string
                 "[\u200B-\u200D\uFEFF\uFE0F]" " " entry))
         ;; Collapse runs of spaces, then trim both ends.
         (entry (replace-regexp-in-string (rx (one-or-more " ")) " " entry))
         (entry (string-trim entry)))
    entry))
It is \uFE0F
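An alternative sketch, assuming the goal is to keep only letters and
decimal digits: instead of listing invisible characters one by one,
test each character's Unicode general category. `my-letter-or-digit-p'
and `my-clean-entry-strict' are hypothetical names, and the category
list is an assumption about what should count as alphanumeric.

```elisp
;; Hypothetical predicate: letters (Lu Ll Lt Lm Lo) and decimal digits (Nd).
;; Combining marks such as VS-16 (Mn) are deliberately excluded.
(defun my-letter-or-digit-p (ch)
  "Return non-nil if CH is a letter or decimal digit by general category."
  (memq (get-char-code-property ch 'general-category)
        '(Lu Ll Lt Lm Lo Nd)))

;; Hypothetical stricter variant of the cleaning function.
(defun my-clean-entry-strict (entry)
  "Return ENTRY with each run of non-letter, non-digit chars as one space."
  (let* (;; Map every unwanted character to a plain space.
         (spaced (concat (mapcar (lambda (ch)
                                   (if (my-letter-or-digit-p ch) ch ?\s))
                                 entry)))
         ;; Splitting on spaces with OMIT-NULLS drops empty pieces, which
         ;; collapses runs of spaces and trims both ends in one step.
         (words (split-string spaced " " t)))
    (mapconcat #'identity words " ")))

(my-clean-entry-strict "\u263A\uFE0F )(**(&&^%^$##@!))")
;; ➜ ""
(my-clean-entry-strict "foo\uFE0F  bar!")
;; ➜ "foo bar"
```

This also strips combining marks in general, so it would damage text
that relies on them (decomposed accents, for example). Separately, the
first version of the function listed `(string-trim entry)' directly in
the `let*' bindings, which binds a variable named `string-trim' instead
of calling the function; the version above binds `entry' to the trimmed
result.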
Problem solved for now! 🙌
Thanks much! 😊
Jean Louis