* Re: Parsing of multibyte strings from process output
From: Helmut Eller @ 2018-05-08 11:00 UTC
To: help-gnu-emacs
On Tue, May 08 2018, Michael Albinus wrote:
> Hi,
>
> I call a local process ("gio list ...", to name it), which returns utf8
> multibyte codes like
>
> --8<---------------cut here---------------start------------->8---
> standard::symlink-target=/home/albinus/tmp/\xc2\x9abung
> --8<---------------cut here---------------end--------------->8---
>
> The bytes "\xc2\x9a" stand for the multibyte char ?\x9a.
The UTF-8 byte sequence \xc2\x9a decodes to U+009A, which is a control character.
Maybe the byte sequence \xc3\x9c would make a better example as that
corresponds to Ü (LATIN CAPITAL LETTER U WITH DIAERESIS).
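A quick way to check which character a byte pair encodes is to decode it and look at the resulting code point. This is only a minimal sketch; `my-utf8-code-point' is a hypothetical helper name, and the hex escapes in the literals produce raw bytes (unibyte strings):

```elisp
;; Hypothetical helper: decode BYTES as UTF-8 and return the code
;; point of the first resulting character.
(defun my-utf8-code-point (bytes)
  "Return the code point of the first character decoded from BYTES."
  (string-to-char (decode-coding-string bytes 'utf-8)))

(my-utf8-code-point "\xc2\x9a") ; => 154 (#x9a, a C1 control character)
(my-utf8-code-point "\xc3\x9c") ; => 220 (#xdc, Ü)
```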
> However, I
> don't know how to parse it so that I can retrieve that character. Everything
> I have tried always returns the *two* characters ?\xc2 ?\x9a, multibyte
> encoded. How could I get just the multibyte character ?\x9a from this?
You could use (set-process-coding-system <proc> 'utf-8) if you know that
all of the process's output is indeed utf-8 encoded.
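As a rough sketch of that first option, the function below starts a process, sets its coding system, and collects the decoded output. The function name `my-utf8-process-demo' is hypothetical, and printf merely stands in for the real command: it emits the genuine UTF-8 bytes C3 9C rather than a textual escape.

```elisp
(defun my-utf8-process-demo ()
  "Run printf, decode its output as UTF-8, and return it as a string.
Hypothetical demo; printf stands in for the real command."
  (with-temp-buffer
    (let ((proc (make-process
                 :name "utf8-demo"
                 :buffer (current-buffer)
                 ;; printf turns the octal escapes into the bytes C3 9C.
                 :command '("printf" "\\303\\234bung")
                 :connection-type 'pipe
                 :noquery t
                 ;; Keep the default sentinel from writing a status
                 ;; message into the buffer.
                 :sentinel #'ignore)))
      (set-process-coding-system proc 'utf-8 'utf-8)
      ;; Wait until the process has exited and all output is read.
      (while (accept-process-output proc))
      (buffer-string))))
```

On a system with a POSIX printf this should return the string Übung. Note that it only helps when the process writes real UTF-8 bytes; it does nothing for a literal backslash-x escape sequence in the output.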
Alternatively, you could use 'binary as the coding system and manually call
decode-coding-string on the parts that are utf-8 encoded. However, keep
in mind that "raw bytes" in multibyte strings have char codes in the
range #x3FFF80..#x3FFFFF.
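To make the raw-byte representation visible, here is a small sketch that converts a unibyte string to multibyte, lists the resulting char codes, and then decodes it. The name `my-raw-byte-demo' is hypothetical.

```elisp
(defun my-raw-byte-demo ()
  "Return the char codes of two raw bytes and their UTF-8 decoding."
  ;; `string-to-multibyte' keeps the bytes as raw-byte characters.
  (let ((raw (string-to-multibyte "\xc3\x9c")))
    (list (mapcar (lambda (c) (format "#x%X" c)) raw)
          (decode-coding-string raw 'utf-8))))

(my-raw-byte-demo)
;; => (("#x3FFFC3" "#x3FFF9C") "Ü")
```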
If you want even more confusion: you could set up the process so that it
generates unibyte strings and then use decode-coding-string to create
the multibyte string.
> I know that (decode-coding-string "/home/albinus/tmp/\xc2\x9a\ bung" 'utf-8)
> does what I want. But here, the string is a string *constant*, which
> allows writing characters in hex syntax. When I read the string from
> the output buffer (after including the trailing "\ "), this does not work.
Remember, if a hexadecimal or octal escape sequence occurs in a string
literal, then the string automatically becomes a unibyte string:
(multibyte-string-p "\xc3\x9c") => nil
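The difference between the string constant and the text read from the buffer can be seen from the lengths. In this sketch the first literal holds two raw bytes, while the second holds the eight ordinary characters that the process actually printed (backslash, x, c, 3, and so on):

```elisp
;; Hex escapes are interpreted by the Lisp reader: two raw bytes.
(length "\xc3\x9c")    ; => 2

;; Escaped backslashes: this is the text as it appears in the
;; process output, eight plain ASCII characters.
(length "\\xc3\\x9c")  ; => 8
```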
Also consider these examples:
(decode-coding-string "\xc3\x9c" 'utf-8) => "Ü"
(decode-coding-string (string #xc3 #x9c) 'utf-8) => "Ã\234"
(decode-coding-string (string #x3FFFc3 #x3FFF9c) 'utf-8) => "Ü"
Helmut
* Re: Parsing of multibyte strings from process output
From: Michael Albinus @ 2018-05-08 12:01 UTC
To: Helmut Eller; +Cc: help-gnu-emacs
Helmut Eller <eller.helmut@gmail.com> writes:
Hi Helmut,
>> However, I don't know how to parse it so that I can retrieve that
>> character. Everything I have tried always returns the *two* characters
>> ?\xc2 ?\x9a, multibyte encoded. How could I get just the multibyte
>> character ?\x9a from this?
>
> You could use (set-process-coding-system <proc> 'utf-8) if you know that
> all of the process's output is indeed utf-8 encoded.
I've done this already, for other purposes. But it doesn't help, the
string /home/albinus/tmp/\xc2\x9abung is written literally into the
output buffer.
> Alternatively, you could use 'binary as the coding system and manually call
> decode-coding-string on the parts that are utf-8 encoded. However, keep
> in mind that "raw bytes" in multibyte strings have char codes in the
> range #x3FFF80..#x3FFFFF.
I tried that, with no luck. But I didn't know that "raw" bytes are in
that range.
> (decode-coding-string (string #x3FFFc3 #x3FFF9c) 'utf-8) => "Ü"
That's it! The following code works for me (res-symlink-target keeps the
file name from process output, as shown above):
--8<---------------cut here---------------start------------->8---
(setq res-symlink-target
      ;; Parse multibyte codings.
      (decode-coding-string
       (replace-regexp-in-string
        "\\\\x\\([[:xdigit:]]\\{2\\}\\)"
        (lambda (x)
          (string
           (string-to-number (concat "3FFF" (match-string 1 x)) 16)))
        res-symlink-target)
       'utf-8))
--8<---------------cut here---------------end--------------->8---
Thanks a lot!
> Helmut
Best regards, Michael.
* Re: Parsing of multibyte strings from process output
From: Noam Postavsky @ 2018-05-08 12:21 UTC
To: Michael Albinus; +Cc: Help Gnu Emacs mailing list, Helmut Eller
On 8 May 2018 at 08:01, Michael Albinus <michael.albinus@gmx.de> wrote:
> --8<---------------cut here---------------start------------->8---
> (setq res-symlink-target
>       ;; Parse multibyte codings.
>       (decode-coding-string
>        (replace-regexp-in-string
>         "\\\\x\\([[:xdigit:]]\\{2\\}\\)"
>         (lambda (x)
>           (string
>            (string-to-number (concat "3FFF" (match-string 1 x)) 16)))
>         res-symlink-target)
>        'utf-8))
> --8<---------------cut here---------------end--------------->8---
I think making a unibyte string would fit better.
(let ((res-symlink-target "/home/albinus/tmp/\\xc2\\x9abung"))
  (setq res-symlink-target (encode-coding-string
                            res-symlink-target
                            'us-ascii t))
  (decode-coding-string
   (replace-regexp-in-string
    "\\\\x\\([[:xdigit:]]\\{2\\}\\)"
    (lambda (x)
      (unibyte-string
       (string-to-number (match-string 1 x) 16)))
    res-symlink-target)
   'utf-8))
Although, both methods give me "/home/albinus/tmp/\232bung", so I
might be missing something...
* Re: Parsing of multibyte strings from process output
From: Michael Albinus @ 2018-05-08 12:47 UTC
To: Noam Postavsky; +Cc: Help Gnu Emacs mailing list, Helmut Eller
Noam Postavsky <npostavs@gmail.com> writes:
> I think making a unibyte string would fit better.
>
> (let ((res-symlink-target "/home/albinus/tmp/\\xc2\\x9abung"))
>   (setq res-symlink-target (encode-coding-string
>                             res-symlink-target
>                             'us-ascii t))
>   (decode-coding-string
>    (replace-regexp-in-string
>     "\\\\x\\([[:xdigit:]]\\{2\\}\\)"
>     (lambda (x)
>       (unibyte-string
>        (string-to-number (match-string 1 x) 16)))
>     res-symlink-target)
>    'utf-8))
>
> Although, both methods give me "/home/albinus/tmp/\232bung", so I
> might be missing something...
Both methods do the same thing, I believe. But I like your snippet much
better than mine, so I will use it (if you don't object).
Your first part (encoding in us-ascii) doesn't seem to be necessary in
my case, because the gio tool prints all non-ascii characters of the
symbolic link file name in the syntax I've shown.
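For reference, the variant without the us-ascii step could be wrapped up roughly as follows. This is only a sketch: the function name `my-decode-hex-escapes' is hypothetical, and the two `t' arguments (FIXEDCASE and LITERAL) are an addition not present in the snippets above, so that a replacement byte such as a backslash is inserted as-is.

```elisp
(defun my-decode-hex-escapes (string)
  "Turn backslash-x escapes in STRING into bytes and decode as UTF-8.
Hypothetical helper, sketched from the snippets in this thread."
  (decode-coding-string
   (replace-regexp-in-string
    "\\\\x\\([[:xdigit:]]\\{2\\}\\)"
    (lambda (match)
      ;; Each escape becomes a one-byte unibyte string.
      (unibyte-string (string-to-number (match-string 1 match) 16)))
    string t t)
   'utf-8))

(my-decode-hex-escapes "/home/albinus/tmp/\\xc3\\x9cbung")
;; => "/home/albinus/tmp/Übung"
```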
Thanks, and best regards, Michael.
* Parsing of multibyte strings from process output
From: Michael Albinus @ 2018-05-08 10:02 UTC
To: help-gnu-emacs
Hi,
I call a local process ("gio list ...", to name it), which returns utf8
multibyte codes like
--8<---------------cut here---------------start------------->8---
standard::symlink-target=/home/albinus/tmp/\xc2\x9abung
--8<---------------cut here---------------end--------------->8---
The bytes "\xc2\x9a" stand for the multibyte char ?\x9a. However, I
don't know how to parse it so that I can retrieve that character.
Everything I have tried always returns the *two* characters ?\xc2 ?\x9a,
multibyte encoded. How could I get just the multibyte character ?\x9a
from this?
I know that (decode-coding-string "/home/albinus/tmp/\xc2\x9a\ bung" 'utf-8)
does what I want. But here, the string is a string *constant*, which
allows writing characters in hex syntax. When I read the string from
the output buffer (after including the trailing "\ "), this does not work.
Best regards, Michael.