Parsing of multibyte strings from process output

unofficial mirror of help-gnu-emacs@gnu.org

* Parsing of multibyte strings from process output
@ 2018-05-08 10:02 Michael Albinus
  0 siblings, 0 replies; 5+ messages in thread
From: Michael Albinus @ 2018-05-08 10:02 UTC (permalink / raw)
  To: help-gnu-emacs

Hi,

I call a local process ("gio list ...", to name it), which returns utf8
multibyte codes like

--8<---------------cut here---------------start------------->8---
standard::symlink-target=/home/albinus/tmp/\xc2\x9abung
--8<---------------cut here---------------end--------------->8---

The bytes "\xc2\x9a" stand for the multibyte char ?\x9a. However, I
don't know how to parse the output so that I can retrieve that char.
Everything I have tried returns the *two* characters ?\xc2 ?\x9a,
multibyte encoded. How can I get just the multibyte character ?\x9a
from this?

I know that (decode-coding-string "/home/albinus/tmp/\xc2\x9a\ bung" 'utf-8)
does what I want. But here, the string is a string *constant*, which
allows characters to be written in hex syntax. When I read the string
from the output buffer (after including the trailing "\ "), this does
not work.
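
The difference between the two cases can be sketched as follows. This
is only an illustration: it assumes the process output contains a
literal backslash followed by "x" and two hex digits, whereas in a
string constant the Lisp reader has already turned the escapes into
bytes.

```elisp
;; String constant: the reader converts each \xNN escape into one byte.
(length "\xc2\x9a")        ; => 2

;; Text as it appears in the output buffer: eight ordinary ASCII
;; characters (backslash, x, c, 2, backslash, x, 9, a).
(length "\\xc2\\x9a")      ; => 8
(append "\\xc2" nil)       ; => (92 120 99 50)

;; Decoding the ASCII text changes nothing, since there are no
;; non-ASCII bytes to decode.
(decode-coding-string "\\xc2\\x9a" 'utf-8)   ; => "\\xc2\\x9a"
```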

Best regards, Michael.


* Re: Parsing of multibyte strings from process output
       [not found] <mailman.13544.1525773753.27995.help-gnu-emacs@gnu.org>
@ 2018-05-08 11:00 ` Helmut Eller
  2018-05-08 12:01   ` Michael Albinus
  0 siblings, 1 reply; 5+ messages in thread
From: Helmut Eller @ 2018-05-08 11:00 UTC (permalink / raw)
  To: help-gnu-emacs

On Tue, May 08 2018, Michael Albinus wrote:

> Hi,
>
> I call a local process ("gio list ...", to name it), which returns utf8
> multibyte codes like
>
> --8<---------------cut here---------------start------------->8---
> standard::symlink-target=/home/albinus/tmp/\xc2\x9abung
> --8<---------------cut here---------------end--------------->8---
>
> The bytes "\xc2\x9a" stand for the multibyte char ?\x9a.

The UTF-8 byte sequence \xc2\x9a encodes a control character (U+009A).

Maybe the byte sequence \xc3\x9c would make a better example as that
corresponds to Ü (LATIN CAPITAL LETTER U WITH DIAERESIS).

> However, I
> don't know how to parse it that I could retrieve it. All what I have
> tried returns always the *two* characters ?\xc2 ?\x9a, multibyte
> encoded. How could I get just the multibyte character ?\x9a from this?

You could use (set-process-coding-system <proc> 'utf-8) if you know that
all the output of the process is indeed utf-8 encoded.
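
A minimal sketch of that approach, under stated assumptions: the
function name and the demo command are made up for illustration, and a
POSIX "printf" is assumed to be on the PATH so that the process really
emits the two UTF-8 bytes of "Ü". It only helps when the process
writes real UTF-8 bytes, not textual escapes.

```elisp
;; Hypothetical demo function, not part of any library.
(defun my-demo-utf8-process-output ()
  "Return the UTF-8 decoded output of a short-lived demo process."
  (with-temp-buffer
    (let ((proc (make-process
                 :name "utf8-demo"
                 :buffer (current-buffer)
                 ;; printf turns the octal escapes into bytes C3 9C.
                 :command '("printf" "\\303\\234")
                 :connection-type 'pipe
                 ;; Keep the default sentinel from inserting a
                 ;; status message into the buffer.
                 :sentinel #'ignore)))
      ;; Decode everything the process writes as UTF-8.
      (set-process-coding-system proc 'utf-8 'utf-8)
      ;; Wait until the process has exited and all output was read.
      (while (accept-process-output proc))
      (buffer-string))))

(my-demo-utf8-process-output)   ; => "Ü"
```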

Alternatively, you could use 'binary as coding system and manually call
decode-coding-string on the parts that are utf-8 encoded.  However, keep
in mind that "raw bytes" in multibyte strings have char codes in the
range #x3FFF80..#x3FFFFF (the byte value plus #x3FFF00).
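
To make the raw-byte mapping concrete, here is a short sketch using
conversion functions that ship with Emacs. The byte #x9c is just an
example value.

```elisp
;; A byte in #x80..#xFF maps to the character code byte + #x3FFF00.
(unibyte-char-to-multibyte #x9c)        ; => 4194204, i.e. #x3FFF9C
(multibyte-char-to-unibyte #x3FFF9C)    ; => 156, i.e. #x9c

;; Converting a unibyte string to multibyte applies the same mapping.
(aref (string-to-multibyte "\x9c") 0)   ; => 4194204

;; Such characters belong to the charset `eight-bit'.
(char-charset #x3FFF9C)                 ; => eight-bit
```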

If you want even more confusion: you could set up the process so that it
generates unibyte strings and then use decode-coding-string to create
the multibyte string.

> I know that (decode-coding-string "/home/albinus/tmp/\xc2\x9a\ bung" 'utf-8)
> does what I want. But here, the string is a string *constant*, which
> allows to write characters in hex syntax. When I read the string from
> the output buffer (after including the trailing "\ "), this does not work.

Remember, if a string literal contains hexadecimal or octal escape
sequences (all below 256) and no other non-ASCII characters, then the
string automatically becomes a unibyte string:

(multibyte-string-p "\xc3\x9c") => nil

Also consider these examples:

  (decode-coding-string "\xc3\x9c" 'utf-8) => "Ü"
  (decode-coding-string (string #xc3 #x9c) 'utf-8) => "Ã\234"
  (decode-coding-string (string #x3FFFc3 #x3FFF9c) 'utf-8) => "Ü"
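
One way to see why the three inputs behave differently is to look at
their representation before decoding. The following sketch lists, for
each string, whether it is multibyte and which character codes it
holds:

```elisp
;; Collect (MULTIBYTE-P . CHAR-CODES) for each way of building the string.
(mapcar (lambda (s) (cons (multibyte-string-p s) (append s nil)))
        (list "\xc3\x9c"                     ; unibyte literal, two bytes
              (string #xc3 #x9c)             ; characters U+00C3 and U+009C
              (string #x3FFFc3 #x3FFF9c)))   ; two raw-byte characters
;; => ((nil 195 156) (t 195 156) (t 4194243 4194204))
```

Only the first and third strings carry the bytes C3 9C as bytes; the
second holds two ordinary characters that merely share those code
values.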

Helmut


* Re: Parsing of multibyte strings from process output
  2018-05-08 11:00 ` Parsing of multibyte strings from process output Helmut Eller
@ 2018-05-08 12:01   ` Michael Albinus
  2018-05-08 12:21     ` Noam Postavsky
  0 siblings, 1 reply; 5+ messages in thread
From: Michael Albinus @ 2018-05-08 12:01 UTC (permalink / raw)
  To: Helmut Eller; +Cc: help-gnu-emacs

Helmut Eller <eller.helmut@gmail.com> writes:

Hi Helmut,

>> However, I don't know how to parse it that I could retrieve it. All
>> what I have tried returns always the *two* characters ?\xc2 ?\x9a,
>> multibyte encoded. How could I get just the multibyte character ?\x9a
>> from this?
>
> You could use (set-process-coding-system <proc> 'utf-8) if you know that
> the all output of the process is indeed utf-8 encoded.

I've done this already, for other purposes. But it doesn't help, the
string /home/albinus/tmp/\xc2\x9abung is written literally into the
output buffer.

> Alternatively, you could use 'binary as coding system and manually call
> decode-coding-string on the parts that are utf-8 encoded.  However keep
> in mind, that "raw bytes" in multibyte strings have char codes in the
> range #x3FFF00..#x3FFFFF.

I tried that, with no luck. But I didn't know that "raw" bytes are in
that range.

>   (decode-coding-string (string #x3FFFc3 #x3FFF9c) 'utf-8) => "Ü"

That's it! The following code works for me (res-symlink-target holds the
file name from the process output, as shown above):

--8<---------------cut here---------------start------------->8---
(setq res-symlink-target
      ;; Parse multibyte codings.
      (decode-coding-string
       (replace-regexp-in-string
        "\\\\x\\([[:xdigit:]]\\{2\\}\\)"
        (lambda (x)
          (string
           (string-to-number (concat "3FFF" (match-string 1 x)) 16)))
        res-symlink-target)
       'utf-8))
--8<---------------cut here---------------end--------------->8---

Thanks a lot!

> Helmut

Best regards, Michael.


* Re: Parsing of multibyte strings from process output
  2018-05-08 12:01   ` Michael Albinus
@ 2018-05-08 12:21     ` Noam Postavsky
  2018-05-08 12:47       ` Michael Albinus
  0 siblings, 1 reply; 5+ messages in thread
From: Noam Postavsky @ 2018-05-08 12:21 UTC (permalink / raw)
  To: Michael Albinus; +Cc: Help Gnu Emacs mailing list, Helmut Eller

On 8 May 2018 at 08:01, Michael Albinus <michael.albinus@gmx.de> wrote:

> --8<---------------cut here---------------start------------->8---
> (setq res-symlink-target
>       ;; Parse multibyte codings.
>       (decode-coding-string
>        (replace-regexp-in-string
>         "\\\\x\\([[:xdigit:]]\\{2\\}\\)"
>         (lambda (x)
>           (string
>            (string-to-number (concat "3FFF" (match-string 1 x)) 16)))
>         res-symlink-target)
>        'utf-8))
> --8<---------------cut here---------------end--------------->8---

I think making a unibyte string would fit better.

(let ((res-symlink-target "/home/albinus/tmp/\\xc2\\x9abung"))
  (setq res-symlink-target (encode-coding-string
                            res-symlink-target
                            'us-ascii t))
  (decode-coding-string
   (replace-regexp-in-string
    "\\\\x\\([[:xdigit:]]\\{2\\}\\)"
    (lambda (x)
      (unibyte-string
       (string-to-number (match-string 1 x) 16)))
    res-symlink-target)
   'utf-8))

Although, both methods give me "/home/albinus/tmp/\232bung", so I
might be missing something...
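
A quick check of what that result contains, as a sketch: it decodes
the same two bytes and inspects the outcome. Octal 232 equals hex 9a,
so the "\232" shown above is presumably just how the single control
character U+009A is displayed.

```elisp
(let ((decoded (decode-coding-string (unibyte-string #xc2 #x9a) 'utf-8)))
  (list (length decoded)                      ; number of characters
        (multibyte-string-p decoded)          ; representation
        (format "U+%04X" (aref decoded 0))))  ; code point of the character
;; => (1 t "U+009A")
```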


* Re: Parsing of multibyte strings from process output
  2018-05-08 12:21     ` Noam Postavsky
@ 2018-05-08 12:47       ` Michael Albinus
  0 siblings, 0 replies; 5+ messages in thread
From: Michael Albinus @ 2018-05-08 12:47 UTC (permalink / raw)
  To: Noam Postavsky; +Cc: Help Gnu Emacs mailing list, Helmut Eller

Noam Postavsky <npostavs@gmail.com> writes:

> I think making a unibyte string would fit better.
>
> (let ((res-symlink-target "/home/albinus/tmp/\\xc2\\x9abung"))
>   (setq res-symlink-target (encode-coding-string
>                             res-symlink-target
>                             'us-ascii t))
>   (decode-coding-string
>    (replace-regexp-in-string
>     "\\\\x\\([[:xdigit:]]\\{2\\}\\)"
>     (lambda (x)
>       (unibyte-string
>        (string-to-number (match-string 1 x) 16)))
>     res-symlink-target)
>    'utf-8))
>
> Although, both methods give me "/home/albinus/tmp/\232bung", so I
> might be missing something...

Both methods do the same thing, I believe. But I like your snippet much
better than mine, so I will use it (if you don't object).

Your first part (encoding in us-ascii) doesn't seem to be necessary in
my case, because the gio tool prints all non-ascii characters of the
symbolic link file name in the syntax I've shown.
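
Without the us-ascii step, the snippet could be wrapped up roughly like
this. It is a sketch only: the function name is hypothetical, and the
two extra `t' arguments (FIXEDCASE and LITERAL) are an addition not
present in the code above, meant to keep replace-regexp-in-string from
reinterpreting the replacement if an escape such as \x5c (a backslash)
ever shows up.

```elisp
;; Hypothetical helper, not part of Emacs or Tramp.
(defun my-gio-unescape (escaped)
  "Decode textual \\xNN escapes in ESCAPED as UTF-8 encoded bytes."
  (decode-coding-string
   (replace-regexp-in-string
    "\\\\x\\([[:xdigit:]]\\{2\\}\\)"
    (lambda (x)
      ;; Turn the two hex digits into a single byte.
      (unibyte-string (string-to-number (match-string 1 x) 16)))
    escaped
    t t)   ; FIXEDCASE and LITERAL: insert the byte as-is
   'utf-8))

(my-gio-unescape "/home/albinus/tmp/\\xc3\\x9cbung")
;; => "/home/albinus/tmp/Übung"
```

Compared with prefixing "3FFF", going through unibyte-string also
gives a sensible result for an escaped ASCII byte such as \x41, which
would otherwise end up as the unrelated character #x3FFF41.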

Thanks, and best regards, Michael.
