* Fixing Gnus, and string encoding question
@ 2019-04-05 20:47 Eric Abrahamsen
2019-04-06 1:40 ` Noam Postavsky
2019-04-06 8:26 ` Andreas Schwab
0 siblings, 2 replies; 14+ messages in thread
From: Eric Abrahamsen @ 2019-04-05 20:47 UTC (permalink / raw)
To: emacs-devel
So I've made a hash of this change (ha), and am trying to figure out the
best solution.
The problem is that non-ASCII group names are now strings, and are
coming into the system in two different ways: written into .newsrc.eld
with `print-escape-nonascii' set to t, and read off the filesystem using
a buffer with multibyte disabled. The two methods don't match up -- the
strings are different.
Katsumi Yamaoka's example is the group whose decoded name is
"nnml:テスト". This is written to .newsrc.eld as the string:
"nnml:\343\203\206\343\202\271\343\203\210"
Those aren't actual escapes, just backslashes and numbers.
The group name is read from file with `set-buffer-multibyte' nil, using
`read' to pick the group name up as a symbol, then using `symbol-name'
to turn it into a string. The symbol looks like:
nnml:\343\203\206\343\202\271\343\203\210
And the resulting string is:
"nnml:ã\203\206ã\202¹ã\203\210"
Where the escapes are real escapes, I've typed them out here. The two
strings aren't `equal', obviously.
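For illustration, the mismatch can be inspected directly. This is a minimal sketch, assuming the second form holds one character per byte with the same numeric value (as its printed form suggests). The demo- names are hypothetical and not part of Gnus.

```elisp
;; Hypothetical helper, only for inspecting a string.
(defun demo-string-shape (s)
  "Return (MULTIBYTE-P CHARS BYTES) for string S."
  (list (multibyte-string-p s) (length s) (string-bytes s)))

;; Only octal escapes below 256, so the reader builds a unibyte string.
(defvar demo-as-bytes "nnml:\343\203\206\343\202\271\343\203\210")

;; Assumed shape of the second form: characters U+00E3, U+0083, U+0086, ...
(defvar demo-as-chars
  (string ?n ?n ?m ?l ?: #xe3 #x83 #x86 #xe3 #x82 #xb9 #xe3 #x83 #x88))

(demo-string-shape demo-as-bytes)   ; => (nil 14 14)
(demo-string-shape demo-as-chars)   ; => (t 14 23)
(equal demo-as-bytes demo-as-chars) ; => nil
```

Both strings have 14 elements, but one is unibyte and the other multibyte, which is enough for `equal' to return nil.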
I don't know how to turn either of these strings into the other --
either direction would work, but I don't know how.
Another option is to give up messing with strings, and back the changes
halfway out: still use hash tables, but leave the group names as
symbols, with their current funky encoding. That's probably how I should
have sliced these changes to begin with. Then a later step would be to
go straight from symbols to fully decoded strings.
Hoping for some guidance,
Eric
* Re: Fixing Gnus, and string encoding question
2019-04-05 20:47 Fixing Gnus, and string encoding question Eric Abrahamsen
@ 2019-04-06 1:40 ` Noam Postavsky
2019-04-06 2:22 ` Eric Abrahamsen
2019-04-06 8:26 ` Andreas Schwab
1 sibling, 1 reply; 14+ messages in thread
From: Noam Postavsky @ 2019-04-06 1:40 UTC (permalink / raw)
To: Eric Abrahamsen; +Cc: Emacs developers
On Fri, 5 Apr 2019 at 16:50, Eric Abrahamsen <eric@ericabrahamsen.net> wrote:
> Katsumi Yamaoka's example is the group whose decoded name is
> "nnml:テスト". This is written to .newsrc.eld as the string:
>
> "nnml:\343\203\206\343\202\271\343\203\210"
> nnml:\343\203\206\343\202\271\343\203\210
> "nnml:ã\203\206ã\202¹ã\203\210"
> I don't know how to turn either of these strings into the other --
> either direction would work, but I don't know how.
Are you maybe looking for decode-coding-string?
(decode-coding-string
 "nnml:\343\203\206\343\202\271\343\203\210" 'utf-8) ;=> "nnml:テスト"
(decode-coding-string
 (symbol-name (read "nnml:\343\203\206\343\202\271\343\203\210"))
 'utf-8) ;=> "nnml:テスト"
* Re: Fixing Gnus, and string encoding question
2019-04-06 1:40 ` Noam Postavsky
@ 2019-04-06 2:22 ` Eric Abrahamsen
2019-04-06 3:56 ` Noam Postavsky
2019-04-06 6:20 ` Eli Zaretskii
0 siblings, 2 replies; 14+ messages in thread
From: Eric Abrahamsen @ 2019-04-06 2:22 UTC (permalink / raw)
To: Noam Postavsky; +Cc: Emacs developers
Noam Postavsky <npostavs@gmail.com> writes:
> On Fri, 5 Apr 2019 at 16:50, Eric Abrahamsen <eric@ericabrahamsen.net> wrote:
>
>> Katsumi Yamaoka's example is the group whose decoded name is
>> "nnml:テスト". This is written to .newsrc.eld as the string:
>>
>> "nnml:\343\203\206\343\202\271\343\203\210"
>
>> nnml:\343\203\206\343\202\271\343\203\210
>
>> "nnml:ã\203\206ã\202¹ã\203\210"
>
>> I don't know how to turn either of these strings into the other --
>> either direction would work, but I don't know how.
>
> Are you maybe looking for decode-coding-string?
>
> (decode-coding-string
>  "nnml:\343\203\206\343\202\271\343\203\210" 'utf-8) ;=> "nnml:テスト"
>
> (decode-coding-string
>  (symbol-name (read "nnml:\343\203\206\343\202\271\343\203\210"))
>  'utf-8) ;=> "nnml:テスト"
No, unfortunately -- that would make everything much easier. Eventually
the idea will be to decode the strings into plain utf-8-emacs, but for
now I'm stuck keeping them in this weird half-state. I literally need a
conversion between the two versions above.
If this turns out to be too ridiculous, I'll re-slice things as I
mentioned earlier, and leave these group names as symbols.
Eric
* Re: Fixing Gnus, and string encoding question
2019-04-06 2:22 ` Eric Abrahamsen
@ 2019-04-06 3:56 ` Noam Postavsky
2019-04-07 2:32 ` Eric Abrahamsen
2019-04-07 4:10 ` Eric Abrahamsen
2019-04-06 6:20 ` Eli Zaretskii
1 sibling, 2 replies; 14+ messages in thread
From: Noam Postavsky @ 2019-04-06 3:56 UTC (permalink / raw)
To: Eric Abrahamsen; +Cc: Emacs developers
On Fri, 5 Apr 2019 at 22:22, Eric Abrahamsen <eric@ericabrahamsen.net> wrote:
> >> "nnml:\343\203\206\343\202\271\343\203\210"
> >> "nnml:ã\203\206ã\202¹ã\203\210"
> > Are you maybe looking for decode-coding-string?
> No, unfortunately -- that would make everything much easier. Eventually
> the idea will be to decode the strings into plain utf-8-emacs, but for
> now I'm stuck keeping them in this weird half-state. I literally need a
> conversion between the two versions above.
Oh, I missed which two strings you meant. It seems that evaluating the
1st string with C-x C-e prints the second string in the *Messages*
buffer (I initially thought they were the same string), but
printing/inserting it doesn't work the same. The message code prints
one character at a time, and indeed, inserting one character at a time
in lisp works too:
(let ((s "nnml:\343\203\206\343\202\271\343\203\210"))
  (with-temp-buffer
    (mapc #'insert s)
    (buffer-string)))
The following shorter expression also seems to work:
(apply #'string (string-to-list "nnml:\343\203\206\343\202\271\343\203\210"))
And apply #'unibyte-string goes back again:
(let* ((s1 "nnml:\343\203\206\343\202\271\343\203\210")
       (s2 (apply #'string (string-to-list s1))))
  (apply #'unibyte-string (string-to-list s2)))
I can't say I completely understand why all this works though.
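One plausible reading, offered as a sketch rather than a definitive explanation: `string-to-list' on a unibyte string yields byte values, and `string' treats each number as a character code, so byte N becomes the character with code point N. That is the same mapping Latin-1 decoding performs. The demo- function names below are hypothetical.

```elisp
(defun demo-bytes-to-chars (bytes)
  "Turn unibyte BYTES into a string with one character per byte.
Byte value N becomes the character with code point N."
  (apply #'string (string-to-list bytes)))

(defun demo-chars-to-bytes (chars)
  "Inverse of `demo-bytes-to-chars'.  Every character must be below 256."
  (apply #'unibyte-string (string-to-list chars)))

(let* ((bytes "nnml:\343\203\206\343\202\271\343\203\210")
       (chars (demo-bytes-to-chars bytes)))
  (list (nth 5 (string-to-list bytes))                      ; => 227, i.e. #o343
        (equal chars (decode-coding-string bytes 'latin-1)) ; => t
        (equal (demo-chars-to-bytes chars) bytes)))         ; => t
```

The round trip only holds while every character stays below 256; `unibyte-string' signals an error for anything larger.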
* Re: Fixing Gnus, and string encoding question
2019-04-06 2:22 ` Eric Abrahamsen
2019-04-06 3:56 ` Noam Postavsky
@ 2019-04-06 6:20 ` Eli Zaretskii
2019-04-07 2:30 ` Eric Abrahamsen
1 sibling, 1 reply; 14+ messages in thread
From: Eli Zaretskii @ 2019-04-06 6:20 UTC (permalink / raw)
To: Eric Abrahamsen; +Cc: npostavs, emacs-devel
> From: Eric Abrahamsen <eric@ericabrahamsen.net>
> Date: Fri, 05 Apr 2019 19:22:18 -0700
> Cc: Emacs developers <emacs-devel@gnu.org>
>
> > (decode-coding-string
> >  "nnml:\343\203\206\343\202\271\343\203\210" 'utf-8) ;=> "nnml:テスト"
> >
> > (decode-coding-string
> >  (symbol-name (read "nnml:\343\203\206\343\202\271\343\203\210"))
> >  'utf-8) ;=> "nnml:テスト"
>
> No, unfortunately -- that would make everything much easier. Eventually
> the idea will be to decode the strings into plain utf-8-emacs, but for
> now I'm stuck keeping them in this weird half-state. I literally need a
> conversion between the two versions above.
Why do you need to keep these strings undecoded?
* Re: Fixing Gnus, and string encoding question
2019-04-05 20:47 Fixing Gnus, and string encoding question Eric Abrahamsen
2019-04-06 1:40 ` Noam Postavsky
@ 2019-04-06 8:26 ` Andreas Schwab
1 sibling, 0 replies; 14+ messages in thread
From: Andreas Schwab @ 2019-04-06 8:26 UTC (permalink / raw)
To: Eric Abrahamsen; +Cc: emacs-devel
On Apr 05 2019, Eric Abrahamsen <eric@ericabrahamsen.net> wrote:
> The problem is that non-ASCII group names are now strings, and are
> coming into the system in two different ways: written into .newsrc.eld
> with `print-escape-nonascii' set to t,
Why do you need to use the escaped representation?
> The group name is read from file with `set-buffer-multibyte' nil,
Why can't you decode the file contents first, before passing it to read?
Andreas.
--
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510 2552 DF73 E780 A9DA AEC1
"And now for something completely different."
* Re: Fixing Gnus, and string encoding question
2019-04-06 6:20 ` Eli Zaretskii
@ 2019-04-07 2:30 ` Eric Abrahamsen
0 siblings, 0 replies; 14+ messages in thread
From: Eric Abrahamsen @ 2019-04-07 2:30 UTC (permalink / raw)
To: emacs-devel
Eli Zaretskii <eliz@gnu.org> writes:
>> From: Eric Abrahamsen <eric@ericabrahamsen.net>
>> Date: Fri, 05 Apr 2019 19:22:18 -0700
>> Cc: Emacs developers <emacs-devel@gnu.org>
>>
>> > (decode-coding-string
>> >  "nnml:\343\203\206\343\202\271\343\203\210" 'utf-8) ;=> "nnml:テスト"
>> >
>> > (decode-coding-string
>> >  (symbol-name (read "nnml:\343\203\206\343\202\271\343\203\210"))
>> >  'utf-8) ;=> "nnml:テスト"
>>
>> No, unfortunately -- that would make everything much easier. Eventually
>> the idea will be to decode the strings into plain utf-8-emacs, but for
>> now I'm stuck keeping them in this weird half-state. I literally need a
>> conversion between the two versions above.
>
> Why do you need to keep these strings undecoded?
Andreas Schwab <schwab@linux-m68k.org> writes:
> Why do you need to use the escaped representation?
>
>> The group name is read from file with `set-buffer-multibyte' nil,
>
> Why can't you decode the file contents first, before passing it to read?
That's the eventual plan. Gnus had the names encoded because they were
kept as symbols. I didn't want to go in one fell swoop from encoded
strings interned in obarrays to completely decoded strings kept in hash
tables, because I assumed I would screw something up and break Gnus and
annoy everyone. So that worked out well... But I should have done
encoded symbols kept in hash tables as the first intermediate step.
Eric
* Re: Fixing Gnus, and string encoding question
2019-04-06 3:56 ` Noam Postavsky
@ 2019-04-07 2:32 ` Eric Abrahamsen
2019-04-07 4:10 ` Eric Abrahamsen
1 sibling, 0 replies; 14+ messages in thread
From: Eric Abrahamsen @ 2019-04-07 2:32 UTC (permalink / raw)
To: emacs-devel
Noam Postavsky <npostavs@gmail.com> writes:
> On Fri, 5 Apr 2019 at 22:22, Eric Abrahamsen <eric@ericabrahamsen.net> wrote:
>
>> >> "nnml:\343\203\206\343\202\271\343\203\210"
>
>> >> "nnml:ã\203\206ã\202¹ã\203\210"
>
>> > Are you maybe looking for decode-coding-string?
>
>> No, unfortunately -- that would make everything much easier. Eventually
>> the idea will be to decode the strings into plain utf-8-emacs, but for
>> now I'm stuck keeping them in this weird half-state. I literally need a
>> conversion between the two versions above.
>
> Oh, I missed which two strings you meant. It seems that evaluating the
> 1st string with C-x C-e prints the second string in the *Messages*
> buffer (I initially thought they were the same string), but
> printing/inserting it doesn't work the same. The message code prints
> one character at a time, and indeed, inserting one character at a time
> in lisp works too:
>
> (let ((s "nnml:\343\203\206\343\202\271\343\203\210"))
>   (with-temp-buffer
>     (mapc #'insert s)
>     (buffer-string)))
>
> The following shorter expression also seems to work:
>
> (apply #'string (string-to-list "nnml:\343\203\206\343\202\271\343\203\210"))
>
> And apply #'unibyte-string goes back again:
>
> (let* ((s1 "nnml:\343\203\206\343\202\271\343\203\210")
>        (s2 (apply #'string (string-to-list s1))))
>   (apply #'unibyte-string (string-to-list s2)))
>
> I can't say I completely understand why all this works though.
Well that is weird and I would never have discovered it on my own --
thank you! I'm going to try to put together a patch using this now.
Thanks again,
Eric
* Re: Fixing Gnus, and string encoding question
2019-04-06 3:56 ` Noam Postavsky
2019-04-07 2:32 ` Eric Abrahamsen
@ 2019-04-07 4:10 ` Eric Abrahamsen
2019-04-07 7:05 ` Andreas Schwab
` (2 more replies)
1 sibling, 3 replies; 14+ messages in thread
From: Eric Abrahamsen @ 2019-04-07 4:10 UTC (permalink / raw)
To: Noam Postavsky; +Cc: Emacs developers
[-- Attachment #1: Type: text/plain, Size: 2283 bytes --]
Noam Postavsky <npostavs@gmail.com> writes:
> On Fri, 5 Apr 2019 at 22:22, Eric Abrahamsen <eric@ericabrahamsen.net> wrote:
>
>> >> "nnml:\343\203\206\343\202\271\343\203\210"
>
>> >> "nnml:ã\203\206ã\202¹ã\203\210"
>
>> > Are you maybe looking for decode-coding-string?
>
>> No, unfortunately -- that would make everything much easier. Eventually
>> the idea will be to decode the strings into plain utf-8-emacs, but for
>> now I'm stuck keeping them in this weird half-state. I literally need a
>> conversion between the two versions above.
>
> Oh, I missed which two strings you meant. It seems that evaluating the
> 1st string with C-x C-e prints the second string in the *Messages*
> buffer (I initially thought they were the same string), but
> printing/inserting it doesn't work the same. The message code prints
> one character at a time, and indeed, inserting one character at a time
> in lisp works too:
>
> (let ((s "nnml:\343\203\206\343\202\271\343\203\210"))
>   (with-temp-buffer
>     (mapc #'insert s)
>     (buffer-string)))
>
> The following shorter expression also seems to work:
>
> (apply #'string (string-to-list "nnml:\343\203\206\343\202\271\343\203\210"))
>
> And apply #'unibyte-string goes back again:
>
> (let* ((s1 "nnml:\343\203\206\343\202\271\343\203\210")
>        (s2 (apply #'string (string-to-list s1))))
>   (apply #'unibyte-string (string-to-list s2)))
>
> I can't say I completely understand why all this works though.
No, I spoke too soon. It must be another case of a string that doesn't
quite look like what it actually is. The string that looks like
"nnml:\343\203" etc must be something different: when I run your example
using a typed-in version of the string it behaves correctly, but when I
run it with the actual string I'm working with, the apply #'string
doesn't change it.
You can get the string I'm fighting with by saving the attached file and
running:
(with-temp-buffer
  (set-buffer-multibyte t)
  (let ((coding-system-for-read 'raw-text))
    (insert-file-contents "active")
    (goto-char (point-min))
    (symbol-name (read (current-buffer)))))
I'm trying to turn that into something that looks like
"nnml:ã\203\206ã\202¹ã\203\210"
Thanks,
Eric
[-- Attachment #2: active --]
[-- Type: application/octet-stream, Size: 16 bytes --]
テスト 1 1 y
* Re: Fixing Gnus, and string encoding question
2019-04-07 4:10 ` Eric Abrahamsen
@ 2019-04-07 7:05 ` Andreas Schwab
2019-04-07 17:17 ` Eric Abrahamsen
2019-04-07 11:59 ` Noam Postavsky
2019-04-07 12:41 ` Andreas Schwab
2 siblings, 1 reply; 14+ messages in thread
From: Andreas Schwab @ 2019-04-07 7:05 UTC (permalink / raw)
To: Eric Abrahamsen; +Cc: Noam Postavsky, Emacs developers
Symbol names can be unibyte and multibyte. Make sure to get that right.
If you see ã instead of \343 then the symbol has a unibyte name.
Andreas.
--
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510 2552 DF73 E780 A9DA AEC1
"And now for something completely different."
* Re: Fixing Gnus, and string encoding question
2019-04-07 4:10 ` Eric Abrahamsen
2019-04-07 7:05 ` Andreas Schwab
@ 2019-04-07 11:59 ` Noam Postavsky
2019-04-07 12:18 ` Andreas Schwab
2019-04-07 12:41 ` Andreas Schwab
2 siblings, 1 reply; 14+ messages in thread
From: Noam Postavsky @ 2019-04-07 11:59 UTC (permalink / raw)
To: Eric Abrahamsen; +Cc: Emacs developers
On Sun, 7 Apr 2019 at 00:10, Eric Abrahamsen <eric@ericabrahamsen.net> wrote:
> You can get the string I'm fighting with by saving the attached file and
> running:
>
> (with-temp-buffer
>   (set-buffer-multibyte t)
>   (let ((coding-system-for-read 'raw-text))
>     (insert-file-contents "active")
>     (goto-char (point-min))
>     (symbol-name (read (current-buffer)))))
>
> I'm trying to turn that into something that looks like
> "nnml:ã\203\206ã\202¹ã\203\210"
Ah, needs multibyte-char-to-unibyte:
(apply #'string
       (mapcar #'multibyte-char-to-unibyte
               (with-temp-buffer
                 (set-buffer-multibyte t)
                 (let ((coding-system-for-read 'raw-text))
                   (insert-file-contents "active")
                   (goto-char (point-min))
                   (symbol-name (read (current-buffer)))))))
* Re: Fixing Gnus, and string encoding question
2019-04-07 11:59 ` Noam Postavsky
@ 2019-04-07 12:18 ` Andreas Schwab
0 siblings, 0 replies; 14+ messages in thread
From: Andreas Schwab @ 2019-04-07 12:18 UTC (permalink / raw)
To: Noam Postavsky; +Cc: Eric Abrahamsen, Emacs developers
On Apr 07 2019, Noam Postavsky <npostavs@gmail.com> wrote:
> On Sun, 7 Apr 2019 at 00:10, Eric Abrahamsen <eric@ericabrahamsen.net> wrote:
>
>> You can get the string I'm fighting with by saving the attached file and
>> running:
>>
>> (with-temp-buffer
>>   (set-buffer-multibyte t)
>>   (let ((coding-system-for-read 'raw-text))
>>     (insert-file-contents "active")
>>     (goto-char (point-min))
>>     (symbol-name (read (current-buffer)))))
>>
>> I'm trying to turn that into something that looks like
>> "nnml:ã\203\206ã\202¹ã\203\210"
>
> Ah, needs multibyte-char-to-unibyte:
>
> (apply #'string
>        (mapcar #'multibyte-char-to-unibyte
>                (with-temp-buffer
>                  (set-buffer-multibyte t)
>                  (let ((coding-system-for-read 'raw-text))
>                    (insert-file-contents "active")
>                    (goto-char (point-min))
>                    (symbol-name (read (current-buffer)))))))
(encode-coding-string (symbol-name (read (current-buffer))) 'raw-text)
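A rough sketch of how the two suggestions relate, without needing the attached file. It assumes that a name read through `raw-text' in a multibyte buffer consists of raw-byte characters, which `string-to-multibyte' is used to imitate here; the demo- names are hypothetical.

```elisp
(defun demo-raw-to-chars (raw)
  "Map each raw-byte character in RAW to the character with that byte's value."
  (apply #'string (mapcar #'multibyte-char-to-unibyte raw)))

(defun demo-raw-to-bytes (raw)
  "Turn the raw-byte characters in RAW back into a unibyte string."
  (encode-coding-string raw 'raw-text))

(let* ((bytes "nnml:\343\203\206\343\202\271\343\203\210")
       ;; Assumption: stands in for the symbol name read from the file.
       (raw (string-to-multibyte bytes)))
  (list (equal raw bytes)                              ; => nil
        (equal (demo-raw-to-bytes raw) bytes)          ; => t
        (equal (demo-raw-to-chars raw)
               (decode-coding-string bytes 'latin-1)))) ; => t
```

Under that assumption, encoding with `raw-text' lands on the unibyte form written to .newsrc.eld, while the `multibyte-char-to-unibyte' route lands on the one-character-per-byte form.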
Andreas.
--
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510 2552 DF73 E780 A9DA AEC1
"And now for something completely different."
* Re: Fixing Gnus, and string encoding question
2019-04-07 4:10 ` Eric Abrahamsen
2019-04-07 7:05 ` Andreas Schwab
2019-04-07 11:59 ` Noam Postavsky
@ 2019-04-07 12:41 ` Andreas Schwab
2 siblings, 0 replies; 14+ messages in thread
From: Andreas Schwab @ 2019-04-07 12:41 UTC (permalink / raw)
To: Eric Abrahamsen; +Cc: Noam Postavsky, Emacs developers
On Apr 06 2019, Eric Abrahamsen <eric@ericabrahamsen.net> wrote:
> (with-temp-buffer
>   (set-buffer-multibyte t)
>   (let ((coding-system-for-read 'raw-text))
>     (insert-file-contents "active")
>     (goto-char (point-min))
>     (symbol-name (read (current-buffer)))))
>
> I'm trying to turn that into something that looks like
> "nnml:ã\203\206ã\202¹ã\203\210"
(decode-coding-string (symbol-name (read "nnml:\343\203\206\343\202\271\343\203\210")) 'latin-1)
Andreas.
--
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510 2552 DF73 E780 A9DA AEC1
"And now for something completely different."
* Re: Fixing Gnus, and string encoding question
2019-04-07 7:05 ` Andreas Schwab
@ 2019-04-07 17:17 ` Eric Abrahamsen
0 siblings, 0 replies; 14+ messages in thread
From: Eric Abrahamsen @ 2019-04-07 17:17 UTC (permalink / raw)
To: emacs-devel
Andreas Schwab <schwab@linux-m68k.org> writes:
> Symbol names can be unibyte and multibyte. Make sure to get that right.
> If you see ã instead of \343 then the symbol has a unibyte name.
I will meditate on this for a bit.
Andreas Schwab <schwab@linux-m68k.org> writes:
> On Apr 06 2019, Eric Abrahamsen <eric@ericabrahamsen.net> wrote:
>
>> (with-temp-buffer
>>   (set-buffer-multibyte t)
>>   (let ((coding-system-for-read 'raw-text))
>>     (insert-file-contents "active")
>>     (goto-char (point-min))
>>     (symbol-name (read (current-buffer)))))
>>
>> I'm trying to turn that into something that looks like
>> "nnml:ã\203\206ã\202¹ã\203\210"
>
> (decode-coding-string (symbol-name (read "nnml:\343\203\206\343\202\271\343\203\210")) 'latin-1)
That did it! What a relief. I'm not sure why 'latin-1 in particular, but
that aligns all the strings correctly. I'll try to figure that out.
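A possible way to see why 'latin-1 fits, sketched under the assumption that the unibyte literal below matches the bytes in question: Latin-1 assigns byte N to code point N, so decoding with it changes the representation but not the numbers. The demo- names are hypothetical.

```elisp
(defconst demo-name-bytes "nnml:\343\203\206\343\202\271\343\203\210")

(defun demo-first-non-ascii (s)
  "Return the character code at index 5 of S, just past \"nnml:\"."
  (aref s 5))

(demo-first-non-ascii demo-name-bytes)                                 ; => #xe3, a byte
(demo-first-non-ascii (string-to-multibyte demo-name-bytes))           ; => #x3fffe3, a raw byte
(demo-first-non-ascii (decode-coding-string demo-name-bytes 'utf-8))   ; => #x30c6, テ
(demo-first-non-ascii (decode-coding-string demo-name-bytes 'latin-1)) ; => #xe3, ã

;; Encoding with the same coding system returns the original bytes.
(equal (encode-coding-string
        (decode-coding-string demo-name-bytes 'latin-1) 'latin-1)
       demo-name-bytes)                                                ; => t
```

Because the mapping is one to one in both directions, either string can be converted into the other.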
Huge thanks to you and Noam!
Eric