Conversion to unibyte, magic latin-1?

* Conversion to unibyte, magic latin-1?
@ 2019-05-04 15:48 Julian Scheid
  2019-05-04 22:53 ` Stefan Monnier
  0 siblings, 1 reply; 2+ messages in thread
From: Julian Scheid @ 2019-05-04 15:48 UTC
  To: help-gnu-emacs

I'm trying to work out how to calculate the SHA-256 for a binary
string reliably (and efficiently) in Elisp.

Consider this binary string:

    $ printf '\x52\xbc\xdd\x9e' | openssl dgst -sha256
    cb0b03042399237f7fac31d47f98ac0899533d298db3a697af29621b49f86888

`secure-hash' doesn't produce the same result (all tested in 26.2):

    (secure-hash 'sha256 (concat [#x52 #xbc #xdd #x9e]))
    "cfdc1612961dc873079178b92bf0aafaa6bd33731cbaa60841eef163f85074e8"

After studying the C source code I've figured out that this is because
it does multi-byte conversion behind the scenes (by the way, C-h f
secure-hash RET doesn't tell you this.)

Armed with this knowledge, and seeing in the code that no conversion
is done for unibyte strings, I've got it to work with
`string-make-unibyte':

    (secure-hash 'sha256
                 (string-make-unibyte (concat [#x52 #xbc #xdd #x9e])))
    "cb0b03042399237f7fac31d47f98ac0899533d298db3a697af29621b49f86888"

Alas, `string-make-unibyte' is declared obsolete.  The help page tells
me that I should use `encode-coding-string' instead, so I tried that
with a few obvious encodings, but no luck:

    (secure-hash 'sha256
                 (encode-coding-string (concat [#x52 #xbc #xdd #x9e])
                                       'raw-text))
    "cfdc1612961dc873079178b92bf0aafaa6bd33731cbaa60841eef163f85074e8"

    (secure-hash 'sha256
                 (encode-coding-string (concat [#x52 #xbc #xdd #x9e])
                                       'binary))
    "cfdc1612961dc873079178b92bf0aafaa6bd33731cbaa60841eef163f85074e8"

In the end I searched for a coding system that works:

    (let* ((data (concat [#x52 #xbc #xdd #x9e]))
           (ref (secure-hash 'sha256 (string-make-unibyte data))))
      (seq-filter
       (lambda (coding-system)
         (string= (secure-hash 'sha256
                               (encode-coding-string data coding-system))
                  ref))
       (coding-system-list)))
    (latin-1 iso-8859-1 iso-latin-1)

    (secure-hash 'sha256
                 (encode-coding-string (concat [#x52 #xbc #xdd #x9e])
                                       'latin-1))
    "cb0b03042399237f7fac31d47f98ac0899533d298db3a697af29621b49f86888"

This works, but I'm confused... why does latin-1 work but raw-text or
binary doesn't?  More importantly, how do I know that it works
everywhere and will continue to work in the future?  Is latin-1 a
"magic" encoding or does it only happen to work because it matches
with some default coding system set somewhere in my config?

For what it's worth, I can't see a mention of latin-1 anywhere in my
coding system settings (which are all defaults, afaik):

    (list
     default-file-name-coding-system
     default-process-coding-system
     default-keyboard-coding-system
     default-process-coding-system
     default-terminal-coding-system
     coding-system-for-write
     (car coding-category-list))
    (utf-8-unix (utf-8-unix . utf-8-unix) utf-8-unix
     (utf-8-unix . utf-8-unix) utf-8-unix nil coding-category-raw-text)

Could someone shed light on this?


* Re: Conversion to unibyte, magic latin-1?
  2019-05-04 15:48 Conversion to unibyte, magic latin-1? Julian Scheid
@ 2019-05-04 22:53 ` Stefan Monnier
  0 siblings, 0 replies; 2+ messages in thread
From: Stefan Monnier @ 2019-05-04 22:53 UTC
  To: help-gnu-emacs

>     (secure-hash 'sha256 (concat [#x52 #xbc #xdd #x9e]))
>     "cfdc1612961dc873079178b92bf0aafaa6bd33731cbaa60841eef163f85074e8"

(concat [#x52 #xbc #xdd #x9e]) takes the character codes you specified
and interprets them as Unicode chars rather than as bytes.  So that's
the origin of your problem.

You want to use (unibyte-string #x52 #xbc #xdd #x9e) instead.
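
A minimal sketch of the difference, using the four bytes from the
original message.  The helper name `my-sha256-of-bytes' is made up for
illustration, and the digest in the comment is the openssl output quoted
earlier in the thread, not something recomputed here:

```elisp
;; `my-sha256-of-bytes' is a hypothetical helper, not an Emacs function.
(defun my-sha256-of-bytes (&rest bytes)
  "Return the SHA-256 hex digest of BYTES, integers in the range 0-255."
  ;; `unibyte-string' builds a sequence of bytes, so `secure-hash'
  ;; has no characters to convert.
  (secure-hash 'sha256 (apply #'unibyte-string bytes)))

;; `concat' on a vector of integers builds a string of characters:
(multibyte-string-p (concat [#x52 #xbc #xdd #x9e]))        ; t
;; `unibyte-string' builds a string of bytes:
(multibyte-string-p (unibyte-string #x52 #xbc #xdd #x9e))  ; nil

(my-sha256-of-bytes #x52 #xbc #xdd #x9e)
;; Expected to match the openssl digest from the original message:
;; "cb0b03042399237f7fac31d47f98ac0899533d298db3a697af29621b49f86888"
```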

> `string-make-unibyte':

Don't.  This will just paper over problems.

If you have a multibyte string (i.e. a sequence of characters) and you
need to convert it to a unibyte string (i.e. a sequence of bytes), then
you want to use `encode-coding-string` where the CODING-SYSTEM indicates
how to convert each char to a corresponding sequence of bytes.
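
As a rough sketch of that, assuming the data really is text: pick the
coding system explicitly, encode, then hash the resulting bytes.  The
helper name `my-sha256-of-text' is hypothetical:

```elisp
;; `my-sha256-of-text' is a hypothetical helper, not an Emacs function.
(defun my-sha256-of-text (text coding-system)
  "Encode TEXT with CODING-SYSTEM and return the SHA-256 of the bytes."
  (secure-hash 'sha256 (encode-coding-string text coding-system)))

;; The same character becomes different bytes under different coding
;; systems, so the digest depends on the coding system you choose.
;; (string #xe9) is the one-character string "e with acute accent".
(encode-coding-string (string #xe9) 'utf-8)    ; 2 bytes: "\303\251"
(encode-coding-string (string #xe9) 'latin-1)  ; 1 byte:  "\351"
```

Which coding system is right depends on what the consumer of the hash
expects the bytes to be; nothing in the string itself decides that.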

> with a few obvious encodings, but no luck:

`raw-text` is not an obvious encoding.
Encoding with `raw-text` only works in a meaningful way on sequences of
chars where the chars are themselves bytes (these are the char codes
0-127 and #x3fff80-#x3fffff).
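
To illustrate the range above, here is a small sketch, assuming you
start from real bytes.  `string-to-multibyte' turns bytes 128-255 into
those raw-byte chars, and `raw-text' then encodes them back to the same
bytes.  The function name `my-raw-text-round-trip-p' is made up:

```elisp
;; `my-raw-text-round-trip-p' is a hypothetical helper name.
(defun my-raw-text-round-trip-p (bytes)
  "Return t if unibyte string BYTES survives a trip through raw-byte chars."
  (let ((as-chars (string-to-multibyte bytes)))
    (equal (encode-coding-string as-chars 'raw-text) bytes)))

;; Byte #xbc becomes the raw-byte char #x3fffbc, not U+00BC:
(aref (string-to-multibyte (unibyte-string #x52 #xbc #xdd #x9e)) 1)
;; => #x3fffbc

(my-raw-text-round-trip-p (unibyte-string #x52 #xbc #xdd #x9e))
;; => t
```

The chars produced by (concat [#x52 #xbc #xdd #x9e]) are U+00BC and
friends, which lie outside that range, so `raw-text' has no byte to map
them to.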

> This works, but I'm confused... why does latin-1 work but raw-text or
> binary doesn't?

latin-1 will work in some cases, but only by accident.
Better go back to the origin of the problem (why do you end up with
a multibyte string when what you wanted was a unibyte string).
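
A sketch of what "some cases" means here: latin-1 maps the chars
U+0000-U+00FF to the bytes with the same numbers, so it can only
reproduce your bytes when every char code is below 256.  The predicate
name `my-latin-1-safe-p' is hypothetical:

```elisp
(require 'seq)

;; `my-latin-1-safe-p' is a hypothetical helper name.
(defun my-latin-1-safe-p (string)
  "Return non-nil if every character code in STRING is below 256."
  (seq-every-p (lambda (ch) (< ch 256)) string))

(my-latin-1-safe-p (concat [#x52 #xbc #xdd #x9e]))  ; t
(my-latin-1-safe-p (string #x20ac))                 ; nil (EURO SIGN)
```

Your test data passes this check, which is why latin-1 looked like it
worked.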

        Stefan
