* how to calculate the size of string in bytes?
From: Sam Halliday @ 2015-08-18 9:11 UTC
To: help-gnu-emacs

Hi all,

We've had to change the ENSIME protocol to be more friendly to other editors and this has meant changing how we frame TCP messages.

We used to have a 6 character hex number at the start of each message that counted the number of multibyte characters, but we'd like to change it to be the number of bytes in the message.

We're sending the string to `process-send-string' and `read'ing from the associated network buffer. But when calculating the outgoing length of the string that we want to send, we use `length' --- but we need this to be `length-in-bytes' not the number of multibyte chars. Is there a built in function to do this or am I going to have to iterate the string and count the byte size of each character?

A quick test shows that

  (length (encode-coding-string "EURO" 'raw-text))

seems to give the correct result (1 for ASCII, 2 for Pound Sterling, 3 for Euro), but I am not 100% sure if this is correct.

Similarly, when we read from the network, we want to ensure that we `read' numbers of bytes, not multibyte chars.

I *think* we are doing the right thing here, but if somebody could check, that would be greatly appreciated. These are the relevant parts of our Emacs code:

;; https://github.com/ensime/ensime-emacs/blob/master/ensime-client.el#L507
(defun ensime-net-send (sexp proc)
  (let* ((msg (concat (ensime-prin1-to-string sexp) "\n"))
         (string (concat (ensime-net-encode-length (length msg)) msg))
         (coding-system (cdr (process-coding-system proc))))
    (when ensime--debug-messages
      (message "--> %s" sexp))
    (ensime-log-event sexp)
    (process-send-string proc string)))

;; https://github.com/ensime/ensime-emacs/blob/master/ensime-client.el#L584
(defun ensime-net-read ()
  "Read a message from the network buffer."
  (goto-char (point-min))
  (let* ((length (ensime-net-decode-length))
         (start (+ 6 (point)))
         (end (+ start length)))
    (assert (plusp length))
    (goto-char (byte-to-position start))
    (prog1 (read (current-buffer))
      (delete-region (- (byte-to-position start) 6)
                     (byte-to-position end)))))

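The question turns on the difference between counting characters and counting bytes. The following is a minimal sketch of that difference, not taken from ENSIME: the helper name `my-char-and-byte-counts' is made up for illustration, and UTF-8 is assumed as the wire encoding. It also shows what `position-bytes' and `byte-to-position' report in a multibyte buffer, since the reading code above relies on the latter.

```elisp
(require 'cl-lib)

(defun my-char-and-byte-counts (string)
  "Return (CHARS . BYTES) for STRING, where BYTES assumes UTF-8 on the wire.
Hypothetical helper for illustration only."
  (cons (length string)
        (length (encode-coding-string string 'utf-8))))

;; "a" + POUND SIGN + EURO SIGN: 3 characters, 1 + 2 + 3 = 6 bytes in UTF-8.
(my-char-and-byte-counts "a\u00a3\u20ac") ; => (3 . 6)

;; In a multibyte buffer, byte positions and character positions diverge
;; as soon as a non-ASCII character appears.
(with-temp-buffer
  (insert "\u20ac" "a")         ; EURO SIGN followed by "a"
  (list (position-bytes 2)      ; byte position of the 2nd character
        (byte-to-position 4)))  ; character position of the 4th byte
;; => (4 2)
```

Note that `position-bytes' and `byte-to-position' count bytes of the buffer's internal representation, which matches UTF-8 only for text that is valid Unicode.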
* Re: how to calculate the size of string in bytes?
From: tomas @ 2015-08-18 10:13 UTC
To: Sam Halliday; +Cc: help-gnu-emacs

On Tue, Aug 18, 2015 at 02:11:54AM -0700, Sam Halliday wrote:
> Hi all,
>
> We've had to change the ENSIME protocol to be more friendly to other editors and this has meant changing how we frame TCP messages.
>
> We used to have a 6 character hex number at the start of each message that counted the number of multibyte characters, but we'd like to change it to be the number of bytes in the message.
>
> We're sending the string to `process-send-string' and `read'ing from the associated network buffer. But when calculating the outgoing length of the string that we want to send, we use `length' --- but we need this to be `length-in-bytes' not the number of multibyte chars. Is there a built in function to do this or am I going to have to iterate the string and count the byte size of each character?
>
> A quick test shows that
>
> (length (encode-coding-string "EURO" 'raw-text))
>
> seems to give the correct result (1 for ASCII, 2 for Pound Sterling, 3 for Euro), but I am not 100% sure if this is correct.

Raw is, afaik, Emacs's internal coding system. You don't want traces of it in the network :-)

I'd expect you to use whichever coding system the network protocol prescribes (these days it'd be UTF-8 by default). Things will (mostly) work for raw-text since it's nearly UTF-8. The really correct way to do this (AFAICS) would be to find out which encoding process-send-string is going to use (via process-coding-system) and use *that* in the length calculation -- this way you won't lie :-)

So I'd try this (slightly reordering the let*)

  (let* ((msg (concat (ensime-prin1-to-string sexp) "\n"))
         (coding-system (cdr (process-coding-system proc)))
         (string (concat (ensime-net-encode-length
                          (length (encode-coding-string msg coding-system)))
                         msg))
    ...

It seems somewhat wasteful to encode msg (to find its length) just to let process-send-string encode again -- perhaps there's a better idiom around for that. The use case seems common enough. Anyone?

regards
-- tomás

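The suggestion above can be sketched as a standalone helper. This is only an illustration under stated assumptions: `my-frame-message' is a hypothetical name, and the "%06x" header format is a guess at what a 6 character hex length prefix might look like, not ENSIME's actual `ensime-net-encode-length'.

```elisp
(require 'cl-lib)

(defun my-frame-message (msg coding-system)
  "Return MSG prefixed with its encoded byte length as 6 hex digits.
CODING-SYSTEM is the encoding the connection will use on output.
Hypothetical helper; the \"%06x\" header format is an assumption."
  (let ((byte-count (length (encode-coding-string msg coding-system))))
    ;; The count must fit in 6 hex digits.
    (cl-assert (< byte-count #x1000000))
    (concat (format "%06x" byte-count) msg)))

;; EURO SIGN (3 bytes in UTF-8) plus newline (1 byte) gives a count of 4,
;; even though the payload is only 2 characters long.
(my-frame-message "\u20ac\n" 'utf-8) ; => "000004" followed by the payload

;; With a live process PROC one might write:
;;   (process-send-string
;;    proc (my-frame-message msg (cdr (process-coding-system proc))))
```

The returned string is still unencoded text, so `process-send-string' encodes it a second time, which is the waste mentioned above.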
* Re: how to calculate the size of string in bytes?
From: Eli Zaretskii @ 2015-08-18 14:37 UTC
To: help-gnu-emacs

> Date: Tue, 18 Aug 2015 12:13:52 +0200
> From: <tomas@tuxteam.de>
> Cc: help-gnu-emacs@gnu.org
>
> Raw is, afaik, Emacs's internal coding system.

Almost, with the exception of raw bytes and characters from un-unified CJK charsets.

* Re: how to calculate the size of string in bytes?
From: tomas @ 2015-08-18 14:45 UTC
To: Eli Zaretskii; +Cc: help-gnu-emacs

On Tue, Aug 18, 2015 at 05:37:11PM +0300, Eli Zaretskii wrote:
> > Date: Tue, 18 Aug 2015 12:13:52 +0200
> > From: <tomas@tuxteam.de>
> > Cc: help-gnu-emacs@gnu.org
> >
> > Raw is, afaik, Emacs's internal coding system.
>
> Almost, with the exception of raw bytes and characters from un-unified
> CJK charsets.

Right, those get mapped to something non-UTF-8. Thanks for the clarification.

Perhaps you know that off-hand, but I can look it up/try it out: probably encode-coding-string and process-send-string would both fail, if the target coding system is set to UTF-8?

regards
-- tomás

* Re: how to calculate the size of string in bytes?
From: Eli Zaretskii @ 2015-08-18 15:00 UTC
To: help-gnu-emacs

> Date: Tue, 18 Aug 2015 16:45:30 +0200
> Cc: help-gnu-emacs@gnu.org
> From: <tomas@tuxteam.de>
>
> Perhaps you know that off-hand, but I can look it up/try it out: probably
> encode-coding-string and process-send-string would both fail, if the target
> coding system is set to UTF-8?

Why should it fail? In which use case? I'm probably missing something.

* Re: how to calculate the size of string in bytes?
From: tomas @ 2015-08-18 16:01 UTC
To: Eli Zaretskii; +Cc: help-gnu-emacs

On Tue, Aug 18, 2015 at 06:00:02PM +0300, Eli Zaretskii wrote:
> > Date: Tue, 18 Aug 2015 16:45:30 +0200
> > Cc: help-gnu-emacs@gnu.org
> > From: <tomas@tuxteam.de>
> >
> > Perhaps you know that off-hand, but I can look it up/try it out: probably
> > encode-coding-string and process-send-string would both fail, if the target
> > coding system is set to UTF-8?
>
> Why should it fail? In which use case? I'm probably missing
> something.

Perhaps I should do my homework better before asking stupid questions :-)

I was thinking of "characters not expressible in UTF-8". Does Emacs have those?

regards
-- tomás

* Re: how to calculate the size of string in bytes?
From: Eli Zaretskii @ 2015-08-18 16:35 UTC
To: help-gnu-emacs

> Date: Tue, 18 Aug 2015 18:01:45 +0200
> Cc: help-gnu-emacs@gnu.org
> From: <tomas@tuxteam.de>
>
> I was thinking of "characters not expressible in UTF-8". Does Emacs have
> those?

Raw bytes come out as themselves (which might be invalid UTF-8), but that's not a failure, that's the user's fault, because they had those bytes in the buffer to begin with.

* Re: how to calculate the size of string in bytes?
From: tomas @ 2015-08-18 19:30 UTC
To: Eli Zaretskii; +Cc: help-gnu-emacs

On Tue, Aug 18, 2015 at 07:35:03PM +0300, Eli Zaretskii wrote:
> > Date: Tue, 18 Aug 2015 18:01:45 +0200
> > Cc: help-gnu-emacs@gnu.org
> > From: <tomas@tuxteam.de>
> >
> > I was thinking of "characters not expressible in UTF-8". Does Emacs have
> > those?
>
> Raw bytes come out as themselves (which might be invalid UTF-8), but
> that's not a failure, that's the user's fault, because they had those
> bytes in the buffer to begin with.

I was having difficulties in understanding you, so I tried it out.

Now I understand: Emacs's internal (raw) coding system can represent "characters not expressible in utf-8". The function encode-coding-string passes those bytes silently through, outputting an invalid utf-8 sequence.

So I venture the guess that when the Emacs buffer contains something expressible as valid utf-8, 'utf-8 and 'raw are equivalent (what about combining characters?)

Thanks for the insights
-- t

* Re: how to calculate the size of string in bytes?
From: Eli Zaretskii @ 2015-08-18 19:49 UTC
To: help-gnu-emacs

> Date: Tue, 18 Aug 2015 21:30:49 +0200
> Cc: help-gnu-emacs@gnu.org
> From: <tomas@tuxteam.de>
>
> I was having difficulties in understanding you

Sorry about that. It's a complex issue to explain in a few words.

> Now I understand: Emacs's internal (raw) coding system can represent
> "characters not expressible in utf-8".

More accurately, it can represent characters outside the Unicode code space.

And please don't call that "raw"; the internal representation of characters used by Emacs is known as 'utf-8-emacs'.

> The function encode-coding-string passes those bytes silently
> through, outputting an invalid utf-8 sequence.

Yes. Although in interactive functions Emacs will normally complain and ask for a better encoding.

> So I venture the guess that when the Emacs buffer contains something
> expressible as valid utf-8, 'utf-8 and 'raw are equivalent

Yes.

> (what about combining characters?)

Emacs doesn't normalize/compose/decompose characters when it encodes text (with a notable exception of the utf-8-hfs encoding). Applications that want this should do that themselves, e.g. using the facilities in ucs-normalize.el.

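The point about combining characters affects byte counts too: the same visible text can have different encoded lengths depending on its normalization form. A small sketch, assuming UTF-8 on the wire; `my-utf8-byte-length' is a hypothetical helper, while `ucs-normalize-NFC-string' comes from ucs-normalize.el.

```elisp
(require 'cl-lib)
(require 'ucs-normalize)

(defun my-utf8-byte-length (string)
  "Return the number of bytes STRING occupies when encoded as UTF-8.
Hypothetical helper for illustration."
  (length (encode-coding-string string 'utf-8)))

(let* ((decomposed "e\u0301")   ; "e" + COMBINING ACUTE ACCENT
       (composed (ucs-normalize-NFC-string decomposed)))
  ;; Encoding does not normalize, so the two forms differ in size:
  ;; 1 + 2 bytes decomposed, 2 bytes for the precomposed U+00E9.
  (list (my-utf8-byte-length decomposed)
        (my-utf8-byte-length composed)))
;; => (3 2)
```

Whichever form is sent, the length header has to be computed from exactly the string that goes on the wire.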
* Re: how to calculate the size of string in bytes?
From: tomas @ 2015-08-18 20:11 UTC
To: Eli Zaretskii; +Cc: help-gnu-emacs

On Tue, Aug 18, 2015 at 10:49:58PM +0300, Eli Zaretskii wrote:
> > Date: Tue, 18 Aug 2015 21:30:49 +0200
> > Cc: help-gnu-emacs@gnu.org
> > From: <tomas@tuxteam.de>
> >
> > I was having difficulties in understanding you
>
> Sorry about that. It's a complex issue to explain in a few words.

No need to be sorry. The fault's on me -- once I did my homework things improved :-)

Thanks for your patience: very much appreciated.

> > Now I understand: Emacs's internal (raw) coding system can represent
> > "characters not expressible in utf-8".
>
> More accurately, it can represent characters outside the Unicode code
> space.
>
> And please don't call that "raw"; the internal representation of
> characters used by Emacs is known as 'utf-8-emacs'.

Ah, OK. Point taken.

> > The function encode-coding-string passes those bytes silently
> > through, outputting an invalid utf-8 sequence.
>
> Yes. Although in interactive functions Emacs will normally complain
> and ask for a better encoding.

Understood

> > So I venture the guess that when the Emacs buffer contains something
> > expressible as valid utf-8, 'utf-8 and 'raw are equivalent
>
> Yes.
>
> > (what about combining characters?)
>
> Emacs doesn't normalize/compose/decompose characters when it encodes
> text (with a notable exception of the utf-8-hfs encoding).
> Applications that want this should do that themselves, e.g. using the
> facilities in ucs-normalize.el.

Thanks: I learned quite a bit now :-)

regards
-- tomás

* Re: how to calculate the size of string in bytes?
From: Stefan Monnier @ 2015-08-18 21:47 UTC
To: help-gnu-emacs

> It seems somewhat wasteful to encode msg (to find its length) just
> to let process-send-string encode again -- perhaps there's a better
> idiom around for that.

Yup: communicate with the process using bytes rather than chars!

I.e. set the process's coding system to binary.

Then you just need to call (encode-coding-string msg coding-system) once to get the bytes and you send them as is: they won't be re-encoded.

        Stefan

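A rough sketch of the encode-once approach described above. The names `my-encode-frame', `my-use-binary-io' and `my-send-frame' are hypothetical, the "%06x" header format is an assumption, and UTF-8 is assumed as the payload encoding; none of this is ENSIME's actual code.

```elisp
(require 'cl-lib)

(defun my-encode-frame (msg coding-system)
  "Encode MSG once and return the bytes to send: 6 hex digits + payload.
The header counts the bytes of the encoded payload.
Hypothetical helper; the header format is an assumption."
  (let ((bytes (encode-coding-string msg coding-system)))
    (concat (format "%06x" (length bytes)) bytes)))

(defun my-use-binary-io (proc)
  "Make PROC exchange raw bytes: no decoding on input, no encoding on output.
Intended to be called once, right after the connection is opened."
  (set-process-coding-system proc 'binary 'binary))

(defun my-send-frame (proc msg)
  "Send MSG to PROC as pre-encoded UTF-8 bytes.
Assumes `my-use-binary-io' was already called on PROC."
  (process-send-string proc (my-encode-frame msg 'utf-8)))

;; EURO SIGN + newline: 6 header bytes + 4 payload bytes.
(length (my-encode-frame "\u20ac\n" 'utf-8)) ; => 10
```

With a binary coding system the incoming data is not decoded either, so the reading side would presumably have to call `decode-coding-string' or `decode-coding-region' on each payload before handing it to `read'.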
* Re: how to calculate the size of string in bytes?
From: tomas @ 2015-08-19 5:43 UTC
To: Stefan Monnier; +Cc: help-gnu-emacs

On Tue, Aug 18, 2015 at 05:47:27PM -0400, Stefan Monnier wrote:
> > It seems somewhat wasteful to encode msg (to find its length) just
> > to let process-send-string encode again -- perhaps there's a better
> > idiom around for that.
>
> Yup: communicate with the process using bytes rather than chars!
>
> I.e. set the process's coding system to binary.
>
> Then you just need to call (encode-coding-string msg coding-system) once
> to get the bytes and you send them as is: they won't be re-encoded.

(pats forehead)

Of course! Thanks, Stefan.

regards
-- t

* Re: how to calculate the size of string in bytes?
From: Sam Halliday @ 2015-08-19 8:57 UTC
To: help-gnu-emacs

On Tuesday, 18 August 2015 22:47:44 UTC+1, Stefan Monnier wrote:
> > It seems somewhat wasteful to encode msg (to find its length) just
> > to let process-send-string encode again -- perhaps there's a better
> > idiom around for that.
>
> Yup: communicate with the process using bytes rather than chars!
>
> I.e. set the process's coding system to binary.
>
> Then you just need to call (encode-coding-string msg coding-system) once
> to get the bytes and you send them as is: they won't be re-encoded.
>
>
> Stefan

Heh, that's actually a very good suggestion. We'll keep that in mind if this is ever a performance bottleneck. We're hoping to move the ENSIME protocol (based on SWANK) over to S-Expressions over WebSockets which will mean we can just delete all this code.

* Re: how to calculate the size of string in bytes?
From: Sam Halliday @ 2015-08-19 9:22 UTC
To: help-gnu-emacs

Actually, one question Stefan. An advantage of the string encodings is that we're pretty confident that a newline will flush the network buffer. How do we make sure that a binary encoding will do the same? (or is there no buffering and we're worrying about nothing)

On Wednesday, 19 August 2015 09:57:38 UTC+1, Sam Halliday wrote:
> On Tuesday, 18 August 2015 22:47:44 UTC+1, Stefan Monnier wrote:
> > > It seems somewhat wasteful to encode msg (to find its length) just
> > > to let process-send-string encode again -- perhaps there's a better
> > > idiom around for that.
> >
> > Yup: communicate with the process using bytes rather than chars!
> >
> > I.e. set the process's coding system to binary.
> >
> > Then you just need to call (encode-coding-string msg coding-system) once
> > to get the bytes and you send them as is: they won't be re-encoded.
> >
> >
> > Stefan
>
> Heh, that's actually a very good suggestion. We'll keep that in mind if this is ever a performance bottleneck. We're hoping to move the ENSIME protocol (based on SWANK) over to S-Expressions over WebSockets which will mean we can just delete all this code.

* Re: how to calculate the size of string in bytes?
From: Stefan Monnier @ 2015-08-19 19:47 UTC
To: help-gnu-emacs

> Heh, that's actually a very good suggestion. We'll keep that in mind if this
> is ever a performance bottleneck.

Actually, I recommend it for sanity reasons rather than performance reasons. It'll help you make sure the right encoding is used for the right data, and the "counts" do count the right elements as well.

        Stefan

* Re: how to calculate the size of string in bytes?
From: Sam Halliday @ 2015-08-18 10:43 UTC
To: help-gnu-emacs

On Tuesday, 18 August 2015 11:14:04 UTC+1, to...@tuxteam.de wrote:
> On Tue, Aug 18, 2015 at 02:11:54AM -0700, Sam Halliday wrote:
> > We used to have a 6 character hex number at the start of each message that counted the number of multibyte characters, but we'd like to change it to be the number of bytes in the message.
> >
> > We're sending the string to `process-send-string' and `read'ing from the associated network buffer. But when calculating the outgoing length of the string that we want to send, we use `length' --- but we need this to be `length-in-bytes' not the number of multibyte chars. Is there a built in function to do this or am I going to have to iterate the string and count the byte size of each character?
> >
> > A quick test shows that
> >
> > (length (encode-coding-string "EURO" 'raw-text))
> >
> > seems to give the correct result (1 for ASCII, 2 for Pound Sterling, 3 for Euro), but I am not 100% sure if this is correct.
>
> Raw is, afaik, Emacs's internal coding system. You don't want traces of it
> in the network :-)

We're not sending the message using raw, we're using UTF-8. But I need to calculate the length of the UTF-8 string IN BYTES as part of the payload (each message begins with a 6 character hex encoding of the following string's raw length).

I'm using "raw" to calculate an approximation of the UTF-8 string's byte length, but I am aware that it might not actually be true in the general case :-/

I don't think what you've suggested would actually change the semantics, but it would allow us to use a different encoding on the wire than the encoding of the string. We don't really need to worry about that at this stage, because all our users are using UTF-8. We'll keep it in mind though.

* Re: how to calculate the size of string in bytes?
From: tomas @ 2015-08-18 11:47 UTC
To: Sam Halliday; +Cc: help-gnu-emacs

On Tue, Aug 18, 2015 at 03:43:44AM -0700, Sam Halliday wrote:
> On Tuesday, 18 August 2015 11:14:04 UTC+1, to...@tuxteam.de wrote:
> > On Tue, Aug 18, 2015 at 02:11:54AM -0700, Sam Halliday wrote:
> > > We used to have a 6 character hex number at the start of each message that counted the number of multibyte characters, but we'd like to change it to be the number of bytes in the message.
> > >
> > > We're sending the string to `process-send-string' and `read'ing from the associated network buffer. But when calculating the outgoing length of the string that we want to send, we use `length' --- but we need this to be `length-in-bytes' not the number of multibyte chars. Is there a built in function to do this or am I going to have to iterate the string and count the byte size of each character?
> > >
> > > A quick test shows that
> > >
> > > (length (encode-coding-string "EURO" 'raw-text))
> > >
> > > seems to give the correct result (1 for ASCII, 2 for Pound Sterling, 3 for Euro), but I am not 100% sure if this is correct.
> >
> > Raw is, afaik, Emacs's internal coding system. You don't want traces of it
> > in the network :-)
>
> We're not sending the message using raw, we're using UTF-8. But I need to calculate the length of the UTF-8 string IN BYTES as part of the payload (each message begins with a 6 character hex encoding of the following string's raw length).

Yes, I get that. The way I understand encode-coding-string is that you give it the target encoding:

  (length (encode-coding-string foo 'raw-text))

would mean "transform this string to whatever Emacs uses as internal encoding and measure its length in bytes", whereas what you want is, AFAIU "transform this string to UTF-8 and measure its length in bytes", which would read as:

  (length (encode-coding-string foo 'utf-8))

> I'm using "raw" to calculate an approximation of the UTF-8 string's byte length, but I am aware that it might not actually be true in the general case :-/

Use utf-8 then?

> I don't think what you've suggested would actually change the semantics, but it would allow us to use a different encoding on the wire than the encoding of the string. We don't really need to worry about that at this stage, because all our users are using UTF-8. We'll keep it in mind though.

But, but... isn't that a bug lurking? And it would be so easy to fix... (that is unrelated to the above issue -- that I think you want utf-8 instead of raw)

Regards
-- tomás

* Re: how to calculate the size of string in bytes?
From: Sam Halliday @ 2015-08-18 12:06 UTC
To: help-gnu-emacs

On Tuesday, 18 August 2015 12:47:15 UTC+1, to...@tuxteam.de wrote:
> > We're not sending the message using raw, we're using UTF-8. But I need to calculate the length of the UTF-8 string IN BYTES as part of the payload (each message begins with a 6 character hex encoding of the following string's raw length).
>
> Yes, I get that. The way I understand encode-coding-string is that you give
> it the target encoding:
>
> (length (encode-coding-string foo 'raw-text))

Aah, ok, I didn't get what you were saying. I thought `utf-8' here would just give me back the original.

OK, so I really need

  (length (encode-coding-string "EURO" 'utf-8))

and actually, since the process can be using a different encoding, I need

  (length (encode-coding-string "EURO" my-encoding))

Thanks! I already pushed a quick fix, but this seems more solid.

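To make the conclusion above concrete, here is a small sketch showing that the byte count depends on the coding system actually used, which is why the length should be computed with the connection's own encoding. `my-byte-lengths' is a hypothetical helper, and the three coding systems are picked only for illustration.

```elisp
(require 'cl-lib)

(defun my-byte-lengths (string coding-systems)
  "Return an alist of (CODING-SYSTEM . BYTE-LENGTH) for STRING.
Hypothetical helper for illustration."
  (mapcar (lambda (cs)
            (cons cs (length (encode-coding-string string cs))))
          coding-systems))

;; POUND SIGN (U+00A3): two bytes in UTF-8, one in Latin-1,
;; two in UTF-16LE (which is written without a byte order mark).
(my-byte-lengths "\u00a3" '(utf-8 latin-1 utf-16le))
;; => ((utf-8 . 2) (latin-1 . 1) (utf-16le . 2))
```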
* Re: how to calculate the size of string in bytes?
From: Eli Zaretskii @ 2015-08-18 14:34 UTC
To: help-gnu-emacs

> Date: Tue, 18 Aug 2015 02:11:54 -0700 (PDT)
> From: Sam Halliday <sam.halliday@gmail.com>
>
> Hi all,
>
> We've had to change the ENSIME protocol to be more friendly to other editors and this has meant changing how we frame TCP messages.
>
> We used to have a 6 character hex number at the start of each message that counted the number of multibyte characters, but we'd like to change it to be the number of bytes in the message.
>
> We're sending the string to `process-send-string' and `read'ing from the associated network buffer. But when calculating the outgoing length of the string that we want to send, we use `length' --- but we need this to be `length-in-bytes' not the number of multibyte chars. Is there a built in function to do this or am I going to have to iterate the string and count the byte size of each character?

Emacs 25 has bufferpos-to-filepos, which I think does what you want.

> A quick test shows that
>
> (length (encode-coding-string "EURO" 'raw-text))
>
> seems to give the correct result (1 for ASCII, 2 for Pound Sterling, 3 for Euro), but I am not 100% sure if this is correct.

It will fail if the string includes some exotic characters or raw bytes.
