string-bytes and coding systems


* string-bytes and coding systems
@ 2017-03-08 23:17 Eric Abrahamsen
  2017-03-09  7:46 ` hector
  2017-03-09 16:01 ` Eli Zaretskii
  0 siblings, 2 replies; 16+ messages in thread
From: Eric Abrahamsen @ 2017-03-08 23:17 UTC
  To: help-gnu-emacs

I'm writing a function that's supposed to wrap too-long text lines; the
RFC says anything over 75 octets (excluding eol) needs to be wrapped,
but multibyte characters must not be split.

Everything seems to be working fine, but I want to make sure I'm not
making any dangerous assumptions about `string-bytes' and encoding.

I'm essentially taking the `string-bytes' of each line, and if it's too
long, popping characters off the end until it's fewer than 75 bytes.
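
A minimal sketch of that trimming step, assuming the limit is checked
with `string-bytes' on a multibyte string. The name `my/trim-to-bytes'
is made up for illustration and does not come from any package:

```elisp
;;; -*- lexical-binding: t; -*-
(defun my/trim-to-bytes (line limit)
  "Split LINE into (HEAD . REST), with HEAD at most LIMIT bytes.
Bytes are counted with `string-bytes', i.e. in Emacs' internal
representation, so a multibyte character is never cut in half."
  (let ((end (length line)))
    ;; Drop one character at a time from the end until HEAD fits.
    (while (and (> end 0)
                (> (string-bytes (substring line 0 end)) limit))
      (setq end (1- end)))
    (cons (substring line 0 end) (substring line end))))

;; "a" is 1 byte, "\u00e9" (e-acute) is 2, "\u65e5" is 3.
(my/trim-to-bytes "a\u00e9\u65e5" 3)   ; => ("aé" . "日")
```

Whether the byte count seen here matches what ends up in the file is
exactly the open question.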

My understanding/assumption is that `string-bytes' returns the number of
bytes according to Emacs' internal coding system, which is close enough
to utf-8 to make no difference. When this text gets written to file it
will also be encoded as utf-8, ergo testing string lengths with
`string-bytes' is going to always produce the right results in the final
file.

Have I understood things correctly?

Thanks!
Eric


* Re: string-bytes and coding systems
  2017-03-08 23:17 string-bytes and coding systems Eric Abrahamsen
@ 2017-03-09  7:46 ` hector
  2017-03-09  7:54   ` Yuri Khan
  2017-03-09 16:01 ` Eli Zaretskii
  1 sibling, 1 reply; 16+ messages in thread
From: hector @ 2017-03-09  7:46 UTC
  To: Eric Abrahamsen; +Cc: help-gnu-emacs

On Wed, Mar 08, 2017 at 03:17:07PM -0800, Eric Abrahamsen wrote:
> I'm writing a function that's supposed to wrap too-long text lines; the
> RFC says anything over 75 octets (excluding eol) needs to be wrapped,
> but multibyte characters must not be split.

Why don't you just use fill-paragraph?




* Re: string-bytes and coding systems
  2017-03-09  7:46 ` hector
@ 2017-03-09  7:54   ` Yuri Khan
  2017-03-09  9:23     ` hector
  0 siblings, 1 reply; 16+ messages in thread
From: Yuri Khan @ 2017-03-09  7:54 UTC
  To: hector; +Cc: Eric Abrahamsen, help-gnu-emacs@gnu.org

On Thu, Mar 9, 2017 at 2:46 PM, hector <hectorlahoz@gmail.com> wrote:
> On Wed, Mar 08, 2017 at 03:17:07PM -0800, Eric Abrahamsen wrote:
>> I'm writing a function that's supposed to wrap too-long text lines; the
>> RFC says anything over 75 octets (excluding eol) needs to be wrapped,
>> but multibyte characters must not be split.
>
> Why don't you just use fill-paragraph?

Because that works in terms of characters, not octets?




* Re: string-bytes and coding systems
  2017-03-09  7:54   ` Yuri Khan
@ 2017-03-09  9:23     ` hector
  2017-03-09 17:36       ` Eric Abrahamsen
  0 siblings, 1 reply; 16+ messages in thread
From: hector @ 2017-03-09  9:23 UTC
  To: Yuri Khan; +Cc: help-gnu-emacs

On Thu, Mar 09, 2017 at 02:54:16PM +0700, Yuri Khan wrote:
> On Thu, Mar 9, 2017 at 2:46 PM, hector <hectorlahoz@gmail.com> wrote:
> > On Wed, Mar 08, 2017 at 03:17:07PM -0800, Eric Abrahamsen wrote:
> >> I'm writing a function that's supposed to wrap too-long text lines; the
> >> RFC says anything over 75 octets (excluding eol) needs to be wrapped,
> >> but multibyte characters must not be split.
> >
> > Why don't you just use fill-paragraph?
> 
> Because that works in terms of characters, not octets?

Right.
It would be weird for a user command to work in terms of octets, wouldn't it?




* Re: string-bytes and coding systems
  2017-03-08 23:17 string-bytes and coding systems Eric Abrahamsen
  2017-03-09  7:46 ` hector
@ 2017-03-09 16:01 ` Eli Zaretskii
  2017-03-09 17:35   ` Eric Abrahamsen
  2017-03-10  9:02   ` Stefan Monnier
  1 sibling, 2 replies; 16+ messages in thread
From: Eli Zaretskii @ 2017-03-09 16:01 UTC
  To: help-gnu-emacs

> From: Eric Abrahamsen <eric@ericabrahamsen.net>
> Date: Wed, 08 Mar 2017 15:17:07 -0800
> 
> I'm essentially taking the `string-bytes' of each line, and if it's too
> long, popping characters off the end until it's fewer than 75 bytes.
> 
> My understanding/assumption is that `string-bytes' returns the number of
> bytes according to Emacs' internal coding system

Yes.

> which is close enough to utf-8 to make no difference.

No.  The deviations from UTF-8 could be significant in some cases,
with some exotic characters and with raw bytes.
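
A small sketch of the raw-byte case, comparing the internal size with
the size after encoding. The helper name `my/byte-counts' and the
sample byte #xE9 (octal 351) are assumptions made for the example:

```elisp
;;; -*- lexical-binding: t; -*-
(defun my/byte-counts (string)
  "Return (INTERNAL . ENCODED) byte counts for STRING.
INTERNAL is what `string-bytes' reports; ENCODED is the length of
STRING after encoding it as UTF-8."
  (cons (string-bytes string)
        (length (encode-coding-string string 'utf-8))))

;; Ordinary text: both counts agree.
(my/byte-counts "caf\u00e9")                   ; => (5 . 5)

;; A raw byte held in a multibyte string takes two bytes internally,
;; but is written out as the single original byte.
(my/byte-counts (string-to-multibyte "\351"))  ; => (2 . 1)
```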

> When this text gets written to file it will also be encoded as
> utf-8, ergo testing string lengths with `string-bytes' is going to
> always produce the right results in the final file.

I suggest using filepos-to-bufferpos to find where to break text into
lines.




* Re: string-bytes and coding systems
  2017-03-09 16:01 ` Eli Zaretskii
@ 2017-03-09 17:35   ` Eric Abrahamsen
  2017-03-10  9:02   ` Stefan Monnier
  1 sibling, 0 replies; 16+ messages in thread
From: Eric Abrahamsen @ 2017-03-09 17:35 UTC
  To: help-gnu-emacs

Eli Zaretskii <eliz@gnu.org> writes:

>> From: Eric Abrahamsen <eric@ericabrahamsen.net>
>> Date: Wed, 08 Mar 2017 15:17:07 -0800
>> 
>> I'm essentially taking the `string-bytes' of each line, and if it's too
>> long, popping characters off the end until it's fewer than 75 bytes.
>> 
>> My understanding/assumption is that `string-bytes' returns the number of
>> bytes according to Emacs' internal coding system
>
> Yes.
>
>> which is close enough to utf-8 to make no difference.
>
> No.  The deviations from UTF-8 could be significant in some cases,
> with some exotic characters and with raw bytes.

Good to know.

>> When this text gets written to file it will also be encoded as
>> utf-8, ergo testing string lengths with `string-bytes' is going to
>> always produce the right results in the final file.
>
> I suggest using filepos-to-bufferpos to find where to break text into
> lines.

I'll look into that. Thank you!

Eric





* Re: string-bytes and coding systems
  2017-03-09  9:23     ` hector
@ 2017-03-09 17:36       ` Eric Abrahamsen
  2017-03-10  4:39         ` Thien-Thi Nguyen
  2017-03-10  4:59         ` Alexis
  0 siblings, 2 replies; 16+ messages in thread
From: Eric Abrahamsen @ 2017-03-09 17:36 UTC
  To: help-gnu-emacs

hector <hectorlahoz@gmail.com> writes:

> On Thu, Mar 09, 2017 at 02:54:16PM +0700, Yuri Khan wrote:
>> On Thu, Mar 9, 2017 at 2:46 PM, hector <hectorlahoz@gmail.com> wrote:
>> > On Wed, Mar 08, 2017 at 03:17:07PM -0800, Eric Abrahamsen wrote:
>> >> I'm writing a function that's supposed to wrap too-long text lines; the
>> >> RFC says anything over 75 octets (excluding eol) needs to be wrapped,
>> >> but multibyte characters must not be split.
>> >
>> > Why don't you just use fill-paragraph?
>> 
>> Because that works in terms of characters, not octets?
>
> Right.
> It would be weird for a user command to work in terms of octets, wouldn't it?

It's not really a user command, I'm exporting vCard objects to a file.





* Re: string-bytes and coding systems
  2017-03-09 17:36       ` Eric Abrahamsen
@ 2017-03-10  4:39         ` Thien-Thi Nguyen
  2017-03-10  6:36           ` Eric Abrahamsen
  2017-03-10  4:59         ` Alexis
  1 sibling, 1 reply; 16+ messages in thread
From: Thien-Thi Nguyen @ 2017-03-10  4:39 UTC
  To: help-gnu-emacs



() Eric Abrahamsen <eric@ericabrahamsen.net>
() Thu, 09 Mar 2017 09:36:03 -0800

   It's not really a user command, I'm exporting vCard objects
   to a file.

I'm interested in having bindat.el exercised more, hence the
tangential question: Do you think this could be implemented
using bindat.el?

-- 
Thien-Thi Nguyen -----------------------------------------------
 (defun responsep (query)
   (pcase (context query)
     (`(technical ,ml) (correctp ml))
     ...))                              748E A0E8 1CB8 A748 9BFA
--------------------------------------- 6CE4 6703 2224 4C80 7502




* Re: string-bytes and coding systems
  2017-03-09 17:36       ` Eric Abrahamsen
  2017-03-10  4:39         ` Thien-Thi Nguyen
@ 2017-03-10  4:59         ` Alexis
  2017-03-10  6:10           ` Eric Abrahamsen
  1 sibling, 1 reply; 16+ messages in thread
From: Alexis @ 2017-03-10  4:59 UTC
  To: Eric Abrahamsen; +Cc: help-gnu-emacs

Eric Abrahamsen <eric@ericabrahamsen.net> writes:

> It's not really a user command, I'm exporting vCard objects to a file.

Might my `org-vcard` package be of any help to you?

https://github.com/flexibeast/org-vcard

It very much needs refactoring to separate out the vCard
input-and-output stuff from everything else - i've been struggling to
find the time and energy to do so - but maybe you'll find some of the
machinery useful?

Alexis.


* Re: string-bytes and coding systems
  2017-03-10  4:59         ` Alexis
@ 2017-03-10  6:10           ` Eric Abrahamsen
  0 siblings, 0 replies; 16+ messages in thread
From: Eric Abrahamsen @ 2017-03-10  6:10 UTC
  To: help-gnu-emacs

Alexis <flexibeast@gmail.com> writes:

> Eric Abrahamsen <eric@ericabrahamsen.net> writes:
>
>> It's not really a user command, I'm exporting vCard objects to a file.
>
> Might my `org-vcard` package be of any help to you?
>
> https://github.com/flexibeast/org-vcard
>
> It very much needs refactoring to separate out the vCard
> input-and-output stuff from everything else - i've been struggling to
> find the time and energy to do so - but maybe you'll find some of the
> machinery useful?

Yes! I didn't know this was out there -- it's always good to see other
people's approach to the same problems. Thanks!

It's true your package is closely tied to Org mode. I'm sure I'll still
be able to take some pointers from it, though. Thank god I'm not trying
to support vCard 2.1.

Here's where I'm at now. It does the escaping, but the line folding
(while implemented) isn't actually called; I'm still chewing on Eli's
pointers.

https://github.com/girzel/ebdb/blob/master/ebdb-vcard.el

Eric


* Re: string-bytes and coding systems
  2017-03-10  4:39         ` Thien-Thi Nguyen
@ 2017-03-10  6:36           ` Eric Abrahamsen
  0 siblings, 0 replies; 16+ messages in thread
From: Eric Abrahamsen @ 2017-03-10  6:36 UTC
  To: help-gnu-emacs

Thien-Thi Nguyen <ttn@gnu.org> writes:

> () Eric Abrahamsen <eric@ericabrahamsen.net>
> () Thu, 09 Mar 2017 09:36:03 -0800
>
>    It's not really a user command, I'm exporting vCard objects
>    to a file.
>
> I'm interested in having bindat.el exercised more, hence the
> tangential question: Do you think this could be implemented
> using bindat.el?

I'm ashamed to say that I understand binary only enough to see why I
have problems with encoding. I aspire to someday use `logior' in anger.
I don't understand this library!





* Re: string-bytes and coding systems
  2017-03-09 16:01 ` Eli Zaretskii
  2017-03-09 17:35   ` Eric Abrahamsen
@ 2017-03-10  9:02   ` Stefan Monnier
  2017-03-10 16:37     ` Eric Abrahamsen
  1 sibling, 1 reply; 16+ messages in thread
From: Stefan Monnier @ 2017-03-10  9:02 UTC
  To: help-gnu-emacs

>> When this text gets written to file it will also be encoded as
>> utf-8, ergo testing string lengths with `string-bytes' is going to
>> always produce the right results in the final file.
> I suggest using filepos-to-bufferpos to find where to break text into
> lines.

BTW, filepos-to-bufferpos uses string-bytes (or equivalent data, with
the same caveat for raw bytes) for utf-8 buffers ;-)


        Stefan





* Re: string-bytes and coding systems
  2017-03-10  9:02   ` Stefan Monnier
@ 2017-03-10 16:37     ` Eric Abrahamsen
  2017-03-10 18:26       ` Stefan Monnier
  0 siblings, 1 reply; 16+ messages in thread
From: Eric Abrahamsen @ 2017-03-10 16:37 UTC
  To: help-gnu-emacs

Stefan Monnier <monnier@iro.umontreal.ca> writes:

>>> When this text gets written to file it will also be encoded as
>>> utf-8, ergo testing string lengths with `string-bytes' is going to
>>> always produce the right results in the final file.
>> I suggest using filepos-to-bufferpos to find where to break text into
>> lines.
>
> BTW, filepos-to-bufferpos uses string-bytes (or equivalent data, with
> the same caveat for raw bytes) for utf-8 buffers ;-)

Hmm, maybe it will be bindat.el after all! Or I'll just keep the current
in-buffer string-bytes approach and wait for something to explode.





* Re: string-bytes and coding systems
  2017-03-10 16:37     ` Eric Abrahamsen
@ 2017-03-10 18:26       ` Stefan Monnier
  2017-03-10 18:56         ` Eric Abrahamsen
  0 siblings, 1 reply; 16+ messages in thread
From: Stefan Monnier @ 2017-03-10 18:26 UTC
  To: help-gnu-emacs

> Hmm, maybe it will be bindat.el after all! Or I'll just keep the current
> in-buffer string-bytes approach and wait for something to explode.

Basically, as long as the data is truly utf-8, string-bytes should give
you the correct answer.  The differences should only occur on characters
which are outside of utf-8 proper (used to represent invalid utf-8 data).
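
For anyone who would rather not rely on that assumption, here is a
rough sketch that measures each character after encoding instead of
using `string-bytes'. The names `my/utf8-octets' and
`my/split-by-octets' are hypothetical, and the sketch only splits the
text; it does not add any continuation prefix that a folding format
may require:

```elisp
;;; -*- lexical-binding: t; -*-
(defun my/utf8-octets (string)
  "Return the number of octets STRING occupies once encoded as UTF-8."
  (length (encode-coding-string string 'utf-8)))

(defun my/split-by-octets (string limit)
  "Split STRING into pieces of at most LIMIT UTF-8 octets each.
Pieces always end on a character boundary.  A single character
wider than LIMIT still gets a piece of its own."
  (let ((pieces '())
        (start 0)
        (len (length string)))
    (while (< start len)
      (let ((end start)
            (octets 0)
            (done nil))
        ;; Extend the piece one character at a time while it fits.
        (while (and (not done) (< end len))
          (let ((w (my/utf8-octets (substring string end (1+ end)))))
            (if (and (> end start) (> (+ octets w) limit))
                (setq done t)
              (setq octets (+ octets w)
                    end (1+ end)))))
        (push (substring string start end) pieces)
        (setq start end)))
    (nreverse pieces)))

;; Five 2-octet characters with a 4-octet limit:
(my/split-by-octets "\u00e9\u00e9\u00e9\u00e9\u00e9" 4)
;; => ("éé" "éé" "é")
```

Encoding one character at a time is slow for long input, but it keeps
the sketch short and makes the octet count independent of the internal
representation.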


        Stefan





* Re: string-bytes and coding systems
  2017-03-10 18:26       ` Stefan Monnier
@ 2017-03-10 18:56         ` Eric Abrahamsen
  2017-03-10 19:10           ` Stefan Monnier
  0 siblings, 1 reply; 16+ messages in thread
From: Eric Abrahamsen @ 2017-03-10 18:56 UTC
  To: help-gnu-emacs

Stefan Monnier <monnier@iro.umontreal.ca> writes:

>> Hmm, maybe it will be bindat.el after all! Or I'll just keep the current
>> in-buffer string-bytes approach and wait for something to explode.
>
> Basically, as long as the data is truly utf-8, string-bytes should give
> you the correct answer.  The differences should only occur on characters
> which are outside of utf-8 proper (used to represent invalid utf-8 data).

Sounds close enough for me.





* Re: string-bytes and coding systems
  2017-03-10 18:56         ` Eric Abrahamsen
@ 2017-03-10 19:10           ` Stefan Monnier
  0 siblings, 0 replies; 16+ messages in thread
From: Stefan Monnier @ 2017-03-10 19:10 UTC
  To: help-gnu-emacs

>>> Hmm, maybe it will be bindat.el after all! Or I'll just keep the current
>>> in-buffer string-bytes approach and wait for something to explode.
>> Basically, as long as the data is truly utf-8, string-bytes should give
>> you the correct answer.  The differences should only occur on characters
>> which are outside of utf-8 proper (used to represent invalid utf-8 data).
> Sounds close enough for me.

That's also my opinion.


        Stefan



