* string-bytes and coding systems
From: Eric Abrahamsen @ 2017-03-08 23:17 UTC (permalink / raw)
To: help-gnu-emacs
I'm writing a function that's supposed to wrap too-long text lines; the
RFC says anything over 75 octets (excluding eol) needs to be wrapped,
but multibyte characters must not be split.
Everything seems to be working fine, but I want to make sure I'm not
making any dangerous assumptions about `string-bytes' and encoding.
I'm essentially taking the `string-bytes' of each line, and if it's too
long, popping characters off the end until it's fewer than 75 bytes.
My understanding/assumption is that `string-bytes' returns the number of
bytes according to Emacs' internal coding system, which is close enough
to utf-8 to make no difference. When this text gets written to file it
will also be encoded as utf-8, ergo testing string lengths with
`string-bytes' is going to always produce the right results in the final
file.
Have I understood things correctly?
Thanks!
Eric
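
As an illustration of the approach described in this message, a minimal
sketch could look like the following. It is not the poster's actual code:
the name `my-trim-to-octets' is hypothetical, and the limit is treated as
"at most 75", following the RFC wording quoted above.

```elisp
(defun my-trim-to-octets (line &optional limit)
  "Return the longest prefix of LINE whose `string-bytes' is at most LIMIT.
LIMIT defaults to 75.  Whole characters are removed from the end, so
a multibyte character is never split.  Note that `string-bytes'
counts bytes in Emacs' internal representation, not in any
particular file encoding."
  (let ((limit (or limit 75)))
    (while (> (string-bytes line) limit)
      ;; Drop one whole character (not one byte) from the end.
      (setq line (substring line 0 -1)))
    line))
```

Whether the internal byte count matches what ends up in the file is
exactly the question raised here, and the replies below address it.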
* Re: string-bytes and coding systems
From: hector @ 2017-03-09 7:46 UTC (permalink / raw)
To: Eric Abrahamsen; +Cc: help-gnu-emacs
On Wed, Mar 08, 2017 at 03:17:07PM -0800, Eric Abrahamsen wrote:
> I'm writing a function that's supposed to wrap too-long text lines; the
> RFC says anything over 75 octets (excluding eol) needs to be wrapped,
> but multibyte characters must not be split.
Why don't you just use fill-paragraph?
* Re: string-bytes and coding systems
From: Yuri Khan @ 2017-03-09 7:54 UTC (permalink / raw)
To: hector; +Cc: Eric Abrahamsen, help-gnu-emacs@gnu.org
On Thu, Mar 9, 2017 at 2:46 PM, hector <hectorlahoz@gmail.com> wrote:
> On Wed, Mar 08, 2017 at 03:17:07PM -0800, Eric Abrahamsen wrote:
>> I'm writing a function that's supposed to wrap too-long text lines; the
>> RFC says anything over 75 octets (excluding eol) needs to be wrapped,
>> but multibyte characters must not be split.
>
> Why don't you just use fill-paragraph?
Because that works in terms of characters, not octets?
* Re: string-bytes and coding systems
From: hector @ 2017-03-09 9:23 UTC (permalink / raw)
To: Yuri Khan; +Cc: help-gnu-emacs
On Thu, Mar 09, 2017 at 02:54:16PM +0700, Yuri Khan wrote:
> On Thu, Mar 9, 2017 at 2:46 PM, hector <hectorlahoz@gmail.com> wrote:
> > On Wed, Mar 08, 2017 at 03:17:07PM -0800, Eric Abrahamsen wrote:
> >> I'm writing a function that's supposed to wrap too-long text lines; the
> >> RFC says anything over 75 octets (excluding eol) needs to be wrapped,
> >> but multibyte characters must not be split.
> >
> > Why don't you just use fill-paragraph?
>
> Because that works in terms of characters, not octets?
Right.
It would be weird for a user command to work in terms of octets, wouldn't it?
* Re: string-bytes and coding systems
From: Eli Zaretskii @ 2017-03-09 16:01 UTC (permalink / raw)
To: help-gnu-emacs
> From: Eric Abrahamsen <eric@ericabrahamsen.net>
> Date: Wed, 08 Mar 2017 15:17:07 -0800
>
> I'm essentially taking the `string-bytes' of each line, and if it's too
> long, popping characters off the end until it's fewer than 75 bytes.
>
> My understanding/assumption is that `string-bytes' returns the number of
> bytes according to Emacs' internal coding system
Yes.
> which is close enough to utf-8 to make no difference.
No. The deviations from UTF-8 could be significant in some cases,
with some exotic characters and with raw bytes.
> When this text gets written to file it will also be encoded as
> utf-8, ergo testing string lengths with `string-bytes' is going to
> always produce the right results in the final file.
I suggest to use filepos-to-bufferpos to find where to break text into
lines.
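
One way to sidestep the mismatch between the internal representation and
UTF-8 is to count octets on the encoded text itself. The sketch below does
that with `encode-coding-string'; note that this is a different technique
from the `filepos-to-bufferpos' suggestion above, not an implementation of
it. The names `my-utf8-octets' and `my-fold-line' are hypothetical, and the
convention that continuation pieces begin with one space counted toward the
limit is an assumption about the RFC in question.

```elisp
(defun my-utf8-octets (string)
  "Return the number of octets STRING occupies when encoded as UTF-8."
  ;; `encode-coding-string' returns a unibyte string, so its length
  ;; is the octet count of the encoded text.
  (length (encode-coding-string string 'utf-8)))

(defun my-fold-line (line &optional limit)
  "Split LINE into pieces of at most LIMIT (default 75) UTF-8 octets.
Return a list of strings.  Every piece after the first starts with
a single space, which counts toward LIMIT (an assumed convention).
Characters are never split across pieces."
  (let ((limit (or limit 75))
        (pieces '())
        (current "")
        (used 0))
    (dolist (ch (string-to-list line))
      (let ((n (my-utf8-octets (string ch))))
        ;; Start a new piece if this character would not fit.
        (when (> (+ used n) limit)
          (push current pieces)
          (setq current " "
                used 1))
        (setq current (concat current (string ch))
              used (+ used n))))
    (push current pieces)
    (nreverse pieces)))
```

The repeated `concat' makes this quadratic in the line length, which should
be acceptable for short lines but is a deliberate simplification.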
* Re: string-bytes and coding systems
From: Eric Abrahamsen @ 2017-03-09 17:35 UTC (permalink / raw)
To: help-gnu-emacs
Eli Zaretskii <eliz@gnu.org> writes:
>> From: Eric Abrahamsen <eric@ericabrahamsen.net>
>> Date: Wed, 08 Mar 2017 15:17:07 -0800
>>
>> I'm essentially taking the `string-bytes' of each line, and if it's too
>> long, popping characters off the end until it's fewer than 75 bytes.
>>
>> My understanding/assumption is that `string-bytes' returns the number of
>> bytes according to Emacs' internal coding system
>
> Yes.
>
>> which is close enough to utf-8 to make no difference.
>
> No. The deviations from UTF-8 could be significant in some cases,
> with some exotic characters and with raw bytes.
Good to know.
>> When this text gets written to file it will also be encoded as
>> utf-8, ergo testing string lengths with `string-bytes' is going to
>> always produce the right results in the final file.
>
> I suggest to use filepos-to-bufferpos to find where to break text into
> lines.
I'll look into that. Thank you!
Eric
* Re: string-bytes and coding systems
From: Eric Abrahamsen @ 2017-03-09 17:36 UTC (permalink / raw)
To: help-gnu-emacs
hector <hectorlahoz@gmail.com> writes:
> On Thu, Mar 09, 2017 at 02:54:16PM +0700, Yuri Khan wrote:
>> On Thu, Mar 9, 2017 at 2:46 PM, hector <hectorlahoz@gmail.com> wrote:
>> > On Wed, Mar 08, 2017 at 03:17:07PM -0800, Eric Abrahamsen wrote:
>> >> I'm writing a function that's supposed to wrap too-long text lines; the
>> >> RFC says anything over 75 octets (excluding eol) needs to be wrapped,
>> >> but multibyte characters must not be split.
>> >
>> > Why don't you just use fill-paragraph?
>>
>> Because that works in terms of characters, not octets?
>
> Right.
> It would be weird for a user command to work in terms of octets, wouldn't it?
It's not really a user command, I'm exporting vCard objects to a file.
* Re: string-bytes and coding systems
From: Thien-Thi Nguyen @ 2017-03-10 4:39 UTC (permalink / raw)
To: help-gnu-emacs
() Eric Abrahamsen <eric@ericabrahamsen.net>
() Thu, 09 Mar 2017 09:36:03 -0800
   It's not really a user command, I'm exporting vCard objects
   to a file.
I'm interested in having bindat.el exercised more, hence the
tangential question: Do you think this could be implemented
using bindat.el?
--
Thien-Thi Nguyen -----------------------------------------------
(defun responsep (query)
(pcase (context query)
(`(technical ,ml) (correctp ml))
...)) 748E A0E8 1CB8 A748 9BFA
--------------------------------------- 6CE4 6703 2224 4C80 7502
* Re: string-bytes and coding systems
From: Alexis @ 2017-03-10 4:59 UTC (permalink / raw)
To: Eric Abrahamsen; +Cc: help-gnu-emacs
Eric Abrahamsen <eric@ericabrahamsen.net> writes:
> It's not really a user command, I'm exporting vCard objects to a file.
Might my `org-vcard` package be of any help to you?
https://github.com/flexibeast/org-vcard
It very much needs refactoring to separate out the vCard
input-and-output stuff from everything else - i've been struggling to
find the time and energy to do so - but maybe you'll find some of the
machinery useful?
Alexis.
* Re: string-bytes and coding systems
From: Eric Abrahamsen @ 2017-03-10 6:10 UTC (permalink / raw)
To: help-gnu-emacs
Alexis <flexibeast@gmail.com> writes:
> Eric Abrahamsen <eric@ericabrahamsen.net> writes:
>
>> It's not really a user command, I'm exporting vCard objects to a file.
>
> Might my `org-vcard` package be of any help to you?
>
> https://github.com/flexibeast/org-vcard
>
> It very much needs refactoring to separate out the vCard
> input-and-output stuff from everything else - i've been struggling to
> find the time and energy to do so - but maybe you'll find some of the
> machinery useful?
Yes! I didn't know this was out there -- it's always good to see other
people's approach to the same problems. Thanks!
It's true your package is closely tied to Org mode. I'm sure I'll still
be able to take some pointers from it, though. Thank god I'm not trying
to support vCard 2.1.
Here's where I'm at now. It does the escaping, but the line folding
(while implemented) isn't actually called; I'm still chewing on Eli's
pointers.
https://github.com/girzel/ebdb/blob/master/ebdb-vcard.el
Eric
* Re: string-bytes and coding systems
From: Eric Abrahamsen @ 2017-03-10 6:36 UTC (permalink / raw)
To: help-gnu-emacs
Thien-Thi Nguyen <ttn@gnu.org> writes:
> () Eric Abrahamsen <eric@ericabrahamsen.net>
> () Thu, 09 Mar 2017 09:36:03 -0800
>
> It's not really a user command, I'm exporting vCard objects
> to a file.
>
> I'm interested in having bindat.el exercised more, hence the
> tangential question: Do you think this could be implemented
> using bindat.el?
I'm ashamed to say that I understand binary only enough to see why I
have problems with encoding. I aspire to someday use `logior' in anger.
I don't understand this library!
* Re: string-bytes and coding systems
From: Stefan Monnier @ 2017-03-10 9:02 UTC (permalink / raw)
To: help-gnu-emacs
>> When this text gets written to file it will also be encoded as
>> utf-8, ergo testing string lengths with `string-bytes' is going to
>> always produce the right results in the final file.
> I suggest to use filepos-to-bufferpos to find where to break text into
> lines.
BTW, filepos-to-bufferpos uses string-bytes (or equivalent data, with
the same caveat for raw bytes) for utf-8 buffers ;-)
Stefan
* Re: string-bytes and coding systems
From: Eric Abrahamsen @ 2017-03-10 16:37 UTC (permalink / raw)
To: help-gnu-emacs
Stefan Monnier <monnier@iro.umontreal.ca> writes:
>>> When this text gets written to file it will also be encoded as
>>> utf-8, ergo testing string lengths with `string-bytes' is going to
>>> always produce the right results in the final file.
>> I suggest to use filepos-to-bufferpos to find where to break text into
>> lines.
>
> BTW, filepos-to-bufferpos uses string-bytes (or equivalent data, with
> the same caveat for raw bytes) for utf-8 buffers ;-)
Hmm, maybe it will be bindat.el after all! Or I'll just keep the current
in-buffer string-bytes approach and wait for something to explode.
* Re: string-bytes and coding systems
From: Stefan Monnier @ 2017-03-10 18:26 UTC (permalink / raw)
To: help-gnu-emacs
> Hmm, maybe it will be bindat.el after all! Or I'll just keep the current
> in-buffer string-bytes approach and wait for something to explode.
Basically, as long as the data is truly utf-8, string-bytes should give
you the correct answer. The differences should only occur on characters
which are outside of utf-8 proper (used to represent invalid utf-8 data).
Stefan
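
The distinction drawn in this message can be checked directly. The sketch
below compares the internal byte count with the UTF-8 encoded length; the
helper name `my-octet-counts' is hypothetical, and the raw byte is built
with `string-to-multibyte' purely for illustration.

```elisp
(defun my-octet-counts (string)
  "Return (INTERNAL . ENCODED) byte counts for STRING.
INTERNAL is `string-bytes'; ENCODED is the length of STRING
after encoding it as UTF-8."
  (cons (string-bytes string)
        (length (encode-coding-string string 'utf-8))))

;; Ordinary Unicode text: the two counts agree.
(my-octet-counts "na\u00efve caf\u00e9")        ; (12 . 12)

;; A raw byte (used to represent invalid UTF-8 data) takes two bytes
;; internally but is written out as a single octet.
(my-octet-counts (string-to-multibyte "\x80"))  ; (2 . 1)
```

For such raw bytes `string-bytes' overestimates the encoded length, so a
line measured that way would be folded early rather than left too long.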
* Re: string-bytes and coding systems
From: Eric Abrahamsen @ 2017-03-10 18:56 UTC (permalink / raw)
To: help-gnu-emacs
Stefan Monnier <monnier@iro.umontreal.ca> writes:
>> Hmm, maybe it will be bindat.el after all! Or I'll just keep the current
>> in-buffer string-bytes approach and wait for something to explode.
>
> Basically, as long as the data is truly utf-8, string-bytes should give
> you the correct answer. The differences should only occur on characters
> which are outside of utf-8 proper (used to represent invalid utf-8 data).
Sounds close enough for me.
* Re: string-bytes and coding systems
From: Stefan Monnier @ 2017-03-10 19:10 UTC (permalink / raw)
To: help-gnu-emacs
>>> Hmm, maybe it will be bindat.el after all! Or I'll just keep the current
>>> in-buffer string-bytes approach and wait for something to explode.
>> Basically, as long as the data is truly utf-8, string-bytes should give
>> you the correct answer. The differences should only occur on characters
>> which are outside of utf-8 proper (used to represent invalid utf-8 data).
> Sounds close enough for me.
That's also my opinion.
Stefan
Thread overview: 16+ messages
2017-03-08 23:17 string-bytes and coding systems Eric Abrahamsen
2017-03-09 7:46 ` hector
2017-03-09 7:54   ` Yuri Khan
2017-03-09 9:23     ` hector
2017-03-09 17:36       ` Eric Abrahamsen
2017-03-10 4:39         ` Thien-Thi Nguyen
2017-03-10 6:36           ` Eric Abrahamsen
2017-03-10 4:59         ` Alexis
2017-03-10 6:10           ` Eric Abrahamsen
2017-03-09 16:01 ` Eli Zaretskii
2017-03-09 17:35   ` Eric Abrahamsen
2017-03-10 9:02   ` Stefan Monnier
2017-03-10 16:37     ` Eric Abrahamsen
2017-03-10 18:26       ` Stefan Monnier
2017-03-10 18:56         ` Eric Abrahamsen
2017-03-10 19:10           ` Stefan Monnier