From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.org!not-for-mail
From: Eli Zaretskii <eliz@gnu.org>
Newsgroups: gmane.emacs.devel
Subject: Re: Unibyte characters, strings, and buffers
Date: Sat, 29 Mar 2014 13:44:57 +0300
Message-ID: <83ppl5e11y.fsf@gnu.org>
References: <831txozsqa.fsf@gnu.org> <jwv4n2j2141.fsf-monnier+emacs@gnu.org>
	<83ppl7y30l.fsf@gnu.org> <87r45nouvx.fsf@uwakimon.sk.tsukuba.ac.jp>
	<8361myyac6.fsf@gnu.org> <87a9capqfr.fsf@uwakimon.sk.tsukuba.ac.jp>
	<83eh1mfd09.fsf@gnu.org> <87ob0pnyt6.fsf@uwakimon.sk.tsukuba.ac.jp>
Reply-To: Eli Zaretskii <eliz@gnu.org>
NNTP-Posting-Host: plane.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: QUOTED-PRINTABLE
X-Trace: ger.gmane.org 1396089909 20339 80.91.229.3 (29 Mar 2014 10:45:09 GMT)
X-Complaints-To: usenet@ger.gmane.org
NNTP-Posting-Date: Sat, 29 Mar 2014 10:45:09 +0000 (UTC)
Cc: monnier@IRO.UMontreal.CA, emacs-devel@gnu.org
To: "Stephen J. Turnbull" <stephen@xemacs.org>
In-reply-to: <87ob0pnyt6.fsf@uwakimon.sk.tsukuba.ac.jp>
Xref: news.gmane.org gmane.emacs.devel:171128
Archived-At: <http://permalink.gmane.org/gmane.emacs.devel/171128>

> From: "Stephen J. Turnbull" <stephen@xemacs.org>
> Cc: monnier@IRO.UMontreal.CA,
>     emacs-devel@gnu.org
> Date: Sat, 29 Mar 2014 18:23:17 +0900
>
> Eli Zaretskii writes:
>
>  > This thread is about different issues.
>
> *sigh*  No, it's about unibyte being a premature pessimization.

*Sigh*, indeed.

>  > >  > Likewise examples from XEmacs, since the differences in this area
>  > >  > between Emacs and XEmacs are substantial, and that precludes useful
>  > >  > comparison.
>  > >
>  > > "It works fine" isn't useful information?
>  >
>  > No, because it describes a very different implementation.
>
> Not at all.  The implementation of multibyte buffers is very similar.

Says you.  But I cannot talk intelligently about that, because I don't
know the details.  And it sounds like you cannot talk about the issue
at hand, because you don't know the details of Emacs's handling of raw
bytes.  This discussion is about Emacs's unibyte buffers and strings,
so it won't yield any useful insights if you talk about the XEmacs
implementation without knowing the Emacs one, and I do the reverse.
That is why I asked not to bring the XEmacs implementation into this
discussion.

> What's different is that Emacs complifusticates matters by also having
> a separate implementation of unibyte buffers, and then basically
> making a union out of the two structures called "buffer".  XEmacs
> simply implements binary as a particular coding system in and out of
> multibyte buffers.

In Emacs, a coding system is only consulted when a buffer is read or
written.  If you also consult it when inserting text into it, or when
deciding whether 'downcase' should or shouldn't change the character
from the buffer, then you still have unibyte buffers in disguise, you
just call them "buffers whose coding system is 'binary'".

>  > Then I guess you will have to suggest how to implement this without
>  > unibyte buffers.
>
> No, I don't.  I already told you how to do it: nuke unibyte buffers
> and use iso-8859-1-unix as the binary codec.

"Codec" is XEmacs terminology, I don't understand what that means in
practice, when applied to Emacs.  If it means the same as coding
system, then how can an iso-8859-1-unix byte stream be decoded into,
say, Cyrillic characters (assuming the byte stream was actually UTF-8
encoded Cyrillic text)?
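
To make the question concrete, here is a minimal Python sketch.  It is
Python rather than Emacs Lisp, its 'latin-1' codec merely stands in
for iso-8859-1-unix, and the sample text is an arbitrary assumption.
It shows that decoding UTF-8 encoded Cyrillic as Latin-1 always
succeeds but yields one non-Cyrillic character per byte, and that
getting the Cyrillic back takes a second step that must know the real
encoding from somewhere else:

```python
# Illustration only: Python's "latin-1" codec stands in for iso-8859-1-unix.
cyrillic = "\u041f\u0440\u0438\u0432\u0435\u0442"   # six Cyrillic letters
wire_bytes = cyrillic.encode("utf-8")               # what arrives on the wire

# Decoding as Latin-1 never fails: each byte becomes exactly one character.
as_latin1 = wire_bytes.decode("latin-1")

# One character per byte, and none of them is Cyrillic.
assert len(as_latin1) == len(wire_bytes) == 12
assert all(ord(ch) < 256 for ch in as_latin1)
assert as_latin1 != cyrillic

# To obtain Cyrillic, first undo the Latin-1 step, then decode with the
# real encoding (UTF-8 here), which the program has to know independently.
recovered = as_latin1.encode("latin-1").decode("utf-8")
assert recovered == cyrillic
```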

> Then you're done, except for those applications that actually make
> the mistake of using unibyte text explicitly.

What does "explicitly" mean in this context?  Can you show an example
of "explicit" vs "implicit" use of unibyte text?

>  > >  > In such unibyte buffers, we need a way to represent raw bytes, which
>  > >  > are parts of as yet un-decoded byte sequences that represent encoded
>  > >  > characters.
>  > >
>  > > Again, I disagree.  Unibyte is a design mistake, and unnecessary.
>  >
>  > Then what do you call a buffer whose "text" is encoded?
>
> "Binary."

That's just a different name.  If "binary" buffers are treated
differently from any other kind, when processing characters from them,
then they are just unibyte buffers in disguise.

>  > > XEmacs proves it -- we use (essentially) the same code in many
>  > > applications (VM, Gnus for two mbox-using examples) as GNU Emacs does.
>  >
>  > I asked you not to bring XEmacs into the discussion, because I cannot
>  > talk intelligently about its implementation.  If you insist on doing
>  > that, this discussion is futile from my POV.
>
> The whole point here is that exactly what the XEmacs implementation is
> is *irrelevant*.  The point is that we implement the same API as GNU
> Emacs without unibyte buffers or the annoyances and incoherence that
> come with them.

Without knowing the details of the implementation, it is impossible to
talk about the merits and demerits of each design and implementation.
Therefore, bringing the XEmacs implementation into this discussion
without describing it in full detail does not help.  Excuse me, but I
don't believe that you have no problems at all in this area just
because you say so.  If you want that to count, you will have to delve
into the gory details, and then show why and how the problems are
avoided.

>  > > For heaven's sake, we've had `buffer-as-{multi,uni}-byte defined as
>  > > no-ops forever
>  >
>  > I wasn't talking about those functions.  I was talking about the need
>  > to have unibyte buffers and strings.
>
> There is no "need for unibyte."  You're simply afraid to throw it away.

I'm not afraid of anything of the kind.  This discussion was started
in order to try figuring out how to get rid of unibyte.  If you want
to help, offer specific technical solutions to specific issues we have
in Emacs.  Copying the XEmacs implementation, even if we were sure it
resolves the problem (and I'm not at all sure), is impractical.

>  > How is it different?  What would be the encoding of a buffer that
>  > contains raw bytes?
>
> Depends.  If it's uninterpreted bytes, "binary."  If those are
> undecodable bytes, they'll be the representation of raw bytes that
> occurred in an otherwise sane encoded stream, and the buffer's
> encoding will be the nominal encoding of that stream.  If you want to
> ensure sanity of output, then you will use an output encoding that
> errors on rawbytes, and a program that cleans up those rawbytes in a
> way appropriate for the application.  If you expect the next program
> in the pipeline to handle them, then you use a variant encoding that
> just encodes them back to the original undecodable rawbytes.

That's exactly what Emacs does, so I think you actually agree with what
I originally described as requirements, even though you said you
disagreed.
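
The behaviour described in the quoted paragraph can be sketched in
Python as an analogy.  This is not how Emacs represents raw bytes
internally; Python's "surrogateescape" error handler is only a
stand-in for a raw-byte representation, and the sample byte stream is
an assumption made up for the example.  An undecodable byte survives
decoding, a strict output encoding refuses it, and a variant encoding
writes the original byte back unchanged:

```python
# Illustration only: "surrogateescape" plays the role of a raw-byte
# representation; it is not Emacs's internal mechanism.
stream = b"caf\xc3\xa9 \xff ok"      # valid UTF-8 except for the byte 0xFF

text = stream.decode("utf-8", errors="surrogateescape")

# The undecodable byte is kept as a lone surrogate code point (U+DCFF).
raw = [ch for ch in text if 0xDC80 <= ord(ch) <= 0xDCFF]
assert raw == ["\udcff"]

# A strict output encoding signals an error on the raw byte ...
try:
    text.encode("utf-8")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
assert strict_failed

# ... while the variant encoding restores the original byte stream.
round_trip = text.encode("utf-8", errors="surrogateescape")
assert round_trip == stream
```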

>  > But that's ridiculous: a raw byte is just a single byte, so
>  > string-bytes should return a meaningful value for a string of such
>  > bytes.
>
> `string-bytes' should not exist.  As I wrote earlier:
>
>  > > You don't need `string-bytes' unless you've exposed internal
>  > > representation to Lisp, then you desperately need it to write correct
>  > > code (which some users won't be able to do anyway without help, cf.
>  > > https://groups.google.com/forum/#!topic/comp.emacs/IRKeteTzfbk).  So
>  > > *don't expose internal representation* (and the hammer marks on users'
>  > > foreheads will disappear in due time, and the headaches even faster!)
>  >
>  > How else would you know how many bytes will a string take on disk?
>
> How does `string-bytes' help?

It returns that information.

> You don't know what encoding will be used to write them

Yes, I do know: the buffer's coding system tells me.  And if text is
already encoded, then I know no additional encoding will be applied,
and whatever string-bytes tells me is it.
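
The difference between a character count and a byte count that this
exchange turns on can be sketched in Python.  The helper
`bytes_on_disk' below is hypothetical, not an Emacs function, and
Python's `len' on a bytes object corresponds only loosely to
string-bytes applied to already-encoded text.  The sample string is an
arbitrary assumption:

```python
def bytes_on_disk(text, coding_system):
    """Hypothetical helper: size of TEXT once encoded with CODING_SYSTEM."""
    return len(text.encode(coding_system))

decoded = "na\u00efve"                 # 5 characters, one of them non-ASCII
assert len(decoded) == 5

# The size on disk depends on the coding system used for writing.
assert bytes_on_disk(decoded, "latin-1") == 5
assert bytes_on_disk(decoded, "utf-8") == 6       # U+00EF takes two bytes
assert bytes_on_disk(decoded, "utf-16-le") == 10  # two bytes per character

# Already-encoded text is a byte sequence: no further encoding is
# applied, so its length is its size on disk.
encoded = decoded.encode("utf-8")
assert len(encoded) == 6
```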

> If you use iso-8859-1-unix as the coding system, then "bytes on the
> wire" == "characters in the string".  No problema, señor.

Not if you want to recode the string in, say, UTF-8.  When you shuffle
text from one buffer to another, Emacs does not track which encoding
that text came from, so the iso-8859-1-unix information is lost.

>  > >  > So here you have already at least 2 valid reasons
>  > >
>  > > No, *you* have them.  XEmacs works perfectly well without them, using
>  > > code written for Emacs.
>  >
>  > XEmacs also works "perfectly well" without bidi and other stuff.  That
>  > doesn't help at all in this discussion.
>
> You're right: because XEmacs doesn't handle bidi, it's irrelevant to
> this discussion.  Why did *you* bring it up?

To show how your way of arguing doesn't help.

> What is relevant is how to represent byte streams in Emacs.  The
> obvious non-unibyte way is a one-to-one mapping of bytes to Unicode
> characters.  It is *extremely* convenient if the first 128 of those
> bytes correspond to the ASCII coded character set, because so many
> wire protocols use ASCII "words" syntactically.  The other 128 don't
> matter much, so why not just use the extremely convenient Latin-1 set
> for them?

Because there are situations when the effect of this is not what Lisp
programs and users expect.  Case folding and case-insensitive search
are examples of this, although not the only ones.

>  > >  > If we want to get rid of unibyte, Someone(TM) should present a
>  > >  > complete practical solution to those two problems (and a few
>  > >  > others), otherwise, this whole discussion leads nowhere.
>  > >
>  > > Complete practical solution: "They are non-problems, forget about
>  > > them, and rewrite any code that implies you need to remember them."
>  >
>  > That's a slogan, not a solution.
>
> No, it is a precise high-level design for a solution.

We need a low-level design, not high-level.

>  > > Fortunately for me, I am *intimately* familiar with XEmacs internals,
>  > > and therefore RMS won't let me write this code for Emacs. :-)
>  >
>  > Then perhaps you shouldn't be part of this discussion.
>
> Since I've been invited to leave, I will.  My point is sufficiently
> well-made for open minds to deal with the details.

No, it isn't made at all.  I tried to explain above why I think so.

>  > > Which is precisely why we're having this thread.  If there were *no*
>  > > Lisp-visible unibyte buffers or strings, it couldn't possibly matter.
>  >
>  > And if I had $5M in my bank account, I'd probably be elsewhere
>  > enjoying myself.  IOW, how are "if there were no..." arguments useful?
>
> Because they point out that this thread wouldn't have happened with a
> different design.

But we _are_ with this design, and have been using it for the last 15
years.  Good luck believing that someone will come and replace the
existing design with something radically different.  There hasn't been
a comparable revolution in Emacs since 2001, so I very much doubt that
expecting another one any time soon is wise.  We don't even have
people aboard capable of making such changes.

The only practical way of advancing in this area is by low-level
changes that don't throw away the high-level design.  That is why
precisely describing the details of every proposal is so important:
without them, any proposal becomes impractical and thus not
interesting.

>  > This is not a discussion about whose model is better, Emacs or XEmacs.
>  > This is a discussion of whether and how can we remove unibyte buffers,
>  > strings, and characters from Emacs.  You must start by understanding
>  > how are they used in Emacs 24, and then suggest practical ways to
>  > change that.
>
> Well, I would have said "tell me about it"

And I would have replied "sorry, I have no time for that".  The
sources are there to be studied, and you are welcome to ask questions
about stuff you don't understand just by looking at the sources.

There cannot be any useful discussion of these matters without a
thorough understanding of how Emacs stores characters and raw bytes in
its buffers, and where and how the unibyte nuisance comes into play.

> I will say nothing you've said so far even hints at issues with
> simply removing the whole concept of unibyte.

I started by describing some basic requirements that lead to unibyte.
You refuse to even acknowledge those requirements.  How can we
continue a useful discussion when we don't even agree about the
basics?  To convince me, you need first to take my view of the issue,
something that you refuse to do.  I cannot begin to explain "the
issues" to you if you don't even agree with my starting point.

>  > In Emacs, 'insert' does some pretty subtle stuff with unibyte buffers
>  > and characters.  If you use it, you get what it does.
>
> And I'm telling you those subtleties are a *problem* that solves
> nothing that an Emacs without a unibyte concept can't handle fine.

You keep saying that, but without the details (which you cannot or
won't provide), these are just slogans with little technical value.

>  > If the buffer is not marked specially, how will I know to avoid
>  > [inserting non-Latin-1 characters in a "binary" buffer]?
>
> All experience with XEmacs says *you* (the human programmer) *won't*
> have any problem avoiding that.  As a programmer, if you're working
> with a binary protocol, you will be using binary buffers and strings,
> and byte-sized integers.  If you accidentally mix things up, you'll
> quickly get an encoding error on output (since the binary codec can't
> output non-Latin-1 Unicode characters).

On this level, it sounds like XEmacs does things exactly like Emacs
does, it just calls them differently.  If so, you have the same
problems; e.g., what will 'downcase-word' do in a "binary" buffer,
when it sees a "character" whose value is 192?
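
The value-192 case can be sketched in Python as an analogy.  This is
not Emacs code, `str.lower' only stands in for a character-level
downcase operation, and the payload bytes are an arbitrary assumption.
When bytes are mapped one-to-one onto Latin-1 characters, byte 192
becomes U+00C0, a capital letter, and downcasing silently changes the
data:

```python
payload = bytes([0x01, 0xC0, 0x41, 0xFF])   # arbitrary binary data; 0xC0 == 192

as_chars = payload.decode("latin-1")        # byte 192 becomes U+00C0
lowered = as_chars.lower()                  # U+00C0 -> U+00E0, "A" -> "a"
damaged = lowered.encode("latin-1")

# The non-ASCII byte was rewritten from 0xC0 to 0xE0.
assert damaged == bytes([0x01, 0xE0, 0x61, 0xFF])
assert damaged != payload

# Treating the same data as bytes changes only the ASCII letter.
bytes_lowered = payload.lower()
assert bytes_lowered == bytes([0x01, 0xC0, 0x61, 0xFF])
```

To leave byte 192 alone, the downcasing code has to know that this
particular text is binary, which is the distinction the unibyte flag
records.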

> It's just not a problem in practice, and that's not why unibyte was
> introduced in Emacs anyway.  Unibyte was introduced because some folks
> thought working with variable-width-encoded buffers was too
> inefficient so they wanted access to a flat buffer of bytes.  That's
> why buffer-as-{uni,multi}byte type punning was included.

Maybe so, but we are now 15 years after that, so history is only
marginally important.  What _is_ important is how to get rid of the
issues we have, without a complete redesign.

>  > And I still don't see how this is relevant.  You are describing a
>  > marginally valid use case, while I'm talking about use cases we meet
>  > every day, and which must be supported, e.g. when some Lisp wants to
>  > decode or encode text by hand.
>
> You use `encode-coding-region' and `decode-coding-region', same as you
> do now.  Do you seriously think that XEmacs doesn't support those use
> cases?

"Support" doesn't mean "there're no issues".  Emacs supports them as
well, you know.  That fact in itself doesn't help at all in this
discussion, because we all know (I hope) that at this "slogan level"
things have worked very well for quite some time.