Compiling Elisp to a native code with a GCC plugin

unofficial mirror of emacs-devel@gnu.org 
 help / color / mirror / code / Atom feed

* Compiling Elisp to a native code with a GCC plugin
@ 2010-09-14 19:12 Wojciech Meyer
  2010-09-14 19:32 ` Tom Tromey
  0 siblings, 1 reply; 97+ messages in thread
From: Wojciech Meyer @ 2010-09-14 19:12 UTC (permalink / raw)
  To: emacs-devel

Hi,

Recent version of GCC allow developing plugins.  That would solve JITing
intermediate representation (e.g. current bytecode) to the native code
across different platforms.  What do you think about it? Would that be
possible or there might be some problems that would make it impossible
(I am talking especially about dynamic scoping and GC interaction).

Wojciech

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Compiling Elisp to a native code with a GCC plugin
  2010-09-14 19:12 Compiling Elisp to a native code with a GCC plugin Wojciech Meyer
@ 2010-09-14 19:32 ` Tom Tromey
  2010-09-14 19:45   ` Wojciech Meyer
  2010-09-15 10:47   ` Leo
  0 siblings, 2 replies; 97+ messages in thread
From: Tom Tromey @ 2010-09-14 19:32 UTC (permalink / raw)
  To: Wojciech Meyer; +Cc: emacs-devel

>>>>> "Wojciech" == Wojciech Meyer <wojciech.meyer@googlemail.com> writes:

Wojciech> Recent version of GCC allow developing plugins.  That would
Wojciech> solve JITing intermediate representation (e.g. current
Wojciech> bytecode) to the native code across different platforms.  What
Wojciech> do you think about it? Would that be possible or there might
Wojciech> be some problems that would make it impossible (I am talking
Wojciech> especially about dynamic scoping and GC interaction).

I think you can compile elisp without needing a plugin.  Just convert
the bytecode to C, and compile that.  This is actually quite easy.

However, it is not very likely to result in a performance improvement
right now.  A similar idea was tried and did not work out:

    http://www.mundell.ukfsn.org/native/

I think this idea might be worth revisiting when lexbind is merged in,
emphasis on "might".  E.g., it seems to me that this approach might work
ok for the recently-discussed range-map code.

As I recall, in my profiles, the GC and the regexp matcher were more
costly the bytecode interpreter (though of course this is
workload-dependent).  If you are interested in performance, I suggest
doing your own profiles and starting there.

Tom

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Compiling Elisp to a native code with a GCC plugin
  2010-09-14 19:32 ` Tom Tromey
@ 2010-09-14 19:45   ` Wojciech Meyer
  2010-09-14 20:17     ` Lars Magne Ingebrigtsen
  2010-09-14 20:44     ` Tom Tromey
  2010-09-15 10:47   ` Leo
  1 sibling, 2 replies; 97+ messages in thread
From: Wojciech Meyer @ 2010-09-14 19:45 UTC (permalink / raw)
  To: Tom Tromey; +Cc: Wojciech Meyer, emacs-devel

Tom Tromey <tromey@redhat.com> writes:

> I think you can compile elisp without needing a plugin.  Just convert
> the bytecode to C, and compile that.  This is actually quite easy.

Yes, but you can't JIT it in memory, and that also builds dependency on
a C compiler.

>
> However, it is not very likely to result in a performance improvement
> right now.  A similar idea was tried and did not work out:
>
>     http://www.mundell.ukfsn.org/native/
>
> I think this idea might be worth revisiting when lexbind is merged in,
> emphasis on "might".  E.g., it seems to me that this approach might work
> ok for the recently-discussed range-map code.

Well Elisp nature is dynamic, plus dynamic scoping makes it hard to
compile, but somewhat C Lispy code *can* work faster.

>
> As I recall, in my profiles, the GC and the regexp matcher were more
> costly the bytecode interpreter (though of course this is
> workload-dependent).  If you are interested in performance, I suggest
> doing your own profiles and starting there.

Mark and sweep is no good, it would be so good if we had generational
GC... :(

>
> Tom

Thanks,
Wojciech



^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Compiling Elisp to a native code with a GCC plugin
  2010-09-14 19:45   ` Wojciech Meyer
@ 2010-09-14 20:17     ` Lars Magne Ingebrigtsen
  2010-09-14 20:52       ` Wojciech Meyer
  2010-09-14 20:55       ` Tom Tromey
  2010-09-14 20:44     ` Tom Tromey
  1 sibling, 2 replies; 97+ messages in thread
From: Lars Magne Ingebrigtsen @ 2010-09-14 20:17 UTC (permalink / raw)
  To: emacs-devel

Wojciech Meyer <wojciech.meyer@googlemail.com> writes:

> Well Elisp nature is dynamic, plus dynamic scoping makes it hard to
> compile, but somewhat C Lispy code *can* work faster.

Sure.  Compiling to native code will probably yield some benefits, but I
think we tend to overestimate the benefits.

To take a random example from code I just wrote (part of
`gnus-range-nconcat'), which works on lists and numbers and stuff, and
is as low-level as Emacs Lisp code gets:

      (when (numberp (car last))
	(setcar last (cons (car last) (car last))))
      (if (= (1+ (cdar last)) (caar range))
	  (progn
	    (setcdr (car last) (cdar range))
	    (setcdr last (cdr range))))

Just imagine what that would be in native code, as opposed to byte code.
In either case, it'd just be a lot of calls to Fcar, Fcons, Fsetcar and
so on.  Would the byte-interpreter call those functions a lot slower
than native code would?  I kinda doubt it.

Now, the same code in native C would of course be a lot faster, because
you'd use other data types, and you wouldn't do the code by straight
list manipulation at all.  Possibly.

-- 
(domestic pets only, the antidote for overdose, milk.)
  larsi@gnus.org * Lars Magne Ingebrigtsen

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Compiling Elisp to a native code with a GCC plugin
  2010-09-14 20:17     ` Lars Magne Ingebrigtsen
@ 2010-09-14 20:52       ` Wojciech Meyer
  2010-09-14 20:55       ` Tom Tromey
  1 sibling, 0 replies; 97+ messages in thread
From: Wojciech Meyer @ 2010-09-14 20:52 UTC (permalink / raw)
  To: emacs-devel

Lars Magne Ingebrigtsen <larsi@gnus.org> writes:

> To take a random example from code I just wrote (part of
> `gnus-range-nconcat'), which works on lists and numbers and stuff, and
> is as low-level as Emacs Lisp code gets:
>
>       (when (numberp (car last))
> 	(setcar last (cons (car last) (car last))))
>       (if (= (1+ (cdar last)) (caar range))
> 	  (progn
> 	    (setcdr (car last) (cdar range))
> 	    (setcdr last (cdr range))))
>
> Just imagine what that would be in native code, as opposed to byte code.
> In either case, it'd just be a lot of calls to Fcar, Fcons, Fsetcar and
> so on.  Would the byte-interpreter call those functions a lot slower
> than native code would?  I kinda doubt it.

Some of the functions will be in-lined, some of the data pointers will be
loaded to registers, some of the calls will be specialised against
constants, some of the expressions simplified, the flat code peep-holed
etc. So no, it is not direct translation even at the level of byte-code,
and compiler frameworks (gcc & llvm) are getting better and better at
optimising at high-level and low-level.

Wojciech



^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Compiling Elisp to a native code with a GCC plugin
  2010-09-14 20:17     ` Lars Magne Ingebrigtsen
  2010-09-14 20:52       ` Wojciech Meyer
@ 2010-09-14 20:55       ` Tom Tromey
  2010-09-14 21:05         ` Wojciech Meyer
  1 sibling, 1 reply; 97+ messages in thread
From: Tom Tromey @ 2010-09-14 20:55 UTC (permalink / raw)
  To: emacs-devel

>>>>> "Lars" == Lars Magne Ingebrigtsen <larsi@gnus.org> writes:

Lars> Just imagine what that would be in native code, as opposed to byte code.
Lars> In either case, it'd just be a lot of calls to Fcar, Fcons, Fsetcar and
Lars> so on.  Would the byte-interpreter call those functions a lot slower
Lars> than native code would?  I kinda doubt it.

Actually, in this case it might be faster, since there are special
opcodes like Bcar, Bcons, Bsetcar, etc.  Whether it is "faster enough"
is hard to guess; lexbind comes into play because with lexbind, the
various local bindings can just be C local variables, as opposed to much
more expensive elisp bindings.

Lars> Now, the same code in native C would of course be a lot faster, because
Lars> you'd use other data types, and you wouldn't do the code by straight
Lars> list manipulation at all.  Possibly.

IIUC, yeah, if you're willing to have a new pure-C data structure with a
new read and write syntax, you can do even better.

Tom

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Compiling Elisp to a native code with a GCC plugin
  2010-09-14 20:55       ` Tom Tromey
@ 2010-09-14 21:05         ` Wojciech Meyer
  0 siblings, 0 replies; 97+ messages in thread
From: Wojciech Meyer @ 2010-09-14 21:05 UTC (permalink / raw)
  To: Tom Tromey; +Cc: emacs-devel

Tom Tromey <tromey@redhat.com> writes:

> is hard to guess; lexbind comes into play because with lexbind, the
> various local bindings can just be C local variables, as opposed to much
> more expensive elisp bindings.

yep, because it just a run-time data structure, and no longer a pure stack
frame. So somewhat accessing directly memory needs to be much less
expensive.

Wojciech



^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Compiling Elisp to a native code with a GCC plugin
  2010-09-14 19:45   ` Wojciech Meyer
  2010-09-14 20:17     ` Lars Magne Ingebrigtsen
@ 2010-09-14 20:44     ` Tom Tromey
  2010-09-14 21:00       ` Wojciech Meyer
  1 sibling, 1 reply; 97+ messages in thread
From: Tom Tromey @ 2010-09-14 20:44 UTC (permalink / raw)
  To: Wojciech Meyer; +Cc: emacs-devel

>>>>> "Wojciech" == Wojciech Meyer <wojciech.meyer@googlemail.com> writes:

Wojciech> Mark and sweep is no good, it would be so good if we had generational
Wojciech> GC... :(

It could be done.  It just requires someone willing to do the work.

Tom



^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Compiling Elisp to a native code with a GCC plugin
  2010-09-14 20:44     ` Tom Tromey
@ 2010-09-14 21:00       ` Wojciech Meyer
  2010-09-14 21:16         ` Tom Tromey
  0 siblings, 1 reply; 97+ messages in thread
From: Wojciech Meyer @ 2010-09-14 21:00 UTC (permalink / raw)
  To: Tom Tromey; +Cc: Wojciech Meyer, emacs-devel

Tom Tromey <tromey@redhat.com> writes:

>>>>>> "Wojciech" == Wojciech Meyer <wojciech.meyer@googlemail.com> writes:
>
> Wojciech> Mark and sweep is no good, it would be so good if we had generational
> Wojciech> GC... :(
>
> It could be done.  It just requires someone willing to do the work.

I know. I could get my old sources of generational garbage collector, to
work. However it is a daunting job (the worse I could imagine, garbage
collectors are nasty), plugging and debugging a new garbage collector to
such huge and esoteric (I am sure people that who've been working on
Emacs for years will not take this words badly and understand straight
away what I am (a newbie) talking about) project like Emacs. However I
might try to experiment with it (however unfortunately I am not that
self confident about it ;) ).

> Tom

Wojciech

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Compiling Elisp to a native code with a GCC plugin
  2010-09-14 21:00       ` Wojciech Meyer
@ 2010-09-14 21:16         ` Tom Tromey
  2010-09-14 21:29           ` Wojciech Meyer
  2010-09-14 23:13           ` Thomas Lord
  0 siblings, 2 replies; 97+ messages in thread
From: Tom Tromey @ 2010-09-14 21:16 UTC (permalink / raw)
  To: Wojciech Meyer; +Cc: emacs-devel

>>>>> "Wojciech" == Wojciech Meyer <wojciech.meyer@googlemail.com> writes:

Tom> It could be done.  It just requires someone willing to do the work.

Wojciech> I know. I could get my old sources of generational garbage
Wojciech> collector, to work. However it is a daunting job (the worse I
Wojciech> could imagine, garbage collectors are nasty), plugging and
Wojciech> debugging a new garbage collector to such huge and esoteric (I
Wojciech> am sure people that who've been working on Emacs for years
Wojciech> will not take this words badly and understand straight away
Wojciech> what I am (a newbie) talking about) project like
Wojciech> Emacs. However I might try to experiment with it (however
Wojciech> unfortunately I am not that self confident about it ;) ).

It is always ok to ask for help.

The current collector is very simple to understand.  If you read
alloc.c, and look through the data structures representing lisp objects
(in lisp.h), you will have a pretty good idea of what is going on.

FWIW, I looked at writing an incremental collector for Emacs.  I was
primarily interested in using software write barriers... this turns out
to be hard because there is a lot of code in Emacs of the form:

   FIELD_ACCESSOR (object) = value;

... which for a software barrier has to be converted to:

   SET_FIELD_ACCESSOR (object, value);

(There are other bad things, too, like passing around a Lisp_Object*
that points to the contents of a vector.)

So, lots of grunge work, just to get the point where you could start
actually working on the GC.  I would look at automated rewriting to
make this work -- that worked out great on the concurrent branch.

There was a more real attempt based on the Boehm GC.  I think the bits
from that are still on a branch.  This GC has a generational mode that,
IIRC, is based on memory protection bits.

Tom

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Compiling Elisp to a native code with a GCC plugin
  2010-09-14 21:16         ` Tom Tromey
@ 2010-09-14 21:29           ` Wojciech Meyer
  2010-09-14 21:59             ` Tom Tromey
  2010-09-14 23:13           ` Thomas Lord
  1 sibling, 1 reply; 97+ messages in thread
From: Wojciech Meyer @ 2010-09-14 21:29 UTC (permalink / raw)
  To: Tom Tromey; +Cc: Wojciech Meyer, emacs-devel

Tom Tromey <tromey@redhat.com> writes:

>>>>>> "Wojciech" == Wojciech Meyer <wojciech.meyer@googlemail.com> writes:
>
> Tom> It could be done.  It just requires someone willing to do the work.
>
> Wojciech> I know. I could get my old sources of generational garbage
> Wojciech> collector, to work. However it is a daunting job (the worse I
> Wojciech> could imagine, garbage collectors are nasty), plugging and
> Wojciech> debugging a new garbage collector to such huge and esoteric (I
> Wojciech> am sure people that who've been working on Emacs for years
> Wojciech> will not take this words badly and understand straight away
> Wojciech> what I am (a newbie) talking about) project like
> Wojciech> Emacs. However I might try to experiment with it (however
> Wojciech> unfortunately I am not that self confident about it ;) ).
>
> It is always ok to ask for help.

Thanks, I will keep in mind it.

>
> The current collector is very simple to understand.  If you read
> alloc.c, and look through the data structures representing lisp objects
> (in lisp.h), you will have a pretty good idea of what is going on.

It's open already :).

>
>
> FWIW, I looked at writing an incremental collector for Emacs.  I was
> primarily interested in using software write barriers... this turns out
> to be hard because there is a lot of code in Emacs of the form:
>
>    FIELD_ACCESSOR (object) = value;
>
> ... which for a software barrier has to be converted to:
>
>    SET_FIELD_ACCESSOR (object, value);

Yep, we would need barriers for the second heap. For young heap it is OK
to just scan it.

>
> (There are other bad things, too, like passing around a Lisp_Object*
> that points to the contents of a vector.)
>
> So, lots of grunge work, just to get the point where you could start
> actually working on the GC.  I would look at automated rewriting to
> make this work -- that worked out great on the concurrent branch.

Maybe that work should be actually done even without thinking currently
about GC. AFAIR MT Emacs rewriting was in Elisp, ideally maybe using GCC
would be better at some point.

>
>
> There was a more real attempt based on the Boehm GC.  I think the bits
> from that are still on a branch.  This GC has a generational mode that,
> IIRC, is based on memory protection bits.
>

Conservative Boehm will not bring much gain (I would think certainly
loss) of performance. However I didn't know about marking pointers based
on memory protection bits and that sounds interesting. Need to look at
it. (however I am fully convinced that custom GC would be a superior
option).

> Tom

Wojciech



^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Compiling Elisp to a native code with a GCC plugin
  2010-09-14 21:29           ` Wojciech Meyer
@ 2010-09-14 21:59             ` Tom Tromey
  2010-09-14 22:37               ` Wojciech Meyer
  2010-09-14 22:49               ` Wojciech Meyer
  0 siblings, 2 replies; 97+ messages in thread
From: Tom Tromey @ 2010-09-14 21:59 UTC (permalink / raw)
  To: Wojciech Meyer; +Cc: emacs-devel

>>>>> "Wojciech" == Wojciech Meyer <wojciech.meyer@googlemail.com> writes:

Tom> So, lots of grunge work, just to get the point where you could start
Tom> actually working on the GC.  I would look at automated rewriting to
Tom> make this work -- that worked out great on the concurrent branch.

Wojciech> Maybe that work should be actually done even without thinking
Wojciech> currently about GC. AFAIR MT Emacs rewriting was in Elisp,
Wojciech> ideally maybe using GCC would be better at some point.

You would have to hack GCC a little bit, because most of the code
locations you want to change arise from macro expansion, and GCC does
not keep all that information.  (Though there's a WIP patch for this.)

Maybe it could be done more simply using a simple parser in elisp that
recognizes just the needed forms.  Or maybe something based on clang.

This would get you most of the way there, though there are still some
bad things you have to fix up by hand.

For the concurrency stuff, we did two kinds of automated rewriting.

One was just pure elisp that searched the .c for DEFVAR_LISP and then
made various changes.

The other one modified the source (in a compile-breaking way), then ran
the compiler, then visited each error to perform a rewrite.  This
approach might also work for the GC problem, I am not certain.

These scripts are both in src/ on the concurrency branch.

One problem with any compiler-based approach is that it only works on
the sources it sees.  That is, the not-taken #if branches won't get
rewritten.  This argues for trying some kind of custom parser.

Another problem we ran into is that this approach doesn't work if the
problem code itself appears in a macro.  There were a few spots that we
had to fix by hand -- no big deal, the automation is still worthwhile
even if it only does 85% of the work.

My advice is to try to do this bulk rewriting work on head, so that it
doesn't rot.  I think that's been a problem for the concurrency work :-(

Tom

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Compiling Elisp to a native code with a GCC plugin
  2010-09-14 21:59             ` Tom Tromey
@ 2010-09-14 22:37               ` Wojciech Meyer
  2010-09-14 22:55                 ` Tom Tromey
  2010-09-14 22:49               ` Wojciech Meyer
  1 sibling, 1 reply; 97+ messages in thread
From: Wojciech Meyer @ 2010-09-14 22:37 UTC (permalink / raw)
  To: Tom Tromey; +Cc: Wojciech Meyer, emacs-devel

Tom Tromey <tromey@redhat.com> writes:

>>>>>> "Wojciech" == Wojciech Meyer <wojciech.meyer@googlemail.com> writes:
>
> Tom> So, lots of grunge work, just to get the point where you could start
> Tom> actually working on the GC.  I would look at automated rewriting to
> Tom> make this work -- that worked out great on the concurrent branch.
>
> Wojciech> Maybe that work should be actually done even without thinking
> Wojciech> currently about GC. AFAIR MT Emacs rewriting was in Elisp,
> Wojciech> ideally maybe using GCC would be better at some point.
>
> You would have to hack GCC a little bit, because most of the code
> locations you want to change arise from macro expansion, and GCC does
> not keep all that information.  (Though there's a WIP patch for this.)

Yes, I am aware about impreciseness of this. Currently I may not think
about this for some other unrelated reasons as well...

>
> Maybe it could be done more simply using a simple parser in elisp that
> recognizes just the needed forms.  Or maybe something based on clang.
>
> This would get you most of the way there, though there are still some
> bad things you have to fix up by hand.
>
>
> For the concurrency stuff, we did two kinds of automated rewriting.

If you could point me out with the tools you used for this job, I would
be grateful, any points to git?

>
> One was just pure elisp that searched the .c for DEFVAR_LISP and then
> made various changes.
>
> The other one modified the source (in a compile-breaking way), then ran
> the compiler, then visited each error to perform a rewrite.  This
> approach might also work for the GC problem, I am not certain.

That is clever and cheap to do, and it is worth to try (and it looks a
bit like humans do re-factoring). However clang error messages are
currently are more precise than GCC (yes, we can match-replace regex,
and it will work fine in most cases).

>
> These scripts are both in src/ on the concurrency branch.
>
> One problem with any compiler-based approach is that it only works on
> the sources it sees.  That is, the not-taken #if branches won't get
> rewritten.  This argues for trying some kind of custom parser.

Yes, this is a major problem, different configurations, different
systems, and you are never sure if it will not break something, on some
machine (assuming that the changes are required to generate correct code).

>
> Another problem we ran into is that this approach doesn't work if the
> problem code itself appears in a macro.  There were a few spots that we
> had to fix by hand -- no big deal, the automation is still worthwhile
> even if it only does 85% of the work.

Definitely it is worthwhile, however I need to know what to rewrite at
first...

>
>
> My advice is to try to do this bulk rewriting work on head, so that it
> doesn't rot.  I think that's been a problem for the concurrency work
> :-(

Any chances to get it back to life? It would be nice to push it back...
If you would like to get it back, I would volunteer with any help.

If the maintainers are happy to accept patches with incremental
improvements then it is OK, however I still think in terms `I want - but
it will be hard and even nobody started it before'. If there will be any
progress on this I will just let everybody know. (anyway starting such
project is even OK for just *learning* about internals).

Shall we setup a public repo somewhere then?

I can commit there my generational gc as a sub-module, straight away.
(however it is not the best starting point, as was said).

>
> Tom

Thanks,
Wojciech

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Compiling Elisp to a native code with a GCC plugin
  2010-09-14 22:37               ` Wojciech Meyer
@ 2010-09-14 22:55                 ` Tom Tromey
  2010-09-14 23:33                   ` Wojciech Meyer
  0 siblings, 1 reply; 97+ messages in thread
From: Tom Tromey @ 2010-09-14 22:55 UTC (permalink / raw)
  To: Wojciech Meyer; +Cc: emacs-devel

Tom> For the concurrency stuff, we did two kinds of automated rewriting.

Wojciech> If you could point me out with the tools you used for this
Wojciech> job, I would be grateful, any points to git?

It is in the concurrency branch in Emacs bzr.  I assume this is mirrored
in git, but I'm not sure.

I can email you the two .el scripts if you like, plus the little GCC
patch I had to use to get the right location for the -> tokens.  Just
let me know.

If you check out the concurrency branch, the files are
src/hack-buffer-objfwd.el and src/rewrite-globals.el.

Wojciech> However clang error messages are currently are more precise
Wojciech> than GCC (yes, we can match-replace regex, and it will work
Wojciech> fine in most cases).

Recent versions of GCC are a lot better about marking the correct token
when emitting an error.  Red Hat put some work into this.

What hack-buffer-objfwd.el actually does is go to the location of the
error, then go backwards over sexps doing some pattern matching to see
where the left-hand-side of the -> operator starts.  Then it rewrites.
This is rather hacky and IIRC there are some Emacs-specific heuristics
in there.

Tom> My advice is to try to do this bulk rewriting work on head, so that it
Tom> doesn't rot.  I think that's been a problem for the concurrency work
Tom> :-(

Wojciech> Any chances to get it back to life?

I have zero (really negative) free time.  But yes, it can be
resurrected.  One starting point would be doing a merge from trunk.
Giuseppe did this but ran into problems... I still don't really know the
details, just that things changed on trunk in some conflicting way.

Wojciech> Shall we setup a public repo somewhere then?

We had one on gitorious.  But I think if you are going to try to work on
trunk you should just use bzr.  Or at least use the semi-official git
mirror when preparing patches.

Tom

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Compiling Elisp to a native code with a GCC plugin
  2010-09-14 22:55                 ` Tom Tromey
@ 2010-09-14 23:33                   ` Wojciech Meyer
  2010-09-15  1:38                     ` Tom Tromey
  0 siblings, 1 reply; 97+ messages in thread
From: Wojciech Meyer @ 2010-09-14 23:33 UTC (permalink / raw)
  To: Tom Tromey; +Cc: emacs-devel

Tom Tromey <tromey@redhat.com> writes:

> Tom> For the concurrency stuff, we did two kinds of automated rewriting.
>
> Wojciech> If you could point me out with the tools you used for this
> Wojciech> job, I would be grateful, any points to git?
>
> It is in the concurrency branch in Emacs bzr.  I assume this is mirrored
> in git, but I'm not sure.
>
> I can email you the two .el scripts if you like, plus the little GCC
> patch I had to use to get the right location for the -> tokens.  Just
> let me know.

Yes please, I would like to take a look, thanks.

>
> Wojciech> Any chances to get it back to life?
>
> I have zero (really negative) free time.  But yes, it can be
> resurrected.  One starting point would be doing a merge from trunk.
> Giuseppe did this but ran into problems... I still don't really know
> the details, just that things changed on trunk in some conflicting
> way.

Yes, I saw some of the work that had been done, and posts on ML.

>
> Wojciech> Shall we setup a public repo somewhere then?
>
> We had one on gitorious.  But I think if you are going to try to work
> on trunk you should just use bzr.  Or at least use the semi-official
> git mirror when preparing patches.

I am just not sure how efficiently and reliably branches work in bzr
(I've managed to screw up some of my work once with bzr), and I am not
sure how reliable are git mirrors. I quite like git, however there is a
cost of troubles with integration with bzr. On other hand I find forking
<-> merging unacceptable. I will try to mock something up, in the
meantime.

BTW: If somebody would like to enlighten me on how reliably mirrored git
works with the Emacs source tree, I would be grateful. Thanks.

Wojciech

PS: sorry this time sent from gmail.

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Compiling Elisp to a native code with a GCC plugin
  2010-09-14 23:33                   ` Wojciech Meyer
@ 2010-09-15  1:38                     ` Tom Tromey
  0 siblings, 0 replies; 97+ messages in thread
From: Tom Tromey @ 2010-09-15  1:38 UTC (permalink / raw)
  To: Wojciech Meyer; +Cc: emacs-devel

[-- Attachment #1: Type: text/plain, Size: 1580 bytes --]

Wojciech> Yes please, I would like to take a look, thanks.

I attached the scripts.  They have a few comments, but probably not
enough, given that they are pretty much one-off hacks.  (Though,
funnily, today I'm going to repurpose one to rewrite gdb...)

The appended patch is needed to get GCC to emit error locations on the
`->' token when the token appears in the arguments to a macro.

Wojciech> I am just not sure how efficiently and reliably branches work
Wojciech> in bzr (I've managed to screw up some of my work once with
Wojciech> bzr), and I am not sure how reliable are git mirrors. I quite
Wojciech> like git, however there is a cost of troubles with integration
Wojciech> with bzr. On other hand I find forking <-> merging
Wojciech> unacceptable. I will try to mock something up, in the
Wojciech> meantime.

Yeah, bzr is a pain compared to git.  But, we're stuck with it.

Wojciech> BTW: If somebody would like to enlighten me on how reliably
Wojciech> mirrored git works with the Emacs source tree, I would be
Wojciech> grateful. Thanks.

I think the mirror is updated regularly.  I'm not using it myself, but I
gather it works ok.

Tom

Index: macro.c
===================================================================
--- macro.c	(revision 164202)
+++ macro.c	(working copy)
@@ -1350,7 +1350,7 @@
 
   pfile->set_invocation_location = true;
   result = cpp_get_token (pfile);
-  if (pfile->context->macro)
+  if (pfile->context->macro && pfile->invocation_location > result->src_loc)
     *loc = pfile->invocation_location;
   else
     *loc = result->src_loc;


[-- Attachment #2: hack-buffer-objfwd.el --]
[-- Type: text/plain, Size: 6649 bytes --]

;; Rewrite all references to buffer-objfwd fields in struct buffer
;; to use accessor macros.
;; This works in a tricky way: it renames all such fields, then
;; recompiles Emacs.  Then it visits each error location and
;; rewrites the expressions.
;; This has a few requirements in order to work.
;; First, Emacs must compile before the script is run.
;; It does not handle errors arising for other reasons.
;; Second, you need a GCC which has been hacked to emit proper
;; column location even when the -> expression in question has
;; been wrapped in a macro call.  (This is a one-liner in libcpp.)
;; After running this script, a few changes need to be made by hand.
;; These occur mostly in macros in headers, but also in
;; reset_buffer and reset_buffer_local_variables.  Finally,
;; DEFVAR_PER_BUFFER and the GC should not use these accessors.

(defvar gcc-prefix "/home/tromey/gnu/Trunk/install/")

(defvar emacs-src "/home/tromey/gnu/Emacs/Gitorious/emacs-mt/src/")
(defvar emacs-build "/home/tromey/gnu/Emacs/Gitorious/build/src/")

(defun file-error (text)
  (error "%s:%d:%d: error: expected %s"
	 buffer-file-name (line-number-at-pos (point))
	 (current-column)
	 text))

(defun assert-looking-at (exp)
  (unless (looking-at exp)
    (file-error exp)))

(defvar field-names nil)

(defvar field-regexp nil)

(defun modify-buffer.h ()
  (message "Modifying fields in struct buffer")
  (find-file (expand-file-name "buffer.h" emacs-src))
  (goto-char (point-min))
  (re-search-forward "^struct buffer$")
  (forward-line)
  (assert-looking-at "^{")
  (let ((starting-point (point))
	(closing-brace (save-excursion
			 (forward-sexp)
			 (point))))
    ;; Find each field.
    (while (re-search-forward "^\\s *Lisp_Object\\s +"
			      closing-brace 'move)
      (goto-char (match-end 0))
      (while (not (looking-at ";"))
	(assert-looking-at "\\([A-Za-z0-9_]+\\)\\(;\\|,\\s *\\)")
	;; Remember the name so we can generate accessors.
	(push (match-string 1) field-names)
	;; Rename it.
	(goto-char (match-beginning 2))
	(insert "_")
	;; On to the next one, if any.
	(if (looking-at ",\\s *")
	    (goto-char (match-end 0)))))
    ;; Generate accessors.
    (goto-char starting-point)
    (forward-sexp)
    (forward-line)
    (insert "\n")
    (dolist (name field-names)
      (insert "#define BUF_" (upcase name) "(BUF) "
	      "*find_variable_location (&((BUF)->"
	      name "_))\n"))
    (insert "\n"))
  (setq field-regexp (concat "\\(->\\|\\.\\)"
			     (regexp-opt field-names t)
			     "\\_>"))
  (save-buffer))

(defun get-field-name ()
  (save-excursion
    (assert-looking-at "\\(\\.\\|->\\)\\([A-Za-z0-9_]+\\)\\_>")
    (prog1
	(match-string 2)
      (delete-region (match-beginning 0) (match-end 0)))))

(defun skip-backward-lhs ()
  (skip-chars-backward " \t\n")
  (cond
   ((eq (char-before) ?\])
    (file-error "array ref!")
    ;; fixme
    )
   ((eq (char-before) ?\))
    ;; A paren expression is preceding.
    ;; See if this is just a paren expression or whether it is a
    ;; function call.
    ;; For now assume that there are no function-calls-via-expr.
    (backward-sexp)
    (skip-chars-backward " \t\n")
    (if (save-excursion
	  (backward-char)
	  (looking-at "[A-Za-z0-9_]"))
	(backward-sexp)))
   ((save-excursion
      (backward-char)
      (looking-at "[A-Za-z0-9_]"))
    (backward-sexp))
   (t
    (file-error "unhandled case!"))))

(defun do-fix-instance ()
  (cond
   ((looking-at "->")
    (let ((field-name (get-field-name)))
      (insert ")")
      (backward-char)
      (skip-backward-lhs)
      (insert "BUF_" (upcase field-name) " (")))
   ((eq (char-after) ?.)
    (let ((field-name (get-field-name)))
      (insert ")")
      (backward-char)
      (backward-sexp)
      (assert-looking-at "\\(buffer_defaults\\|buffer_local_flags\\)")
      (insert "BUF_" (upcase field-name) " (&")))
   (t
    (message "%s:%d:%d: warning: did not see -> or ., probably macro"
	     buffer-file-name (line-number-at-pos (point))
	     (current-column)))))

(defun update-header-files ()
  (dolist (file (directory-files emacs-src t "h$"))
    (message "Applying header changes to %s" file)
    (find-file file)
    (while (re-search-forward
	    "\\(current_buffer->\\|buffer_defaults\\.\\)"
	    nil 'move)
      (goto-char (match-end 0))
      (skip-chars-backward "->.")
      (when (looking-at field-regexp)
	(do-fix-instance)))
    (goto-char (point-min))
    (while (search-forward "XBUFFER (" nil 'move)
      (goto-char (- (match-end 0) 1))
      (forward-sexp)
      ;; This works even for the new #define BUF_ macros
      ;; because the field-regexp ends with \_>.
      (when (looking-at field-regexp)
	(do-fix-instance)))
    (save-buffer)))

(defun fix-one-instance (filename line column)
  (message "%s:%d:%d: info: fixing instance" filename line column)
  (find-file filename)
  (goto-char (point-min))
  (forward-line (- line 1))
  ;; (move-to-column (- column 1))
  (forward-char (- column 1))
  (do-fix-instance))

(defvar make-accumulation "")

(defvar last-error-line nil)
(defvar error-list nil)

(defun make-filter (process string)
  (setq make-accumulation (concat make-accumulation string))
  (while (string-match "^[^\n]*\n" make-accumulation)
    (let ((line (substring (match-string 0 make-accumulation) 0 -1)))
      (setq make-accumulation (substring make-accumulation
					 (match-end 0)))
      (message "%s" line)
      (if (string-match "^\\([^:]+\\):\\([0-9]+\\):\\([0-9]+\\)+: error:"
			line)
	  (save-excursion
	    (let ((file-name (match-string 1 line))
		  (line-no (string-to-number (match-string 2 line)))
		  (col-no (string-to-number (match-string 3 line))))
	      ;; Process all errors on a given line in reverse order.
	      (unless (eq line-no last-error-line)
		(dolist (one-item error-list)
		  (apply #'fix-one-instance one-item))
		(setq error-list nil)
		(setq last-error-line line-no))
	      (push (list file-name line-no col-no) error-list)))))))

(defvar make-done nil)

(defun make-sentinel (process string)
  (dolist (one-item error-list)
    (apply #'fix-one-instance one-item))
  (setq make-done t))

(defun recompile-emacs ()
  (let* ((default-directory emacs-build)
	 (output-buffer (get-buffer-create "*recompile*"))
	 (make (start-process "make" output-buffer "make" "-k")))
    (set-process-filter make #'make-filter)
    (set-process-sentinel make #'make-sentinel)
    (while (not make-done)
      (accept-process-output))))

(modify-buffer.h)
(update-header-files)
(recompile-emacs)
(dolist (buf (buffer-list))
  (with-current-buffer buf
    (when buffer-file-name
      (message "Saving %s" buffer-file-name)
      (save-buffer))))

[-- Attachment #3: rewrite-globals.el --]
[-- Type: text/plain, Size: 2007 bytes --]

;; Rewrite DEFVAR_LISP variables.
;; Each variable is renamed to start with impl_.
;; Compatibility defines are added to globals.h.
;; Invoke as:  emacs --script rewrite-globals.el

(defvar defvar-list '())

(defun extract-defvars ()
  (let ((case-fold-search nil))
    (while (re-search-forward "^[^#*]*\\(DEFVAR_[A-Z_]*\\)" nil 'move)
      (let ((kind (match-string 1)))
	(unless (member kind '("DEFVAR_KBOARD" "DEFVAR_PER_BUFFER"))
	  ;; Skip the paren and the first argument.
	  (skip-chars-forward " (")
	  (forward-sexp)
	  (skip-chars-forward ", \t\n&")
	  (if (looking-at "\\_<\\(\\sw\\|\\s_\\)+\\_>")
	      (let ((var-name (match-string 0)))
		  (if (equal kind "DEFVAR_LISP")
		      (push var-name defvar-list)))))))))

(defun munge-V ()
  (interactive)
  (while (re-search-forward "^\\(extern \\|static \\)?Lisp_Object " nil 'move)
    ;; skip function decls.
    (if (not (looking-at ".*("))
	(while (looking-at "[a-z0-9A-Z_]+")
	  (if (member (match-string 0) defvar-list)
	      (progn
		;; Rename them all to impl_
		(goto-char (match-beginning 0))
		(insert "impl_")))
	  (forward-sexp)
	  (skip-chars-forward ", \t\n")))))

(defconst V-dir ".")

(defun munge-V-directory ()
  ;; First extract all defvars.
  (dolist (file (directory-files V-dir t "[ch]$"))
    (save-excursion
      (message "Scanning %s" file)
      (find-file file)
      (extract-defvars)))

  (setq defvar-list (delete-dups (sort defvar-list #'string<)))

  (dolist (file (directory-files V-dir t "[ch]$"))
    (save-excursion
      (message "Processing %s" file)
      (find-file file)
      (goto-char (point-min))
      (munge-V)
      (save-buffer)))

  (find-file "globals.h")
  (erase-buffer)
  (dolist (v defvar-list)
    (insert "#define " v " *find_variable_location (&impl_" v ")\n"))

  ;; A few special cases for globals.h.
  (insert "\n")
  (dolist (v '("do_mouse_tracking" "Vmark_even_if_inactive" "Vprint_level"))
    (insert "extern Lisp_Object impl_" v ";\n"))
  (save-buffer))

(munge-V-directory)

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Compiling Elisp to a native code with a GCC plugin
  2010-09-14 21:59             ` Tom Tromey
  2010-09-14 22:37               ` Wojciech Meyer
@ 2010-09-14 22:49               ` Wojciech Meyer
  1 sibling, 0 replies; 97+ messages in thread
From: Wojciech Meyer @ 2010-09-14 22:49 UTC (permalink / raw)
  To: Tom Tromey; +Cc: Wojciech Meyer, emacs-devel

Tom Tromey <tromey@redhat.com> writes:

> These scripts are both in src/ on the concurrency branch.

Ups I missed that. Thanks.

Wojciech



^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Compiling Elisp to a native code with a GCC plugin
  2010-09-14 21:16         ` Tom Tromey
  2010-09-14 21:29           ` Wojciech Meyer
@ 2010-09-14 23:13           ` Thomas Lord
  2010-09-14 23:42             ` Wojciech Meyer
  1 sibling, 1 reply; 97+ messages in thread
From: Thomas Lord @ 2010-09-14 23:13 UTC (permalink / raw)
  To: Tom Tromey; +Cc: Wojciech Meyer, emacs-devel

I've only loosely followed this thread so
its possible I'm off the mark (in which case - sorry)
but, i'm pretty sure it might be helpful to remark
about a variant of what I guess Tromey is suggesting:

Years ago - not for GC but for managing critical
sections wherein interrupts had to be deferred - 
we did something similar in a fork of GNU Guile.

In that case, semi-automated ad-hoc rewriting was
used a tiny bit but the most helpful thing turned
out to be:

a) rip a C grammar out of GCC (unless we used a 
different source, I forget).

b) hack the actions to hook up to a scheme (or
other lisp) run-time system and build an AST
as a big S-EXP.  Make sure this AST records source
files and line numbers.

c) write ad-hoc cheapo static analysis tools to 
walk the AST and find places where either it was
obvious fixes were needed, or where it was not obvious
fixes were not needed --- print those out like compiler
error messages.

d) Interactively page through those and, as you watch
each case, apply the ad hoc semi-automated rewrite tools
(or do it by hand in hard cases).

Step (d) can go very, very fast and, at least in that
case, steps (a .. c) can go a lot faster than
you might guess at first glance.

-t



On Tue, 2010-09-14 at 15:16 -0600, Tom Tromey wrote:
> >>>>> "Wojciech" == Wojciech Meyer <wojciech.meyer@googlemail.com> writes:
> 
> Tom> It could be done.  It just requires someone willing to do the work.
> 
> Wojciech> I know. I could get my old sources of generational garbage
> Wojciech> collector, to work. However it is a daunting job (the worse I
> Wojciech> could imagine, garbage collectors are nasty), plugging and
> Wojciech> debugging a new garbage collector to such huge and esoteric (I
> Wojciech> am sure people that who've been working on Emacs for years
> Wojciech> will not take this words badly and understand straight away
> Wojciech> what I am (a newbie) talking about) project like
> Wojciech> Emacs. However I might try to experiment with it (however
> Wojciech> unfortunately I am not that self confident about it ;) ).
> 
> It is always ok to ask for help.
> 
> The current collector is very simple to understand.  If you read
> alloc.c, and look through the data structures representing lisp objects
> (in lisp.h), you will have a pretty good idea of what is going on.
> 
> 
> FWIW, I looked at writing an incremental collector for Emacs.  I was
> primarily interested in using software write barriers... this turns out
> to be hard because there is a lot of code in Emacs of the form:
> 
>    FIELD_ACCESSOR (object) = value;
> 
> ... which for a software barrier has to be converted to:
> 
>    SET_FIELD_ACCESSOR (object, value);
> 
> (There are other bad things, too, like passing around a Lisp_Object*
> that points to the contents of a vector.)
> 
> So, lots of grunge work, just to get the point where you could start
> actually working on the GC.  I would look at automated rewriting to
> make this work -- that worked out great on the concurrent branch.
> 
> 
> There was a more real attempt based on the Boehm GC.  I think the bits
> from that are still on a branch.  This GC has a generational mode that,
> IIRC, is based on memory protection bits.
> 
> Tom
> 





^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Compiling Elisp to a native code with a GCC plugin
  2010-09-14 23:13           ` Thomas Lord
@ 2010-09-14 23:42             ` Wojciech Meyer
  0 siblings, 0 replies; 97+ messages in thread
From: Wojciech Meyer @ 2010-09-14 23:42 UTC (permalink / raw)
  To: Thomas Lord; +Cc: Tom Tromey, Wojciech Meyer, emacs-devel

Thomas Lord <lord@emf.net> writes:

> Years ago - not for GC but for managing critical
> sections wherein interrupts had to be deferred - 
> we did something similar in a fork of GNU Guile.
>
> In that case, semi-automated ad-hoc rewriting was
> used a tiny bit but the most helpful thing turned
> out to be:
>
> a) rip a C grammar out of GCC (unless we used a 
> different source, I forget).

GCC has a hand written recursive descent parser, so probably I would
need to use some other one (I do have one I think somewhere ;) )
And also macro definitions might be harder to handle.

>
> b) hack the actions to hook up to a scheme (or
> other lisp) run-time system and build an AST
> as a big S-EXP.  Make sure this AST records source
> files and line numbers.

We could get those from Clang XML output (I hope), and transform it with
xsltproc even to Sexp for easy loading.

>
> c) write ad-hoc cheapo static analysis tools to 
> walk the AST and find places where either it was
> obvious fixes were needed, or where it was not obvious
> fixes were not needed --- print those out like compiler
> error messages.
>
> d) Interactively page through those and, as you watch
> each case, apply the ad hoc semi-automated rewrite tools
> (or do it by hand in hard cases).
>
> Step (d) can go very, very fast and, at least in that
> case, steps (a .. c) can go a lot faster than
> you might guess at first glance.
>

Sounds straightforward to me.
Thanks for the tips.

But the worst thing, I still don't know what I will be rewriting! That's
the major problem here. (but I am willing to help improving existing
code base once i got to that point..).

> -t

Wojciech



^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Compiling Elisp to a native code with a GCC plugin
  2010-09-14 19:32 ` Tom Tromey
  2010-09-14 19:45   ` Wojciech Meyer
@ 2010-09-15 10:47   ` Leo
  2010-09-15 11:41     ` Andreas Schwab
  2010-09-15 14:07     ` Stefan Monnier
  1 sibling, 2 replies; 97+ messages in thread
From: Leo @ 2010-09-15 10:47 UTC (permalink / raw)
  To: emacs-devel

On 2010-09-14 20:32 +0100, Tom Tromey wrote:
> As I recall, in my profiles, the GC and the regexp matcher were more
> costly the bytecode interpreter (though of course this is
> workload-dependent).

Regarding regexp matcher, do you know if performance will be improved by
using pcre?

Leo




^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Compiling Elisp to a native code with a GCC plugin
  2010-09-15 10:47   ` Leo
@ 2010-09-15 11:41     ` Andreas Schwab
  2010-09-15 12:10       ` Wojciech Meyer
  2010-09-15 14:07     ` Stefan Monnier
  1 sibling, 1 reply; 97+ messages in thread
From: Andreas Schwab @ 2010-09-15 11:41 UTC (permalink / raw)
  To: Leo; +Cc: emacs-devel

Leo <sdl.web@gmail.com> writes:

> On 2010-09-14 20:32 +0100, Tom Tromey wrote:
>> As I recall, in my profiles, the GC and the regexp matcher were more
>> costly the bytecode interpreter (though of course this is
>> workload-dependent).
>
> Regarding regexp matcher, do you know if performance will be improved by
> using pcre?

You can't just switch to other regexp engines because they don't offer
all Emacs features.

Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."



^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Compiling Elisp to a native code with a GCC plugin
  2010-09-15 11:41     ` Andreas Schwab
@ 2010-09-15 12:10       ` Wojciech Meyer
  0 siblings, 0 replies; 97+ messages in thread
From: Wojciech Meyer @ 2010-09-15 12:10 UTC (permalink / raw)
  To: Andreas Schwab; +Cc: Leo, emacs-devel

On Wed, Sep 15, 2010 at 12:41 PM, Andreas Schwab <schwab@linux-m68k.org> wrote:
> Leo <sdl.web@gmail.com> writes:
>
>> On 2010-09-14 20:32 +0100, Tom Tromey wrote:
>>> As I recall, in my profiles, the GC and the regexp matcher were more
>>> costly the bytecode interpreter (though of course this is
>>> workload-dependent).
>>
>> Regarding regexp matcher, do you know if performance will be improved by
>> using pcre?
>
> You can't just switch to other regexp engines because they don't offer
> all Emacs features.

We could transform syntax from Elisp like to pcre using string substitution
on the fly. For the features not present in pcre just use different
set of functions,
regexp is self contained it is just a string.
Compiling regexp to a native code using gcc or GNU lightning (or any
other framework),
could also be a solution.

> Andreas.

Wojciech



^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Compiling Elisp to a native code with a GCC plugin
  2010-09-15 10:47   ` Leo
  2010-09-15 11:41     ` Andreas Schwab
@ 2010-09-15 14:07     ` Stefan Monnier
  2010-09-15 14:27       ` Helmut Eller
  2010-09-15 21:04       ` Leo
  1 sibling, 2 replies; 97+ messages in thread
From: Stefan Monnier @ 2010-09-15 14:07 UTC (permalink / raw)
  To: Leo; +Cc: emacs-devel

>> As I recall, in my profiles, the GC and the regexp matcher were more
>> costly the bytecode interpreter (though of course this is
>> workload-dependent).
> Regarding regexp matcher, do you know if performance will be improved by
> using pcre?

Using a different regexp-engine might be a good idea.
But there are two issues:
- Emacs needs to be able to match on buffer text rather than only on
  strings.  Buffer text is made of 2 chunks of utf-8 byte arrays, so the
  regexp engine needs to be able to handle a whole in the middle of
  its input.
- The main problem with Emacs regexps right now is that they have
  pathological cases where the match-time is enormous (potentially
  exponential explosion in the size of the input string).  To be
  worthwhile a replacement should address this problem, which basically
  needs it should not be based on backtracking.

IIUC pcre suffers from the same 2nd problem, which in my book makes it
a poor candidate for replacement.


        Stefan



^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Compiling Elisp to a native code with a GCC plugin
  2010-09-15 14:07     ` Stefan Monnier
@ 2010-09-15 14:27       ` Helmut Eller
  2010-09-15 14:59         ` Stefan Monnier
  2010-09-15 21:04       ` Leo
  1 sibling, 1 reply; 97+ messages in thread
From: Helmut Eller @ 2010-09-15 14:27 UTC (permalink / raw)
  To: emacs-devel

* Stefan Monnier [2010-09-15 14:07] writes:

> - The main problem with Emacs regexps right now is that they have
>   pathological cases where the match-time is enormous (potentially
>   exponential explosion in the size of the input string).  To be
>   worthwhile a replacement should address this problem, which basically
>   needs it should not be based on backtracking.

Is it possible (theoretically) to implement all of Emacs regexps without
backtracking?  In particular those with back-references (\N) seem
problematic.  Or is it necessary to recognize "optimizable" regexps
before using a different regexp engine?

Helmut




^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Compiling Elisp to a native code with a GCC plugin
  2010-09-15 14:27       ` Helmut Eller
@ 2010-09-15 14:59         ` Stefan Monnier
  2010-09-15 15:09           ` Lars Magne Ingebrigtsen
  2010-09-15 15:46           ` Helmut Eller
  0 siblings, 2 replies; 97+ messages in thread
From: Stefan Monnier @ 2010-09-15 14:59 UTC (permalink / raw)
  To: Helmut Eller; +Cc: emacs-devel

>> - The main problem with Emacs regexps right now is that they have
>> pathological cases where the match-time is enormous (potentially
>> exponential explosion in the size of the input string).  To be
>> worthwhile a replacement should address this problem, which basically
>> needs it should not be based on backtracking.
> Is it possible (theoretically) to implement all of Emacs regexps without
> backtracking?  In particular those with back-references (\N) seem
> problematic.  Or is it necessary to recognize "optimizable" regexps
> before using a different regexp engine?

IIRC regexps without back-refs can be matched (and searched) in O(N)
where N is the length of the input.  With back-refs, I think (not sure)
the theoretical bound is O(N^2), which requires
a non-backtracking algorithm.

So yes, we'd need to handle back-refs specially.  Several regexp engines
do that already (they have a few different inner engines and choose
which one to use based on the particular regexp at hand).


        Stefan



^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Compiling Elisp to a native code with a GCC plugin
  2010-09-15 14:59         ` Stefan Monnier
@ 2010-09-15 15:09           ` Lars Magne Ingebrigtsen
  2010-09-15 15:31             ` Andreas Schwab
  2010-09-15 15:42             ` Stefan Monnier
  2010-09-15 15:46           ` Helmut Eller
  1 sibling, 2 replies; 97+ messages in thread
From: Lars Magne Ingebrigtsen @ 2010-09-15 15:09 UTC (permalink / raw)
  To: emacs-devel

This reminds me of a question I've forgotten to ask.

Many protocols that Gnus parses is such that I need to find some string
that matches the beginning of the line, and I usually do

  (re-search-forward "^foo ")

or the like.  However, many times the it's really a substring and not a
regexp, and I could say

  (search-forward "\nfoo ")

if it weren't for that not matching the first entry in a buffer.

So you end up with

(or (looking-at "foo ") (search-forward "\nfoo "))

which creates a regexp, anyway, and seems clumsy.

So what I wonder is whether there is a smarter way to do this, in
general.  (I'm assuming that a simple string search is faster than a
regexp search, but I've never actually benchmarked this.)

-- 
(domestic pets only, the antidote for overdose, milk.)
  larsi@gnus.org * Lars Magne Ingebrigtsen

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Compiling Elisp to a native code with a GCC plugin
  2010-09-15 15:09           ` Lars Magne Ingebrigtsen
@ 2010-09-15 15:31             ` Andreas Schwab
  2010-09-15 15:35               ` Lars Magne Ingebrigtsen
  2010-09-15 15:42             ` Stefan Monnier
  1 sibling, 1 reply; 97+ messages in thread
From: Andreas Schwab @ 2010-09-15 15:31 UTC (permalink / raw)
  To: emacs-devel

Lars Magne Ingebrigtsen <larsi@gnus.org> writes:

> So you end up with
>
> (or (looking-at "foo ") (search-forward "\nfoo "))
>
> which creates a regexp, anyway, and seems clumsy.
>
> So what I wonder is whether there is a smarter way to do this, in
> general.  (I'm assuming that a simple string search is faster than a
> regexp search, but I've never actually benchmarked this.)

Trivial regexp searches are already optimized to bypass the regexp
engine.  Doing a similar check in looking-at might be worthwhile.

Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."



^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Compiling Elisp to a native code with a GCC plugin
  2010-09-15 15:31             ` Andreas Schwab
@ 2010-09-15 15:35               ` Lars Magne Ingebrigtsen
  2010-09-15 16:28                 ` Andreas Schwab
  0 siblings, 1 reply; 97+ messages in thread
From: Lars Magne Ingebrigtsen @ 2010-09-15 15:35 UTC (permalink / raw)
  To: emacs-devel

Andreas Schwab <schwab@linux-m68k.org> writes:

> Trivial regexp searches are already optimized to bypass the regexp
> engine.  Doing a similar check in looking-at might be worthwhile.

I did some trivial benchmarking with

(while (search-backward "\n(defun " nil t)))

and the equivalent re-search-backward in a buffer in a loop, and the
search-backward version was about 8x faster.

-- 
(domestic pets only, the antidote for overdose, milk.)
  larsi@gnus.org * Lars Magne Ingebrigtsen




^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Compiling Elisp to a native code with a GCC plugin
  2010-09-15 15:35               ` Lars Magne Ingebrigtsen
@ 2010-09-15 16:28                 ` Andreas Schwab
  2010-09-16 16:57                   ` Lars Magne Ingebrigtsen
  0 siblings, 1 reply; 97+ messages in thread
From: Andreas Schwab @ 2010-09-15 16:28 UTC (permalink / raw)
  To: emacs-devel

Lars Magne Ingebrigtsen <larsi@gnus.org> writes:

> Andreas Schwab <schwab@linux-m68k.org> writes:
>
>> Trivial regexp searches are already optimized to bypass the regexp
>> engine.  Doing a similar check in looking-at might be worthwhile.
>
> I did some trivial benchmarking with
>
> (while (search-backward "\n(defun " nil t)))
>
> and the equivalent re-search-backward in a buffer in a loop, and the
> search-backward version was about 8x faster.

How did you measure that?  When I tried I did not see any significant
difference.

Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."



^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Compiling Elisp to a native code with a GCC plugin
  2010-09-15 16:28                 ` Andreas Schwab
@ 2010-09-16 16:57                   ` Lars Magne Ingebrigtsen
  0 siblings, 0 replies; 97+ messages in thread
From: Lars Magne Ingebrigtsen @ 2010-09-16 16:57 UTC (permalink / raw)
  To: emacs-devel

[-- Attachment #1: Type: text/plain, Size: 550 bytes --]

Andreas Schwab <schwab@linux-m68k.org> writes:

>> I did some trivial benchmarking with
>>
>> (while (search-backward "\n(defun " nil t)))
>>
>> and the equivalent re-search-backward in a buffer in a loop, and the
>> search-backward version was about 8x faster.
>
> How did you measure that?  When I tried I did not see any significant
> difference.

I just call the stuff I want to benchmark a gazillion times with my tiny
benchmark.el package:

(benchmark 10000 (goto-char (point-min)) (while (search-forward "\n(defun " nil t) (forward-line 1)))


[-- Attachment #2: benchmark.el --]
[-- Type: application/emacs-lisp, Size: 2082 bytes --]

[-- Attachment #3: Type: text/plain, Size: 103 bytes --]


-- 
(domestic pets only, the antidote for overdose, milk.)
  larsi@gnus.org * Lars Magne Ingebrigtsen

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Compiling Elisp to a native code with a GCC plugin
  2010-09-15 15:09           ` Lars Magne Ingebrigtsen
  2010-09-15 15:31             ` Andreas Schwab
@ 2010-09-15 15:42             ` Stefan Monnier
  2010-09-15 15:51               ` Lars Magne Ingebrigtsen
  1 sibling, 1 reply; 97+ messages in thread
From: Stefan Monnier @ 2010-09-15 15:42 UTC (permalink / raw)
  To: emacs-devel

> So you end up with
> (or (looking-at "foo ") (search-forward "\nfoo "))
> which creates a regexp, anyway, and seems clumsy.

Unless the text you match is short, the above is probably the fastest, indeed.
There is no built-in support for the above idiom, OTOH, so you have to
pay for the extra Elisp interpretation overhead of calling looking-at
and then search-forward.

> So what I wonder is whether there is a smarter way to do this, in
> general.  (I'm assuming that a simple string search is faster than a
> regexp search, but I've never actually benchmarked this.)

A simple string search is indeed faster and uses one of those algorithms
that are faster for longer search strings.

        Stefan

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Compiling Elisp to a native code with a GCC plugin
  2010-09-15 15:42             ` Stefan Monnier
@ 2010-09-15 15:51               ` Lars Magne Ingebrigtsen
  2010-09-15 15:57                 ` Leo
  2010-09-16  2:57                 ` Stephen J. Turnbull
  0 siblings, 2 replies; 97+ messages in thread
From: Lars Magne Ingebrigtsen @ 2010-09-15 15:51 UTC (permalink / raw)
  To: emacs-devel

Stefan Monnier <monnier@IRO.UMontreal.CA> writes:

>> So you end up with
>> (or (looking-at "foo ") (search-forward "\nfoo "))
>> which creates a regexp, anyway, and seems clumsy.
>
> Unless the text you match is short, the above is probably the fastest, indeed.
> There is no built-in support for the above idiom, OTOH, so you have to
> pay for the extra Elisp interpretation overhead of calling looking-at
> and then search-forward.

looking-at probably compiles the regexp, so there might be unnecessary
overhead there.  (The regexp compilation and caching and stuff.)

Is there any function like

(is-the-string-following-point-equal-to-this-string-p "foo ")

in Emacs that I've overlooked somehow?

-- 
(domestic pets only, the antidote for overdose, milk.)
  larsi@gnus.org * Lars Magne Ingebrigtsen




^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Compiling Elisp to a native code with a GCC plugin
  2010-09-15 15:51               ` Lars Magne Ingebrigtsen
@ 2010-09-15 15:57                 ` Leo
  2010-09-15 16:01                   ` Lars Magne Ingebrigtsen
  2010-09-15 16:05                   ` David Kastrup
  2010-09-16  2:57                 ` Stephen J. Turnbull
  1 sibling, 2 replies; 97+ messages in thread
From: Leo @ 2010-09-15 15:57 UTC (permalink / raw)
  To: emacs-devel

On 2010-09-15 16:51 +0100, Lars Magne Ingebrigtsen wrote:
> looking-at probably compiles the regexp, so there might be unnecessary
> overhead there.  (The regexp compilation and caching and stuff.)
>
> Is there any function like
>
> (is-the-string-following-point-equal-to-this-string-p "foo ")
>
> in Emacs that I've overlooked somehow?

Can you build one using compare-strings?

Leo




^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Compiling Elisp to a native code with a GCC plugin
  2010-09-15 15:57                 ` Leo
@ 2010-09-15 16:01                   ` Lars Magne Ingebrigtsen
  2010-09-15 16:05                   ` David Kastrup
  1 sibling, 0 replies; 97+ messages in thread
From: Lars Magne Ingebrigtsen @ 2010-09-15 16:01 UTC (permalink / raw)
  To: emacs-devel

Leo <sdl.web@gmail.com> writes:

> Can you build one using compare-strings?

That would be a consing operation.  It's an operation that doesn't have
to create garbage, which is nice if you're doing loops like

(while (or (following-string-p "foo ")
           (search-forward "\nfoo "))
  ...)

-- 
(domestic pets only, the antidote for overdose, milk.)
  larsi@gnus.org * Lars Magne Ingebrigtsen




^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Compiling Elisp to a native code with a GCC plugin
  2010-09-15 15:57                 ` Leo
  2010-09-15 16:01                   ` Lars Magne Ingebrigtsen
@ 2010-09-15 16:05                   ` David Kastrup
  2010-09-15 16:23                     ` Leo
  1 sibling, 1 reply; 97+ messages in thread
From: David Kastrup @ 2010-09-15 16:05 UTC (permalink / raw)
  To: emacs-devel

Leo <sdl.web@gmail.com> writes:

> On 2010-09-15 16:51 +0100, Lars Magne Ingebrigtsen wrote:
>> looking-at probably compiles the regexp, so there might be unnecessary
>> overhead there.  (The regexp compilation and caching and stuff.)
>>
>> Is there any function like
>>
>> (is-the-string-following-point-equal-to-this-string-p "foo ")
>>
>> in Emacs that I've overlooked somehow?
>
> Can you build one using compare-strings?

More likely compare-buffer-substrings.  It would be nicer if
compare-strings just accepted a buffer as either of its string
arguments.  Sure, one can use buffer-substring-no-properties with
compare-strings or (with-temp-buffer (insert ... with
compare-buffer-substrings, but that feels clumsy in comparison.

-- 
David Kastrup




^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Compiling Elisp to a native code with a GCC plugin
  2010-09-15 16:05                   ` David Kastrup
@ 2010-09-15 16:23                     ` Leo
  2010-09-15 16:37                       ` David Kastrup
  0 siblings, 1 reply; 97+ messages in thread
From: Leo @ 2010-09-15 16:23 UTC (permalink / raw)
  To: emacs-devel

On 2010-09-15 17:05 +0100, David Kastrup wrote:
> Leo <sdl.web@gmail.com> writes:
>
>> On 2010-09-15 16:51 +0100, Lars Magne Ingebrigtsen wrote:
>>> looking-at probably compiles the regexp, so there might be unnecessary
>>> overhead there.  (The regexp compilation and caching and stuff.)
>>>
>>> Is there any function like
>>>
>>> (is-the-string-following-point-equal-to-this-string-p "foo ")
>>>
>>> in Emacs that I've overlooked somehow?
>>
>> Can you build one using compare-strings?
>
> More likely compare-buffer-substrings.  It would be nicer if
> compare-strings just accepted a buffer as either of its string
> arguments.  Sure, one can use buffer-substring-no-properties with
> compare-strings or (with-temp-buffer (insert ... with
> compare-buffer-substrings, but that feels clumsy in comparison.

These two functions look similar, any idea why not extend
compare-strings as David suggested?

Leo




^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Compiling Elisp to a native code with a GCC plugin
  2010-09-15 16:23                     ` Leo
@ 2010-09-15 16:37                       ` David Kastrup
  2010-09-16 16:58                         ` Lars Magne Ingebrigtsen
  2010-09-16 17:35                         ` Lars Magne Ingebrigtsen
  0 siblings, 2 replies; 97+ messages in thread
From: David Kastrup @ 2010-09-15 16:37 UTC (permalink / raw)
  To: emacs-devel

Leo <sdl.web@gmail.com> writes:

> On 2010-09-15 17:05 +0100, David Kastrup wrote:
>> Leo <sdl.web@gmail.com> writes:
>>
>>> On 2010-09-15 16:51 +0100, Lars Magne Ingebrigtsen wrote:
>>>> looking-at probably compiles the regexp, so there might be unnecessary
>>>> overhead there.  (The regexp compilation and caching and stuff.)
>>>>
>>>> Is there any function like
>>>>
>>>> (is-the-string-following-point-equal-to-this-string-p "foo ")
>>>>
>>>> in Emacs that I've overlooked somehow?
>>>
>>> Can you build one using compare-strings?
>>
>> More likely compare-buffer-substrings.  It would be nicer if
>> compare-strings just accepted a buffer as either of its string
>> arguments.  Sure, one can use buffer-substring-no-properties with
>> compare-strings or (with-temp-buffer (insert ... with
>> compare-buffer-substrings, but that feels clumsy in comparison.
>
> These two functions look similar, any idea why not extend
> compare-strings as David suggested?

A plausible reason would be that it is not trivial to do and nobody
needed it so far.

Lars sounds like he would be better served with looking-at getting an
optional "LITERAL" argument making it do its job without involving the
regexp machinery.

Of course, he could just try something like

(search-forward string (+ (point) (length string)) t)

which should work just fine in his case.  In particular since he appears
to want to move beyond the match (if any) anyhow.

-- 
David Kastrup




^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Compiling Elisp to a native code with a GCC plugin
  2010-09-15 16:37                       ` David Kastrup
@ 2010-09-16 16:58                         ` Lars Magne Ingebrigtsen
  2010-09-16 21:11                           ` Andreas Schwab
  2010-09-16 17:35                         ` Lars Magne Ingebrigtsen
  1 sibling, 1 reply; 97+ messages in thread
From: Lars Magne Ingebrigtsen @ 2010-09-16 16:58 UTC (permalink / raw)
  To: emacs-devel

David Kastrup <dak@gnu.org> writes:

> Lars sounds like he would be better served with looking-at getting an
> optional "LITERAL" argument making it do its job without involving the
> regexp machinery.

Yes, a LITERAL option to `looking-at' would make sense.

-- 
(domestic pets only, the antidote for overdose, milk.)
  larsi@gnus.org * Lars Magne Ingebrigtsen




^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Compiling Elisp to a native code with a GCC plugin
  2010-09-16 16:58                         ` Lars Magne Ingebrigtsen
@ 2010-09-16 21:11                           ` Andreas Schwab
  2010-09-16 23:17                             ` Lars Magne Ingebrigtsen
  0 siblings, 1 reply; 97+ messages in thread
From: Andreas Schwab @ 2010-09-16 21:11 UTC (permalink / raw)
  To: emacs-devel

Lars Magne Ingebrigtsen <larsi@gnus.org> writes:

> David Kastrup <dak@gnu.org> writes:
>
>> Lars sounds like he would be better served with looking-at getting an
>> optional "LITERAL" argument making it do its job without involving the
>> regexp machinery.
>
> Yes, a LITERAL option to `looking-at' would make sense.

IMHO it should be good enough to teach looking-at to recognize trivial
regexps.

Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."



^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Compiling Elisp to a native code with a GCC plugin
  2010-09-16 21:11                           ` Andreas Schwab
@ 2010-09-16 23:17                             ` Lars Magne Ingebrigtsen
  2010-09-17  8:13                               ` Eli Zaretskii
  0 siblings, 1 reply; 97+ messages in thread
From: Lars Magne Ingebrigtsen @ 2010-09-16 23:17 UTC (permalink / raw)
  To: emacs-devel

Andreas Schwab <schwab@linux-m68k.org> writes:

> IMHO it should be good enough to teach looking-at to recognize trivial
> regexps.

That would do the trick, too.  But having the LITERALP option doesn't
exactly add a lot of code...

Unless I've misunderstood how buffers and strings work, which is a very
high possibility.  Is there an architectural overview of the Emacs
internal anywhere?

-- 
(domestic pets only, the antidote for overdose, milk.)
  larsi@gnus.org * Lars Magne Ingebrigtsen

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Compiling Elisp to a native code with a GCC plugin
  2010-09-16 23:17                             ` Lars Magne Ingebrigtsen
@ 2010-09-17  8:13                               ` Eli Zaretskii
  2010-09-17 13:17                                 ` Lars Magne Ingebrigtsen
  0 siblings, 1 reply; 97+ messages in thread
From: Eli Zaretskii @ 2010-09-17  8:13 UTC (permalink / raw)
  To: Lars Magne Ingebrigtsen; +Cc: emacs-devel

> From: Lars Magne Ingebrigtsen <larsi@gnus.org>
> Date: Fri, 17 Sep 2010 01:17:05 +0200
> 
> Unless I've misunderstood how buffers and strings work, which is a very
> high possibility.

What aspects of buffers and strings you think you might not
understand?  Ask here any specific questions you have.

> Is there an architectural overview of the Emacs internal anywhere?

See the "Object Internals" node in the ELisp manual.  Buffers are
described there, but strings are not.  OTOH, a Lisp string is a fairly
simple object, so you should be able to grasp it by looking at the
definition of `struct Lisp_string' in lisp.h and how strings are
allocated and handled in alloc.c.  (There are subtleties about strings
and buffers when Emacs allocates large chunks of memory, but I don't
think those subtleties matter in the context of this discussion.)

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Compiling Elisp to a native code with a GCC plugin
  2010-09-17  8:13                               ` Eli Zaretskii
@ 2010-09-17 13:17                                 ` Lars Magne Ingebrigtsen
  2010-09-17 13:30                                   ` Eli Zaretskii
  0 siblings, 1 reply; 97+ messages in thread
From: Lars Magne Ingebrigtsen @ 2010-09-17 13:17 UTC (permalink / raw)
  To: emacs-devel

Eli Zaretskii <eliz@gnu.org> writes:

> What aspects of buffers and strings you think you might not
> understand?  Ask here any specific questions you have.

I mainly wonder how the text in the buffer is really represented.  Is it
like a string (which is utf8-ish), but with a gap somewhere?

>> Is there an architectural overview of the Emacs internal anywhere?
>
> See the "Object Internals" node in the ELisp manual.  Buffers are
> described there, but strings are not.

Thanks; that's a helpful node.  However, the node seems to, er, not
match up to how functions really are written.  The code is full of BEGV,
PT and CHECK_STRING, which is likely somewhat mysterious to new
hackers.  At least I was confused.  :-)

-- 
(domestic pets only, the antidote for overdose, milk.)
  larsi@gnus.org * Lars Magne Ingebrigtsen

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Compiling Elisp to a native code with a GCC plugin
  2010-09-17 13:17                                 ` Lars Magne Ingebrigtsen
@ 2010-09-17 13:30                                   ` Eli Zaretskii
  2010-09-17 13:34                                     ` Lars Magne Ingebrigtsen
  0 siblings, 1 reply; 97+ messages in thread
From: Eli Zaretskii @ 2010-09-17 13:30 UTC (permalink / raw)
  To: Lars Magne Ingebrigtsen; +Cc: emacs-devel

> From: Lars Magne Ingebrigtsen <larsi@gnus.org>
> Date: Fri, 17 Sep 2010 15:17:37 +0200
> 
> Eli Zaretskii <eliz@gnu.org> writes:
> 
> > What aspects of buffers and strings you think you might not
> > understand?  Ask here any specific questions you have.
> 
> I mainly wonder how the text in the buffer is really represented.  Is it
> like a string (which is utf8-ish), but with a gap somewhere?

Yes.

> Thanks; that's a helpful node.  However, the node seems to, er, not
> match up to how functions really are written.  The code is full of BEGV,
> PT and CHECK_STRING, which is likely somewhat mysterious to new
> hackers.  At least I was confused.  :-)

BEGV, PT, etc. are covered by the "Buffer Internals" node, except that
they are lowercased there (because they describe the corresponding
struct members to which the upper-cased macros expand).

CHECK_FOO just checks that the Lisp_Object is of type FOO.  This is
useful when you need to be sure you get the arguments of the right
type before you process them.



^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Compiling Elisp to a native code with a GCC plugin
  2010-09-17 13:30                                   ` Eli Zaretskii
@ 2010-09-17 13:34                                     ` Lars Magne Ingebrigtsen
  0 siblings, 0 replies; 97+ messages in thread
From: Lars Magne Ingebrigtsen @ 2010-09-17 13:34 UTC (permalink / raw)
  To: emacs-devel

Eli Zaretskii <eliz@gnu.org> writes:

>> I mainly wonder how the text in the buffer is really represented.  Is it
>> like a string (which is utf8-ish), but with a gap somewhere?
>
> Yes.

Oh, right.  Then I'll rewrite the html-parse-buffer function to use the
buffer string instead of taking a string as an argument, again.

-- 
(domestic pets only, the antidote for overdose, milk.)
  larsi@gnus.org * Lars Magne Ingebrigtsen




^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Compiling Elisp to a native code with a GCC plugin
  2010-09-15 16:37                       ` David Kastrup
  2010-09-16 16:58                         ` Lars Magne Ingebrigtsen
@ 2010-09-16 17:35                         ` Lars Magne Ingebrigtsen
  1 sibling, 0 replies; 97+ messages in thread
From: Lars Magne Ingebrigtsen @ 2010-09-16 17:35 UTC (permalink / raw)
  To: emacs-devel

[-- Attachment #1: Type: text/plain, Size: 285 bytes --]

David Kastrup <dak@gnu.org> writes:

> Lars sounds like he would be better served with looking-at getting an
> optional "LITERAL" argument making it do its job without involving the
> regexp machinery.

Here's a quick take on how an optional LITERALP option to looking-at
might look:


[-- Attachment #2: literal --]
[-- Type: application/octet-stream, Size: 2806 bytes --]

=== modified file 'src/lisp.h'
*** src/lisp.h	2010-09-12 14:35:37 +0000
--- src/lisp.h	2010-09-16 17:27:50 +0000
***************
*** 3114,3120 ****
  EXFUN (Fmatch_beginning, 1);
  EXFUN (Fmatch_end, 1);
  extern void record_unwind_save_match_data (void);
! EXFUN (Flooking_at, 1);
  extern int fast_string_match (Lisp_Object, Lisp_Object);
  extern int fast_c_string_match_ignore_case (Lisp_Object, const char *);
  extern int fast_string_match_ignore_case (Lisp_Object, Lisp_Object);
--- 3114,3120 ----
  EXFUN (Fmatch_beginning, 1);
  EXFUN (Fmatch_end, 1);
  extern void record_unwind_save_match_data (void);
! EXFUN (Flooking_at, 2);
  extern int fast_string_match (Lisp_Object, Lisp_Object);
  extern int fast_c_string_match_ignore_case (Lisp_Object, const char *);
  extern int fast_string_match_ignore_case (Lisp_Object, Lisp_Object);

=== modified file 'src/search.c'
*** src/search.c	2010-08-09 09:35:21 +0000
--- src/search.c	2010-09-16 17:27:35 +0000
***************
*** 281,286 ****
--- 281,309 ----
  
  \f
  static Lisp_Object
+ looking_at_literally (Lisp_Object string)
+ {
+   int start_byte = CHAR_TO_BYTE (PT);
+   int end_byte, end = PT + SCHARS (string);
+ 
+   CHECK_STRING (string);
+ 
+   if (PT < GPT && GPT < end)
+     move_gap (PT);
+ 
+   if (end > ZV)
+     return Qnil;
+   
+   end_byte = CHAR_TO_BYTE (end);
+ 
+   if (! memcmp (SDATA (string), BYTE_POS_ADDR (start_byte),
+ 		end_byte - start_byte))
+     return Qt;
+   else
+     return Qnil;
+ }
+ 
+ static Lisp_Object
  looking_at_1 (Lisp_Object string, int posix)
  {
    Lisp_Object val;
***************
*** 357,370 ****
    return val;
  }
  
! DEFUN ("looking-at", Flooking_at, Slooking_at, 1, 1, 0,
         doc: /* Return t if text after point matches regular expression REGEXP.
  This function modifies the match data that `match-beginning',
  `match-end' and `match-data' access; save and restore the match
! data if you want to preserve them.  */)
!   (Lisp_Object regexp)
  {
!   return looking_at_1 (regexp, 0);
  }
  
  DEFUN ("posix-looking-at", Fposix_looking_at, Sposix_looking_at, 1, 1, 0,
--- 380,398 ----
    return val;
  }
  
! DEFUN ("looking-at", Flooking_at, Slooking_at, 1, 2, 0,
         doc: /* Return t if text after point matches regular expression REGEXP.
  This function modifies the match data that `match-beginning',
  `match-end' and `match-data' access; save and restore the match
! data if you want to preserve them.
! If LITERALP is non-nil, REGEXP will be interpreted as a string, and the
! match data will not be modified.  */)
!   (Lisp_Object regexp, Lisp_Object literalp)
  {
!   if (NILP (literalp))
!     return looking_at_literally (regexp);
!   else
!     return looking_at_1 (regexp, 0);
  }
  
  DEFUN ("posix-looking-at", Fposix_looking_at, Sposix_looking_at, 1, 1, 0,


[-- Attachment #3: Type: text/plain, Size: 313 bytes --]


Benchmarking shows the expected -- that this is a *lot* faster than the
alternatives.  A
(while (or (looking-at "thing" t) (search-forward "\nthing")))
loop is about 5% slower than a pure search-forward loop.

-- 
(domestic pets only, the antidote for overdose, milk.)
  larsi@gnus.org * Lars Magne Ingebrigtsen

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Compiling Elisp to a native code with a GCC plugin
  2010-09-15 15:51               ` Lars Magne Ingebrigtsen
  2010-09-15 15:57                 ` Leo
@ 2010-09-16  2:57                 ` Stephen J. Turnbull
  2010-09-16  6:54                   ` David Kastrup
  2010-09-16 17:01                   ` Lars Magne Ingebrigtsen
  1 sibling, 2 replies; 97+ messages in thread
From: Stephen J. Turnbull @ 2010-09-16  2:57 UTC (permalink / raw)
  To: Lars Magne Ingebrigtsen; +Cc: emacs-devel

Lars Magne Ingebrigtsen writes:

 > Is there any function like
 > 
 > (is-the-string-following-point-equal-to-this-string-p "foo ")

Does every one-line function need to be a built-in?

(defun is-the-string-following-point-equal-to-this-string-p (s)
  (string= s (buffer-substring (point) (+ (point) (length s)))))

or

(defun is-the-string-following-point-equal-to-this-string-p (s)
  (search-forward s (+ (point) (length s)) t))




^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Compiling Elisp to a native code with a GCC plugin
  2010-09-16  2:57                 ` Stephen J. Turnbull
@ 2010-09-16  6:54                   ` David Kastrup
  2010-09-16  8:10                     ` Stephen J. Turnbull
  2010-09-16 17:01                   ` Lars Magne Ingebrigtsen
  1 sibling, 1 reply; 97+ messages in thread
From: David Kastrup @ 2010-09-16  6:54 UTC (permalink / raw)
  To: emacs-devel

"Stephen J. Turnbull" <stephen@xemacs.org> writes:

> Lars Magne Ingebrigtsen writes:
>
>  > Is there any function like
>  > 
>  > (is-the-string-following-point-equal-to-this-string-p "foo ")
>
> Does every one-line function need to be a built-in?
>
> (defun is-the-string-following-point-equal-to-this-string-p (s)
>   (string= s (buffer-substring (point) (+ (point) (length s)))))
>
> or
>
> (defun is-the-string-following-point-equal-to-this-string-p (s)
>   (search-forward s (+ (point) (length s)) t))

The former will signal an error when the string is longer than the rest
of the buffer.  The latter won't.

You can't figure this out by reading the doc strings of the used
functions.  You have to read their source code.

Since a user is not likely to pick the correct one-liner, it might make
sense to define a function for that.

-- 
David Kastrup




^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Compiling Elisp to a native code with a GCC plugin
  2010-09-16  6:54                   ` David Kastrup
@ 2010-09-16  8:10                     ` Stephen J. Turnbull
  2010-09-16  8:31                       ` David Kastrup
  0 siblings, 1 reply; 97+ messages in thread
From: Stephen J. Turnbull @ 2010-09-16  8:10 UTC (permalink / raw)
  To: David Kastrup; +Cc: emacs-devel

David Kastrup writes:

 > > (defun is-the-string-following-point-equal-to-this-string-p (s)
 > >   (string= s (buffer-substring (point) (+ (point) (length s)))))
 > >
 > > or
 > >
 > > (defun is-the-string-following-point-equal-to-this-string-p (s)
 > >   (search-forward s (+ (point) (length s)) t))

 > The former will signal an error when the string is longer than the
 > rest of the buffer.  The latter won't.
 > 
 > You can't figure this out by reading the doc strings of the used
 > functions.  You have to read their source code.
 > 
 > Since a user is not likely to pick the correct one-liner, it might
 > make sense to define a function for that.

It might, but I would prefer a documentation patch. ;-)



^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Compiling Elisp to a native code with a GCC plugin
  2010-09-16  8:10                     ` Stephen J. Turnbull
@ 2010-09-16  8:31                       ` David Kastrup
  0 siblings, 0 replies; 97+ messages in thread
From: David Kastrup @ 2010-09-16  8:31 UTC (permalink / raw)
  To: emacs-devel

"Stephen J. Turnbull" <stephen@xemacs.org> writes:

> David Kastrup writes:
>
>  > > (defun is-the-string-following-point-equal-to-this-string-p (s)
>  > >   (string= s (buffer-substring (point) (+ (point) (length s)))))
>  > >
>  > > or
>  > >
>  > > (defun is-the-string-following-point-equal-to-this-string-p (s)
>  > >   (search-forward s (+ (point) (length s)) t))
>
>  > The former will signal an error when the string is longer than the
>  > rest of the buffer.  The latter won't.
>  > 
>  > You can't figure this out by reading the doc strings of the used
>  > functions.  You have to read their source code.
>  > 
>  > Since a user is not likely to pick the correct one-liner, it might
>  > make sense to define a function for that.
>
> It might, but I would prefer a documentation patch. ;-)

Can't really do the trick here.

"If START or END are outside the accessible buffer region, an error
is signaled.  In case you wanted to check for the presence of a string
str at buffer point, consider using `search-forward' with a LIMIT
of (+ (length str) (point)) and a NOERROR of t.

Or use
(condition-case nil
   (string= s (buffer-substring (point) (+ (point) (length s))))
  (args-out-of-range nil))

Or maybe
(string= s (buffer-substring (point) (min (point-max) (+ (point) (length s)))))
"

I mean, get real.  One can't write suggestions for all possible intended
uses into a DOC string.

Would make more sense to give buffer-substring a NOERROR argument of its
own.  Still ugh.

-- 
David Kastrup




^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Compiling Elisp to a native code with a GCC plugin
  2010-09-16  2:57                 ` Stephen J. Turnbull
  2010-09-16  6:54                   ` David Kastrup
@ 2010-09-16 17:01                   ` Lars Magne Ingebrigtsen
  2010-09-17  6:52                     ` Stephen J. Turnbull
  1 sibling, 1 reply; 97+ messages in thread
From: Lars Magne Ingebrigtsen @ 2010-09-16 17:01 UTC (permalink / raw)
  To: emacs-devel

"Stephen J. Turnbull" <stephen@xemacs.org> writes:

> Does every one-line function need to be a built-in?
>
> (defun is-the-string-following-point-equal-to-this-string-p (s)
>   (string= s (buffer-substring (point) (+ (point) (length s)))))
>
> or
>
> (defun is-the-string-following-point-equal-to-this-string-p (s)
>   (search-forward s (+ (point) (length s)) t))

Does everything have to be slow?  :-)

The first one (in addition to not really working all that well) makes
a benchmark of

(benchmark 10000 (goto-char (point-min)) (while (or (s (buffer-substring (point) (+ (point) (length "(defun "))) "(defun ") (search-forward "\n(defun " nil t)) (forward-line 1)))

take 24 seconds, while the non-consing version takes 9 seconds.  The
latter version is only 50% slower than the single-search-forward
version.

-- 
(domestic pets only, the antidote for overdose, milk.)
  larsi@gnus.org * Lars Magne Ingebrigtsen




^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Compiling Elisp to a native code with a GCC plugin
  2010-09-16 17:01                   ` Lars Magne Ingebrigtsen
@ 2010-09-17  6:52                     ` Stephen J. Turnbull
  2010-09-17 13:09                       ` Lars Magne Ingebrigtsen
  0 siblings, 1 reply; 97+ messages in thread
From: Stephen J. Turnbull @ 2010-09-17  6:52 UTC (permalink / raw)
  To: Lars Magne Ingebrigtsen; +Cc: emacs-devel

Lars Magne Ingebrigtsen writes:

 > Does everything have to be slow?  :-)

Wrong polarity.  The question here is "does everything have to be
fast?", and the answer is "no -- I mean, HELL NO!"

Make your case that this function needs to be faster than a pure elisp
implementation.

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Compiling Elisp to a native code with a GCC plugin
  2010-09-17  6:52                     ` Stephen J. Turnbull
@ 2010-09-17 13:09                       ` Lars Magne Ingebrigtsen
  2010-09-17 13:31                         ` David Kastrup
  2010-09-17 17:40                         ` Stephen J. Turnbull
  0 siblings, 2 replies; 97+ messages in thread
From: Lars Magne Ingebrigtsen @ 2010-09-17 13:09 UTC (permalink / raw)
  To: emacs-devel

"Stephen J. Turnbull" <stephen@xemacs.org> writes:

>  > Does everything have to be slow?  :-)
>
> Wrong polarity.  The question here is "does everything have to be
> fast?", and the answer is "no -- I mean, HELL NO!"
>
> Make your case that this function needs to be faster than a pure elisp
> implementation.

I think that's a rather ... stunning approach, but might explain many
things about XEmacs.

I've presented a use case, and I've demonstrated how all the alternative
implementations are 50-150% slower than my suggested new implementation,
and I've done the implementation, which turned out to be totally
trivial.  What more do you need?

-- 
(domestic pets only, the antidote for overdose, milk.)
  larsi@gnus.org * Lars Magne Ingebrigtsen

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Compiling Elisp to a native code with a GCC plugin
  2010-09-17 13:09                       ` Lars Magne Ingebrigtsen
@ 2010-09-17 13:31                         ` David Kastrup
  2010-09-17 13:39                           ` Lars Magne Ingebrigtsen
  2010-09-17 13:49                           ` Andreas Schwab
  2010-09-17 17:40                         ` Stephen J. Turnbull
  1 sibling, 2 replies; 97+ messages in thread
From: David Kastrup @ 2010-09-17 13:31 UTC (permalink / raw)
  To: emacs-devel

Lars Magne Ingebrigtsen <larsi@gnus.org> writes:

> "Stephen J. Turnbull" <stephen@xemacs.org> writes:
>
>>  > Does everything have to be slow?  :-)
>>
>> Wrong polarity.  The question here is "does everything have to be
>> fast?", and the answer is "no -- I mean, HELL NO!"
>>
>> Make your case that this function needs to be faster than a pure elisp
>> implementation.
>
> I think that's a rather ... stunning approach, but might explain many
> things about XEmacs.
>
> I've presented a use case, and I've demonstrated how all the alternative
> implementations are 50-150% slower than my suggested new implementation,
> and I've done the implementation, which turned out to be totally
> trivial.

Not really sure about that.

+ looking_at_literally (Lisp_Object string)
+ {
+   int start_byte = CHAR_TO_BYTE (PT);
+   int end_byte, end = PT + SCHARS (string);

PT + SCHARS (string) can overflow here.  Better check first rather than
later whether ZV - PT < SCHARS (string).

Yes, I know that most-positive-fixnum <= MAX_INT/2, but just on
principle.

+   end_byte = CHAR_TO_BYTE (end);
+ 
+   if (! memcmp (SDATA (string), BYTE_POS_ADDR (start_byte),
+ 		end_byte - start_byte))

That is assuming that both string and buffer are identically encoded
(nowadays that likely means both have the same multibyteness).

-- 
David Kastrup




^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Compiling Elisp to a native code with a GCC plugin
  2010-09-17 13:31                         ` David Kastrup
@ 2010-09-17 13:39                           ` Lars Magne Ingebrigtsen
  2010-09-17 13:55                             ` David Kastrup
  2010-09-17 13:49                           ` Andreas Schwab
  1 sibling, 1 reply; 97+ messages in thread
From: Lars Magne Ingebrigtsen @ 2010-09-17 13:39 UTC (permalink / raw)
  To: emacs-devel

David Kastrup <dak@gnu.org> writes:

> +   end_byte = CHAR_TO_BYTE (end);
> + 
> +   if (! memcmp (SDATA (string), BYTE_POS_ADDR (start_byte),
> + 		end_byte - start_byte))
>
> That is assuming that both string and buffer are identically encoded
> (nowadays that likely means both have the same multibyteness).

Which brings me back to the other question I had, about buffer
internals.  :-)

It was a guess based on Fbuffer_substring doing this:

  memcpy (SDATA (result), BYTE_POS_ADDR (start_byte), end_byte - start_byte);

So I thought that if you could create a string by just memcpy()-ing data
from the buffer, then it seemed likely that you could compare them with
memcmp().  But that's probably wrong?  

-- 
(domestic pets only, the antidote for overdose, milk.)
  larsi@gnus.org * Lars Magne Ingebrigtsen




^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Compiling Elisp to a native code with a GCC plugin
  2010-09-17 13:39                           ` Lars Magne Ingebrigtsen
@ 2010-09-17 13:55                             ` David Kastrup
  2010-09-17 14:18                               ` Lars Magne Ingebrigtsen
  0 siblings, 1 reply; 97+ messages in thread
From: David Kastrup @ 2010-09-17 13:55 UTC (permalink / raw)
  To: emacs-devel

Lars Magne Ingebrigtsen <larsi@gnus.org> writes:

> David Kastrup <dak@gnu.org> writes:
>
>> +   end_byte = CHAR_TO_BYTE (end);
>> + 
>> +   if (! memcmp (SDATA (string), BYTE_POS_ADDR (start_byte),
>> + 		end_byte - start_byte))
>>
>> That is assuming that both string and buffer are identically encoded
>> (nowadays that likely means both have the same multibyteness).
>
> Which brings me back to the other question I had, about buffer
> internals.  :-)
>
> It was a guess based on Fbuffer_substring doing this:
>
>   memcpy (SDATA (result), BYTE_POS_ADDR (start_byte), end_byte - start_byte);
>
> So I thought that if you could create a string by just memcpy()-ing data
> from the buffer, then it seemed likely that you could compare them with
> memcmp().  But that's probably wrong?

Fbuffer_substring also likely copies the multibyteness, not just the
bytes.

-- 
David Kastrup




^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Compiling Elisp to a native code with a GCC plugin
  2010-09-17 13:55                             ` David Kastrup
@ 2010-09-17 14:18                               ` Lars Magne Ingebrigtsen
  2010-09-17 14:57                                 ` David Kastrup
  2010-09-17 16:18                                 ` Eli Zaretskii
  0 siblings, 2 replies; 97+ messages in thread
From: Lars Magne Ingebrigtsen @ 2010-09-17 14:18 UTC (permalink / raw)
  To: emacs-devel

David Kastrup <dak@gnu.org> writes:

> Fbuffer_substring also likely copies the multibyteness, not just the
> bytes.

Yes, probably, but I don't know what that means.

The code, in more fullness, is:

  if (start < GPT && GPT < end)
    move_gap (start);

  if (! NILP (current_buffer->enable_multibyte_characters))
    result = make_uninit_multibyte_string (end - start, end_byte - start_byte);
  else
    result = make_uninit_string (end - start);
  memcpy (SDATA (result), BYTE_POS_ADDR (start_byte), end_byte - start_byte);

So no matter whether it creates a "multibyte string" or not (and I don't
know what the difference is), it still just does a memcpy from the
buffer representation over to the string representation.

But I've now read "33.1 Text Representations", and I think I understand
a bit better.  If you have a unibyte string with the bytes

#x68 #xe9 #x6c #x6c #x6f

in it, and you have a multibyte buffer with the string

héllo

in it, then they won't match.  But are they supposed to?

(equal (unibyte-string #x68 #xe9 #x6c #x6c #x6f) "héllo")
=> nil

I guess not.

So I'm thinking the memcmp() is sufficient to give the desired result.
Isn't it?

-- 
(domestic pets only, the antidote for overdose, milk.)
  larsi@gnus.org * Lars Magne Ingebrigtsen

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Compiling Elisp to a native code with a GCC plugin
  2010-09-17 14:18                               ` Lars Magne Ingebrigtsen
@ 2010-09-17 14:57                                 ` David Kastrup
  2010-09-17 15:06                                   ` Lars Magne Ingebrigtsen
  2010-09-17 16:18                                 ` Eli Zaretskii
  1 sibling, 1 reply; 97+ messages in thread
From: David Kastrup @ 2010-09-17 14:57 UTC (permalink / raw)
  To: emacs-devel

Lars Magne Ingebrigtsen <larsi@gnus.org> writes:

> David Kastrup <dak@gnu.org> writes:
>
>> Fbuffer_substring also likely copies the multibyteness, not just the
>> bytes.
>
> Yes, probably, but I don't know what that means.

This:

>   if (! NILP (current_buffer->enable_multibyte_characters))
>     result = make_uninit_multibyte_string (end - start, end_byte - start_byte);
>   else
>     result = make_uninit_string (end - start);


> So no matter whether it creates a "multibyte string" or not (and I
> don't know what the difference is), it still just does a memcpy from
> the buffer representation over to the string representation.

Sure.  But when you compare you can be in the unfortunate situation that
multibyteness _differs_.

In this case, the byte patterns can be the same and the texts still
different, and vice versa.

-- 
David Kastrup




^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Compiling Elisp to a native code with a GCC plugin
  2010-09-17 14:57                                 ` David Kastrup
@ 2010-09-17 15:06                                   ` Lars Magne Ingebrigtsen
  2010-09-17 15:24                                     ` Lars Magne Ingebrigtsen
  2010-09-17 16:11                                     ` David Kastrup
  0 siblings, 2 replies; 97+ messages in thread
From: Lars Magne Ingebrigtsen @ 2010-09-17 15:06 UTC (permalink / raw)
  To: emacs-devel

David Kastrup <dak@gnu.org> writes:

> In this case, the byte patterns can be the same and the texts still
> different, and vice versa.

So just to be clear here, you think this is correct:

(equalp (unibyte-string #x68 #xe9 #x6c #x6c #x6f) "héllo")
=> nil

But that this behaviour is incorrect?

(equalp (unibyte-string #x68 #x6c #x6c #x6f) "hllo")
=> t

But that this is correct:

(equalp (unibyte-string #x68 #xc3 #xa9 #x6c #x6c #x6f) "héllo")
=> nil

And this is incorrect?

(looking-at (unibyte-string #x68 #xc3 #xa9 #x6c #x6c #x6f) t)héllo
=> t

*scratches head*  Well, I guess I can see that...

-- 
(domestic pets only, the antidote for overdose, milk.)
  larsi@gnus.org * Lars Magne Ingebrigtsen




^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Compiling Elisp to a native code with a GCC plugin
  2010-09-17 15:06                                   ` Lars Magne Ingebrigtsen
@ 2010-09-17 15:24                                     ` Lars Magne Ingebrigtsen
  2010-09-17 16:11                                       ` Eli Zaretskii
  2010-09-17 16:33                                       ` David Kastrup
  2010-09-17 16:11                                     ` David Kastrup
  1 sibling, 2 replies; 97+ messages in thread
From: Lars Magne Ingebrigtsen @ 2010-09-17 15:24 UTC (permalink / raw)
  To: emacs-devel

Lars Magne Ingebrigtsen <larsi@gnus.org> writes:

> (looking-at (unibyte-string #x68 #xc3 #xa9 #x6c #x6c #x6f) t)héllo
> => t
>
> *scratches head*  Well, I guess I can see that...

I've looked at what Fequal does.  It first compares the number of
characters in the strings, and then the number of bytes, and then it
does a memcmp().  So the only change necessary in at_literal() was to
add a check that the number of bytes in the string and in the region
we're comparing with is the same.

-- 
(domestic pets only, the antidote for overdose, milk.)
  larsi@gnus.org * Lars Magne Ingebrigtsen




^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Compiling Elisp to a native code with a GCC plugin
  2010-09-17 15:24                                     ` Lars Magne Ingebrigtsen
@ 2010-09-17 16:11                                       ` Eli Zaretskii
  2010-09-17 16:33                                       ` David Kastrup
  1 sibling, 0 replies; 97+ messages in thread
From: Eli Zaretskii @ 2010-09-17 16:11 UTC (permalink / raw)
  To: Lars Magne Ingebrigtsen; +Cc: emacs-devel

> From: Lars Magne Ingebrigtsen <larsi@gnus.org>
> Date: Fri, 17 Sep 2010 17:24:34 +0200
> 
> Lars Magne Ingebrigtsen <larsi@gnus.org> writes:
> 
> > (looking-at (unibyte-string #x68 #xc3 #xa9 #x6c #x6c #x6f) t)héllo
> > => t
> >
> > *scratches head*  Well, I guess I can see that...
> 
> I've looked at what Fequal does.  It first compares the number of
> characters in the strings, and then the number of bytes, and then it
> does a memcmp().  So the only change necessary in at_literal() was to
> add a check that the number of bytes in the string and in the region
> we're comparing with is the same.

I don't recommend to get your hands dirty with unibyte strings and
unibyte buffers.  Just do it right for multibyte ones and wait until
someone hollers.




^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Compiling Elisp to a native code with a GCC plugin
  2010-09-17 15:24                                     ` Lars Magne Ingebrigtsen
  2010-09-17 16:11                                       ` Eli Zaretskii
@ 2010-09-17 16:33                                       ` David Kastrup
  2010-09-17 16:41                                         ` Andreas Schwab
  2010-09-17 17:24                                         ` Lars Magne Ingebrigtsen
  1 sibling, 2 replies; 97+ messages in thread
From: David Kastrup @ 2010-09-17 16:33 UTC (permalink / raw)
  To: emacs-devel

Lars Magne Ingebrigtsen <larsi@gnus.org> writes:

> Lars Magne Ingebrigtsen <larsi@gnus.org> writes:
>
>> (looking-at (unibyte-string #x68 #xc3 #xa9 #x6c #x6c #x6f) t)héllo
>> => t
>>
>> *scratches head*  Well, I guess I can see that...
>
> I've looked at what Fequal does.  It first compares the number of
> characters in the strings, and then the number of bytes, and then it
> does a memcmp().

It also checks that the multibyteness is the same IIRC.

-- 
David Kastrup




^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Compiling Elisp to a native code with a GCC plugin
  2010-09-17 16:33                                       ` David Kastrup
@ 2010-09-17 16:41                                         ` Andreas Schwab
  2010-09-17 17:17                                           ` David Kastrup
  2010-09-17 17:24                                         ` Lars Magne Ingebrigtsen
  1 sibling, 1 reply; 97+ messages in thread
From: Andreas Schwab @ 2010-09-17 16:41 UTC (permalink / raw)
  To: David Kastrup; +Cc: emacs-devel

David Kastrup <dak@gnu.org> writes:

> Lars Magne Ingebrigtsen <larsi@gnus.org> writes:
>
>> Lars Magne Ingebrigtsen <larsi@gnus.org> writes:
>>
>>> (looking-at (unibyte-string #x68 #xc3 #xa9 #x6c #x6c #x6f) t)héllo
>>> => t
>>>
>>> *scratches head*  Well, I guess I can see that...
>>
>> I've looked at what Fequal does.  It first compares the number of
>> characters in the strings, and then the number of bytes, and then it
>> does a memcmp().
>
> It also checks that the multibyteness is the same IIRC.

If both the number of chars and bytes agree then they must be of the
same multibyteness.

Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."



^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Compiling Elisp to a native code with a GCC plugin
  2010-09-17 16:41                                         ` Andreas Schwab
@ 2010-09-17 17:17                                           ` David Kastrup
  2010-09-17 18:24                                             ` David Kastrup
  2010-09-17 18:53                                             ` Stephen J. Turnbull
  0 siblings, 2 replies; 97+ messages in thread
From: David Kastrup @ 2010-09-17 17:17 UTC (permalink / raw)
  To: emacs-devel

Andreas Schwab <schwab@linux-m68k.org> writes:

> David Kastrup <dak@gnu.org> writes:
>
>> Lars Magne Ingebrigtsen <larsi@gnus.org> writes:
>>
>>> Lars Magne Ingebrigtsen <larsi@gnus.org> writes:
>>>
>>>> (looking-at (unibyte-string #x68 #xc3 #xa9 #x6c #x6c #x6f) t)héllo
>>>> => t
>>>>
>>>> *scratches head*  Well, I guess I can see that...
>>>
>>> I've looked at what Fequal does.  It first compares the number of
>>> characters in the strings, and then the number of bytes, and then it
>>> does a memcmp().
>>
>> It also checks that the multibyteness is the same IIRC.
>
> If both the number of chars and bytes agree then they must be of the
> same multibyteness.

Why?

(length (string-as-multibyte "h\251llo")) => 5

(equal "h\251llo" (string-as-multibyte "h\251llo")) => nil

-- 
David Kastrup




^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Compiling Elisp to a native code with a GCC plugin
  2010-09-17 17:17                                           ` David Kastrup
@ 2010-09-17 18:24                                             ` David Kastrup
  2010-09-17 20:30                                               ` David Kastrup
  2010-09-17 18:53                                             ` Stephen J. Turnbull
  1 sibling, 1 reply; 97+ messages in thread
From: David Kastrup @ 2010-09-17 18:24 UTC (permalink / raw)
  To: emacs-devel

David Kastrup <dak@gnu.org> writes:

> Andreas Schwab <schwab@linux-m68k.org> writes:
>
>> David Kastrup <dak@gnu.org> writes:
>>
>>> Lars Magne Ingebrigtsen <larsi@gnus.org> writes:
>>>
>>>> Lars Magne Ingebrigtsen <larsi@gnus.org> writes:
>>>>
>>>>> (looking-at (unibyte-string #x68 #xc3 #xa9 #x6c #x6c #x6f) t)héllo
>>>>> => t
>>>>>
>>>>> *scratches head*  Well, I guess I can see that...
>>>>
>>>> I've looked at what Fequal does.  It first compares the number of
>>>> characters in the strings, and then the number of bytes, and then it
>>>> does a memcmp().
>>>
>>> It also checks that the multibyteness is the same IIRC.
>>
>> If both the number of chars and bytes agree then they must be of the
>> same multibyteness.
>
> Why?
>
> (length (string-as-multibyte "h\251llo")) => 5
>
> (equal "h\251llo" (string-as-multibyte "h\251llo")) => nil

Just seen that string-as-multibyte actually _does_ convert the content
(did not use to do so pre-23).

So string equality indeed does not check explicitly for multibyteness
which is sort of ugly:

(equal "abc" (string-to-multibyte "abc")) => t
(string= "abc" (string-to-multibyte "abc")) => t
(multibyte-string-p "abc") => nil
(multibyte-string-p (string-to-multibyte "abc")) => t

So it is likely that the correct way to deal with the uni/multibyte
issue in the case of buffer/string comparison is the same: just check
that both character and byte counts are identical.

-- 
David Kastrup




^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Compiling Elisp to a native code with a GCC plugin
  2010-09-17 18:24                                             ` David Kastrup
@ 2010-09-17 20:30                                               ` David Kastrup
  2010-09-17 20:49                                                 ` Lars Magne Ingebrigtsen
  0 siblings, 1 reply; 97+ messages in thread
From: David Kastrup @ 2010-09-17 20:30 UTC (permalink / raw)
  To: emacs-devel

David Kastrup <dak@gnu.org> writes:

> So it is likely that the correct way to deal with the uni/multibyte
> issue in the case of buffer/string comparison is the same: just check
> that both character and byte counts are identical.

Oh, we are talking about `looking-at'.  Did I mention that we should
take case-fold-search into account?

-- 
David Kastrup




^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Compiling Elisp to a native code with a GCC plugin
  2010-09-17 20:30                                               ` David Kastrup
@ 2010-09-17 20:49                                                 ` Lars Magne Ingebrigtsen
  2010-09-18  4:31                                                   ` David Kastrup
  0 siblings, 1 reply; 97+ messages in thread
From: Lars Magne Ingebrigtsen @ 2010-09-17 20:49 UTC (permalink / raw)
  To: emacs-devel

David Kastrup <dak@gnu.org> writes:

> Oh, we are talking about `looking-at'.  Did I mention that we should
> take case-fold-search into account?

Or we could say that LITERALP literally means literal.  :-)

-- 
(domestic pets only, the antidote for overdose, milk.)
  larsi@gnus.org * Lars Magne Ingebrigtsen




^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Compiling Elisp to a native code with a GCC plugin
  2010-09-17 20:49                                                 ` Lars Magne Ingebrigtsen
@ 2010-09-18  4:31                                                   ` David Kastrup
  0 siblings, 0 replies; 97+ messages in thread
From: David Kastrup @ 2010-09-18  4:31 UTC (permalink / raw)
  To: emacs-devel

Lars Magne Ingebrigtsen <larsi@gnus.org> writes:

> David Kastrup <dak@gnu.org> writes:
>
>> Oh, we are talking about `looking-at'.  Did I mention that we should
>> take case-fold-search into account?
>
> Or we could say that LITERALP literally means literal.  :-)

No point in creating an API different to the other search functions.
Come to think of it, all of them have separate names rather than a
LITERAL argument.  It is just replace-match that has LITERAL (as well as
an explicit FIXEDCASE).  And all of them set match-data, apparently.

-- 
David Kastrup

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Compiling Elisp to a native code with a GCC plugin
  2010-09-17 17:17                                           ` David Kastrup
  2010-09-17 18:24                                             ` David Kastrup
@ 2010-09-17 18:53                                             ` Stephen J. Turnbull
  2010-09-17 20:57                                               ` Eli Zaretskii
  1 sibling, 1 reply; 97+ messages in thread
From: Stephen J. Turnbull @ 2010-09-17 18:53 UTC (permalink / raw)
  To: David Kastrup; +Cc: emacs-devel

David Kastrup writes:

 > > If both the number of chars and bytes agree then they must be of
 > > the same multibyteness.
 > 
 > Why?

Actually, there's an exceptional case: if both strings are pure ASCII.
In that case it might be possible that one string is multibyte and the
other unibyte, while the numbers of characters and of bytes are equal.
However, in that case the two strings have the same semantics, so I
would suspect that allowing them to compare equal if their
representations are equal (ignoring multi-byte-ness) is intentional.

The example you gave proves nothing, however.  In fact, when that
string is presented by `string-as-multibyte', ?\351 will be converted
to a private space character in Unicode and therefore will have more
than one byte in its representation.  Thus the length in bytes of the
string (as multibyte) will be 7 (or maybe more, I forget which private
space naked bytes live in).  Here's one way to get byte length of a
string:

(defun string-byte-count (s)
  (length (if (string-multibyte-p s) (encode-coding-string s 'utf-8) s)))

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Compiling Elisp to a native code with a GCC plugin
  2010-09-17 18:53                                             ` Stephen J. Turnbull
@ 2010-09-17 20:57                                               ` Eli Zaretskii
  2010-09-18 14:19                                                 ` Stephen J. Turnbull
  0 siblings, 1 reply; 97+ messages in thread
From: Eli Zaretskii @ 2010-09-17 20:57 UTC (permalink / raw)
  To: Stephen J. Turnbull; +Cc: dak, emacs-devel

> From: "Stephen J. Turnbull" <stephen@xemacs.org>
> Date: Sat, 18 Sep 2010 03:53:27 +0900
> Cc: emacs-devel@gnu.org
> 
> Actually, there's an exceptional case: if both strings are pure ASCII.
> In that case it might be possible that one string is multibyte and the
> other unibyte, while the numbers of characters and of bytes are equal.

A unibyte string in Emacs has its `size_byte' member set to a negative
value:

    /* Mark STR as a unibyte string.  */
    #define STRING_SET_UNIBYTE(STR)  \
      do { if (EQ (STR, empty_multibyte_string))  \
	  (STR) = empty_unibyte_string;  \
	else XSTRING (STR)->size_byte = -1; } while (0)

By contrast, a multibyte string holds there the number of bytes in its
internal representation.  So a pure ASCII string could be unibyte or
multibyte, and the `size_byte' member will be negative in the former
case and positive in the latter case.

However, AFAIK Emacs always makes a unibyte string if all the
characters are pure ASCII.  So this does not matter in practice.

> The example you gave proves nothing, however.  In fact, when that
> string is presented by `string-as-multibyte', ?\351 will be converted
> to a private space character in Unicode and therefore will have more
> than one byte in its representation.  Thus the length in bytes of the
> string (as multibyte) will be 7 (or maybe more, I forget which private
> space naked bytes live in).  Here's one way to get byte length of a
> string:
> 
> (defun string-byte-count (s)
>   (length (if (string-multibyte-p s) (encode-coding-string s 'utf-8) s)))

See above: this is not accurate.



^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Compiling Elisp to a native code with a GCC plugin
  2010-09-17 20:57                                               ` Eli Zaretskii
@ 2010-09-18 14:19                                                 ` Stephen J. Turnbull
  2010-09-18 15:46                                                   ` Eli Zaretskii
  2010-09-18 15:58                                                   ` Stefan Monnier
  0 siblings, 2 replies; 97+ messages in thread
From: Stephen J. Turnbull @ 2010-09-18 14:19 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: dak, emacs-devel

Eli Zaretskii writes:

 > However, AFAIK Emacs always makes a unibyte string if all the
 > characters are pure ASCII.  So this does not matter in practice.

That's true in Stefan's personal Emacs, AIUI, but otherwise I bet aset
on a multibyte string can turn it into pure ASCII.

 > > (defun string-byte-count (s)
 > >   (length (if (string-multibyte-p s) (encode-coding-string s
 > >   'utf-8) s)))
 > 
 > See above: this is not accurate.

I don't understand what "above" you're referring to.  Unless maybe you
mean because in unibyte strings the byte count is negative?



^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Compiling Elisp to a native code with a GCC plugin
  2010-09-18 14:19                                                 ` Stephen J. Turnbull
@ 2010-09-18 15:46                                                   ` Eli Zaretskii
  2010-09-18 15:58                                                   ` Stefan Monnier
  1 sibling, 0 replies; 97+ messages in thread
From: Eli Zaretskii @ 2010-09-18 15:46 UTC (permalink / raw)
  To: Stephen J. Turnbull; +Cc: dak, emacs-devel

> From: "Stephen J. Turnbull" <stephen@xemacs.org>
> Cc: dak@gnu.org,
>     emacs-devel@gnu.org
> Date: Sat, 18 Sep 2010 23:19:11 +0900
> 
>  > > (defun string-byte-count (s)
>  > >   (length (if (string-multibyte-p s) (encode-coding-string s
>  > >   'utf-8) s)))
>  > 
>  > See above: this is not accurate.
> 
> I don't understand what "above" you're referring to.

The definition of STRING_SET_UNIBYTE.

> Unless maybe you mean because in unibyte strings the byte count is
> negative?

Yes.



^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Compiling Elisp to a native code with a GCC plugin
  2010-09-18 14:19                                                 ` Stephen J. Turnbull
  2010-09-18 15:46                                                   ` Eli Zaretskii
@ 2010-09-18 15:58                                                   ` Stefan Monnier
  1 sibling, 0 replies; 97+ messages in thread
From: Stefan Monnier @ 2010-09-18 15:58 UTC (permalink / raw)
  To: Stephen J. Turnbull; +Cc: Eli Zaretskii, dak, emacs-devel

>> However, AFAIK Emacs always makes a unibyte string if all the
>> characters are pure ASCII.  So this does not matter in practice.
> That's true in Stefan's personal Emacs, AIUI, but otherwise I bet aset
> on a multibyte string can turn it into pure ASCII.

Actually, it's not true in my personal branch because I use there some
(failed-)experimental code which distinguishes between
"unibyte/multibyte/anybyte" where "anybyte" means "could be either" and
is used for purely ascii strings (or more specifically, it's used for
those strings where the unibyte and multibyte representation are the
same).
It might have been a good idea, but it seems this is not worth the
trouble changing now.  So all that's left from this experiment is that
when I see problem with multibyte/unibyte I absolutely need to try it
with the trunk code, because it may just be a bug in my branch.

        Stefan

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Compiling Elisp to a native code with a GCC plugin
  2010-09-17 16:33                                       ` David Kastrup
  2010-09-17 16:41                                         ` Andreas Schwab
@ 2010-09-17 17:24                                         ` Lars Magne Ingebrigtsen
  1 sibling, 0 replies; 97+ messages in thread
From: Lars Magne Ingebrigtsen @ 2010-09-17 17:24 UTC (permalink / raw)
  To: emacs-devel

David Kastrup <dak@gnu.org> writes:

> It also checks that the multibyteness is the same IIRC.

It's possible, but I can't find that in the code.  I think this is the
relevant bit in internal_equal():

    case Lisp_String:
      if (SCHARS (o1) != SCHARS (o2))
	return 0;
      if (SBYTES (o1) != SBYTES (o2))
	return 0;
      if (memcmp (SDATA (o1), SDATA (o2), SBYTES (o1)))
	return 0;
      if (props && !compare_string_intervals (o1, o2))
	return 0;
      return 1;


-- 
(domestic pets only, the antidote for overdose, milk.)
  larsi@gnus.org * Lars Magne Ingebrigtsen




^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Compiling Elisp to a native code with a GCC plugin
  2010-09-17 15:06                                   ` Lars Magne Ingebrigtsen
  2010-09-17 15:24                                     ` Lars Magne Ingebrigtsen
@ 2010-09-17 16:11                                     ` David Kastrup
  1 sibling, 0 replies; 97+ messages in thread
From: David Kastrup @ 2010-09-17 16:11 UTC (permalink / raw)
  To: emacs-devel

Lars Magne Ingebrigtsen <larsi@gnus.org> writes:

> David Kastrup <dak@gnu.org> writes:
>
>> In this case, the byte patterns can be the same and the texts still
>> different, and vice versa.
>
> So just to be clear here, you think this is correct:
>
> (equalp (unibyte-string #x68 #xe9 #x6c #x6c #x6f) "héllo")
> => nil

equalp does not exist.  I'll change to equal for the rest of the
discussion.

> But that this behaviour is incorrect?
>
> (equal (unibyte-string #x68 #x6c #x6c #x6f) "hllo")
> => t

No.  Correct.

> But that this is correct:
>
> (equal (unibyte-string #x68 #xc3 #xa9 #x6c #x6c #x6f) "héllo")
> => nil

Still correct.  But notice that
(equal
   (string-as-multibyte (unibyte-string #x68 #xc3 #xa9 #x6c #x6c #x6f)) "héllo")
=> t

And notice

(equal (unibyte-string #x68 #xe9 #x6c #x6c #x6f) "h\351llo")
=> t
whereas (equal "h\351llo" "héllo") => nil
while (equal "h\u00e9llo" "héllo") => t

It should probably be called an error that the print form of "h\351llo"
is "héllo" on an utf-8 terminal.  That does not particularly help to
make things less confusing since the input forms are non-equivalent.

-- 
David Kastrup




^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Compiling Elisp to a native code with a GCC plugin
  2010-09-17 14:18                               ` Lars Magne Ingebrigtsen
  2010-09-17 14:57                                 ` David Kastrup
@ 2010-09-17 16:18                                 ` Eli Zaretskii
  2010-09-17 16:24                                   ` Lars Magne Ingebrigtsen
  1 sibling, 1 reply; 97+ messages in thread
From: Eli Zaretskii @ 2010-09-17 16:18 UTC (permalink / raw)
  To: Lars Magne Ingebrigtsen; +Cc: emacs-devel

> From: Lars Magne Ingebrigtsen <larsi@gnus.org>
> Date: Fri, 17 Sep 2010 16:18:30 +0200
> 
> So I'm thinking the memcmp() is sufficient to give the desired result.
> Isn't it?

No, because of the gap issue.  That's why Fbuffer_substring moves the
gap out of its way:

  if (start < GPT && GPT < end)
    move_gap (start);

This ensures that the region of buffer text between START and END is
contiguous, without the gap.  Which is why you can really use memcpy
et al.



^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Compiling Elisp to a native code with a GCC plugin
  2010-09-17 16:18                                 ` Eli Zaretskii
@ 2010-09-17 16:24                                   ` Lars Magne Ingebrigtsen
  2010-09-17 16:39                                     ` Eli Zaretskii
  2010-09-17 16:39                                     ` Eli Zaretskii
  0 siblings, 2 replies; 97+ messages in thread
From: Lars Magne Ingebrigtsen @ 2010-09-17 16:24 UTC (permalink / raw)
  To: emacs-devel

Eli Zaretskii <eliz@gnu.org> writes:

> No, because of the gap issue.  That's why Fbuffer_substring moves the
> gap out of its way:
>
>   if (start < GPT && GPT < end)
>     move_gap (start);

Yeah, I do that to in at_literal().

The thing I'm must unclear on now is whether it's valid to say

int thing = PT;

and whether it's valid to say

PT + 1;

I've grepped through the code, and this seems to be used all over the
place, so I'm guessing that perhaps the size of a buffer is constrained
to be less than INT_MAX?  Even though PT is EMACS_INT.

-- 
(domestic pets only, the antidote for overdose, milk.)
  larsi@gnus.org * Lars Magne Ingebrigtsen




^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Compiling Elisp to a native code with a GCC plugin
  2010-09-17 16:24                                   ` Lars Magne Ingebrigtsen
@ 2010-09-17 16:39                                     ` Eli Zaretskii
  2010-09-17 17:30                                       ` Lars Magne Ingebrigtsen
  2010-09-17 16:39                                     ` Eli Zaretskii
  1 sibling, 1 reply; 97+ messages in thread
From: Eli Zaretskii @ 2010-09-17 16:39 UTC (permalink / raw)
  To: Lars Magne Ingebrigtsen; +Cc: emacs-devel

> From: Lars Magne Ingebrigtsen <larsi@gnus.org>
> Date: Fri, 17 Sep 2010 18:24:34 +0200
> 
> The thing I'm must unclear on now is whether it's valid to say
> 
> int thing = PT;
> 
> and whether it's valid to say
> 
> PT + 1;

PT is just a number, an integral data type.  So adding one to it is
okay.  But storing into an `int' is not, because PT is an EMACS_INT,
see `struct buffer':

    struct buffer
    {
      ...
      /* Char position of point in buffer.  */
      EMACS_INT pt;
      /* Byte position of point in buffer.  */
      EMACS_INT pt_byte;

EMACS_INT is a 64-bit data type on 64-bit machines, so assigning it to
an int is a bug waiting to happen.  You should do this instead:

  EMACS_INT thing = PT;

> I've grepped through the code, and this seems to be used all over the
> place

Each place where you see PT assigned to an int is a bug, please either
report it or fix it right away.

> so I'm guessing that perhaps the size of a buffer is constrained
> to be less than INT_MAX?

No, it's constrained by most-positive-fixnum (MOST_POSITIVE_FIXNUM in
C), but we still have many places where we use a int, which is why
buffers larger than MAX_INT not always work well.

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Compiling Elisp to a native code with a GCC plugin
  2010-09-17 16:39                                     ` Eli Zaretskii
@ 2010-09-17 17:30                                       ` Lars Magne Ingebrigtsen
  2010-09-17 18:49                                         ` Eli Zaretskii
  0 siblings, 1 reply; 97+ messages in thread
From: Lars Magne Ingebrigtsen @ 2010-09-17 17:30 UTC (permalink / raw)
  To: emacs-devel

Eli Zaretskii <eliz@gnu.org> writes:

>> I've grepped through the code, and this seems to be used all over the
>> place
>
> Each place where you see PT assigned to an int is a bug, please either
> report it or fix it right away.

Right.  Things like 

      int pt = PT;

in buffer.c is easy enough, but is the following (from insdel.c)
correct?  

      int b = XINT (Fmarker_position (current_buffer->mark));
      int e = XINT (make_number (PT));

I don't really understand the last line at all.  It first creates a
Lisp_Object number from PT, and then gets the C-level EMACS_INT value
from that again?  And then casts it to an int? 
      
-- 
(domestic pets only, the antidote for overdose, milk.)
  larsi@gnus.org * Lars Magne Ingebrigtsen




^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Compiling Elisp to a native code with a GCC plugin
  2010-09-17 17:30                                       ` Lars Magne Ingebrigtsen
@ 2010-09-17 18:49                                         ` Eli Zaretskii
  0 siblings, 0 replies; 97+ messages in thread
From: Eli Zaretskii @ 2010-09-17 18:49 UTC (permalink / raw)
  To: Lars Magne Ingebrigtsen, Chong Yidong; +Cc: emacs-devel

> From: Lars Magne Ingebrigtsen <larsi@gnus.org>
> Date: Fri, 17 Sep 2010 19:30:52 +0200
> 
> Right.  Things like 
> 
>       int pt = PT;
> 
> in buffer.c is easy enough, but is the following (from insdel.c)
> correct?  
> 
>       int b = XINT (Fmarker_position (current_buffer->mark));

I think it's a bug, should use "EMACS_INT b".

>       int e = XINT (make_number (PT));
> 
> I don't really understand the last line at all.  It first creates a
> Lisp_Object number from PT, and then gets the C-level EMACS_INT value
> from that again?  And then casts it to an int? 

I think it's a bug, should use "EMACS_INT e = PT;"

As for why it converts it to Lisp integer and then back to a C
EMACS_INT: it seems to be a historical curiosity.  Revision 101018
made this change:

 === modified file 'src/insdel.c'
 --- src/insdel.c        2010-08-07 19:39:04 +0000
 +++ src/insdel.c        2010-08-07 20:26:55 +0000
 @@ -2055,13 +2055,12 @@ prepare_to_modify_buffer (EMACS_INT star
	&& !NILP (Vtransient_mark_mode)
	&& NILP (Vsaved_region_selection))
      {
 -      Lisp_Object b = Fmarker_position (current_buffer->mark);
 -      Lisp_Object e = make_number (PT);
 -      if (NILP (Fequal (b, e)))
 -       {
 -         validate_region (&b, &e);
 -         Vsaved_region_selection = make_buffer_string (XINT (b), XINT (e), 0);
 -       }
 +      int b = XINT (Fmarker_position (current_buffer->mark));
 +      int e = XINT (make_number (PT));
 +      if (b < e)
 +       Vsaved_region_selection = make_buffer_string (b, e, 0);
 +      else if (b > e)
 +       Vsaved_region_selection = make_buffer_string (e, b, 0);
      }

I guess the change mechanically used XINT to move the comparison to
C-level integers, but didn't pay attention to the fact that it would
be easier to just remove the make_number call and use PT directly.

Chong, did I miss something?



^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Compiling Elisp to a native code with a GCC plugin
  2010-09-17 16:24                                   ` Lars Magne Ingebrigtsen
  2010-09-17 16:39                                     ` Eli Zaretskii
@ 2010-09-17 16:39                                     ` Eli Zaretskii
  1 sibling, 0 replies; 97+ messages in thread
From: Eli Zaretskii @ 2010-09-17 16:39 UTC (permalink / raw)
  To: Lars Magne Ingebrigtsen; +Cc: emacs-devel

> From: Lars Magne Ingebrigtsen <larsi@gnus.org>
> Date: Fri, 17 Sep 2010 18:24:34 +0200
> 
> The thing I'm must unclear on now is whether it's valid to say
> 
> int thing = PT;
> 
> and whether it's valid to say
> 
> PT + 1;

PT is just a number, an integral data type.  So adding one to it is
okay.  But storing into an `int' is not, because PT is an EMACS_INT,
see `struct buffer':

    struct buffer
    {
      ...
      /* Char position of point in buffer.  */
      EMACS_INT pt;
      /* Byte position of point in buffer.  */
      EMACS_INT pt_byte;

EMACS_INT is a 64-bit data type on 64-bit machines, so assigning it to
an int is a bug waiting to happen.  You should do this instead:

  EMACS_INT thing = PT;

> I've grepped through the code, and this seems to be used all over the
> place

Each place where you see PT assigned to an int is a bug, please either
report it or fix it right away.

> so I'm guessing that perhaps the size of a buffer is constrained
> to be less than INT_MAX?

No, it's constrained by most-positive-fixnum (MOST_POSITIVE_FIXNUM in
C), but we still have many places where we use a int, which is why
buffers larger than MAX_INT not always work well.

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Compiling Elisp to a native code with a GCC plugin
  2010-09-17 13:31                         ` David Kastrup
  2010-09-17 13:39                           ` Lars Magne Ingebrigtsen
@ 2010-09-17 13:49                           ` Andreas Schwab
  2010-09-17 13:55                             ` Lars Magne Ingebrigtsen
  1 sibling, 1 reply; 97+ messages in thread
From: Andreas Schwab @ 2010-09-17 13:49 UTC (permalink / raw)
  To: David Kastrup; +Cc: emacs-devel

David Kastrup <dak@gnu.org> writes:

> Not really sure about that.
>
> + looking_at_literally (Lisp_Object string)
> + {
> +   int start_byte = CHAR_TO_BYTE (PT);
                       PT_BYTE
> +   int end_byte, end = PT + SCHARS (string);
>
> PT + SCHARS (string) can overflow here.  Better check first rather than
> later whether ZV - PT < SCHARS (string).
>
> Yes, I know that most-positive-fixnum <= MAX_INT/2

How do you "know" that?

Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."



^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Compiling Elisp to a native code with a GCC plugin
  2010-09-17 13:49                           ` Andreas Schwab
@ 2010-09-17 13:55                             ` Lars Magne Ingebrigtsen
  2010-09-17 14:31                               ` Wojciech Meyer
  2010-09-17 14:40                               ` Andreas Schwab
  0 siblings, 2 replies; 97+ messages in thread
From: Lars Magne Ingebrigtsen @ 2010-09-17 13:55 UTC (permalink / raw)
  To: emacs-devel

Andreas Schwab <schwab@linux-m68k.org> writes:

>> PT + SCHARS (string) can overflow here.  Better check first rather than
>> later whether ZV - PT < SCHARS (string).
>>
>> Yes, I know that most-positive-fixnum <= MAX_INT/2
>
> How do you "know" that?

Don't the Lisp integers use a bit for the type tag?

Anyway, I agree with the change (and I've fixed it along the lines of
what David said), but it seems rather, er, unlikely that a buffer would
approach MAX_INT in size.  But perhaps that's just me.

-- 
(domestic pets only, the antidote for overdose, milk.)
  larsi@gnus.org * Lars Magne Ingebrigtsen




^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Compiling Elisp to a native code with a GCC plugin
  2010-09-17 13:55                             ` Lars Magne Ingebrigtsen
@ 2010-09-17 14:31                               ` Wojciech Meyer
  2010-09-17 14:40                               ` Andreas Schwab
  1 sibling, 0 replies; 97+ messages in thread
From: Wojciech Meyer @ 2010-09-17 14:31 UTC (permalink / raw)
  To: emacs-devel

On Fri, Sep 17, 2010 at 2:55 PM, Lars Magne Ingebrigtsen <larsi@gnus.org> wrote:
> Andreas Schwab <schwab@linux-m68k.org> writes:
>
>>> PT + SCHARS (string) can overflow here.  Better check first rather than
>>> later whether ZV - PT < SCHARS (string).
>>>
>>> Yes, I know that most-positive-fixnum <= MAX_INT/2
>>
>> How do you "know" that?
>
> Don't the Lisp integers use a bit for the type tag?

Precise GC requires to distinguish between pointers and integers.
Sine the pointers are always aligned, the least significant bit can
be used for tagging integer.

However, i am not sure what is approach in Emacs, reading docs.

Wojciech



^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Compiling Elisp to a native code with a GCC plugin
  2010-09-17 13:55                             ` Lars Magne Ingebrigtsen
  2010-09-17 14:31                               ` Wojciech Meyer
@ 2010-09-17 14:40                               ` Andreas Schwab
  2010-09-17 14:47                                 ` Lars Magne Ingebrigtsen
  1 sibling, 1 reply; 97+ messages in thread
From: Andreas Schwab @ 2010-09-17 14:40 UTC (permalink / raw)
  To: emacs-devel

Lars Magne Ingebrigtsen <larsi@gnus.org> writes:

> Andreas Schwab <schwab@linux-m68k.org> writes:
>
>>> PT + SCHARS (string) can overflow here.  Better check first rather than
>>> later whether ZV - PT < SCHARS (string).
>>>
>>> Yes, I know that most-positive-fixnum <= MAX_INT/2
>>
>> How do you "know" that?
>
> Don't the Lisp integers use a bit for the type tag?

most-positive-fixnum is a variable defined in `data.c'.
Its value is 2305843009213693951

Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."



^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Compiling Elisp to a native code with a GCC plugin
  2010-09-17 14:40                               ` Andreas Schwab
@ 2010-09-17 14:47                                 ` Lars Magne Ingebrigtsen
  2010-09-17 15:10                                   ` Andreas Schwab
  0 siblings, 1 reply; 97+ messages in thread
From: Lars Magne Ingebrigtsen @ 2010-09-17 14:47 UTC (permalink / raw)
  To: emacs-devel

Andreas Schwab <schwab@linux-m68k.org> writes:

>> Don't the Lisp integers use a bit for the type tag?
>
> most-positive-fixnum is a variable defined in `data.c'.
> Its value is 2305843009213693951

And by that you mean "yes" or "no"?

(format "%x" most-positive-fixnum)
=> "1fffffffffffffff"

That's at least a few bits less than MAX_INT, isn't it?

-- 
(domestic pets only, the antidote for overdose, milk.)
  larsi@gnus.org * Lars Magne Ingebrigtsen




^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Compiling Elisp to a native code with a GCC plugin
  2010-09-17 14:47                                 ` Lars Magne Ingebrigtsen
@ 2010-09-17 15:10                                   ` Andreas Schwab
  2010-09-17 15:16                                     ` Lars Magne Ingebrigtsen
  0 siblings, 1 reply; 97+ messages in thread
From: Andreas Schwab @ 2010-09-17 15:10 UTC (permalink / raw)
  To: emacs-devel

Lars Magne Ingebrigtsen <larsi@gnus.org> writes:

> Andreas Schwab <schwab@linux-m68k.org> writes:
>
>>> Don't the Lisp integers use a bit for the type tag?
>>
>> most-positive-fixnum is a variable defined in `data.c'.
>> Its value is 2305843009213693951
>
> And by that you mean "yes" or "no"?
>
> (format "%x" most-positive-fixnum)
> => "1fffffffffffffff"
>
> That's at least a few bits less than MAX_INT, isn't it?

$ printf '#include <limits.h>\nINT_MAX\n' | gcc -E -xc - | tail -n1
2147483647

Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."



^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Compiling Elisp to a native code with a GCC plugin
  2010-09-17 15:10                                   ` Andreas Schwab
@ 2010-09-17 15:16                                     ` Lars Magne Ingebrigtsen
  2010-09-17 15:39                                       ` Andreas Schwab
  2010-09-17 16:14                                       ` Eli Zaretskii
  0 siblings, 2 replies; 97+ messages in thread
From: Lars Magne Ingebrigtsen @ 2010-09-17 15:16 UTC (permalink / raw)
  To: emacs-devel

Andreas Schwab <schwab@linux-m68k.org> writes:

>>>> Don't the Lisp integers use a bit for the type tag?
>>>
>>> most-positive-fixnum is a variable defined in `data.c'.
>>> Its value is 2305843009213693951
>>
>> And by that you mean "yes" or "no"?
>>
>> (format "%x" most-positive-fixnum)
>> => "1fffffffffffffff"
>>
>> That's at least a few bits less than MAX_INT, isn't it?
>
> $ printf '#include <limits.h>\nINT_MAX\n' | gcc -E -xc - | tail -n1
> 2147483647

You're being rather gnomic.  That most-positive-fixnum is a 64-bit
number in your Emacs, but that you have an include file somewhere that
says that INT_MAX is a 32-bit number doesn't really make much sense.

On a 32-bit machine, this is what I get.

(format "%x" most-positive-fixnum)
=> "fffffff"

Instead of posting these snippets, it would make the discussion go much
quicker if you actually said what it is you were trying to convey by
posting these numbers without comment.

-- 
(domestic pets only, the antidote for overdose, milk.)
  larsi@gnus.org * Lars Magne Ingebrigtsen




^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Compiling Elisp to a native code with a GCC plugin
  2010-09-17 15:16                                     ` Lars Magne Ingebrigtsen
@ 2010-09-17 15:39                                       ` Andreas Schwab
  2010-09-17 15:42                                         ` Lars Magne Ingebrigtsen
  2010-09-17 16:14                                       ` Eli Zaretskii
  1 sibling, 1 reply; 97+ messages in thread
From: Andreas Schwab @ 2010-09-17 15:39 UTC (permalink / raw)
  To: emacs-devel

Lars Magne Ingebrigtsen <larsi@gnus.org> writes:

> You're being rather gnomic.  That most-positive-fixnum is a 64-bit
> number in your Emacs, but that you have an include file somewhere that
> says that INT_MAX is a 32-bit number doesn't really make much sense.

Which part of "int is a 32-bit type" do you not understand?

Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."



^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Compiling Elisp to a native code with a GCC plugin
  2010-09-17 15:39                                       ` Andreas Schwab
@ 2010-09-17 15:42                                         ` Lars Magne Ingebrigtsen
  2010-09-17 16:04                                           ` Andreas Schwab
  0 siblings, 1 reply; 97+ messages in thread
From: Lars Magne Ingebrigtsen @ 2010-09-17 15:42 UTC (permalink / raw)
  To: emacs-devel

Andreas Schwab <schwab@linux-m68k.org> writes:

> Which part of "int is a 32-bit type" do you not understand?

The part where you said "int is a 32-bit type, but Lisp Integers
aren't"?  See how much faster a discussion can go if you use words? 

-- 
(domestic pets only, the antidote for overdose, milk.)
  larsi@gnus.org * Lars Magne Ingebrigtsen




^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Compiling Elisp to a native code with a GCC plugin
  2010-09-17 15:42                                         ` Lars Magne Ingebrigtsen
@ 2010-09-17 16:04                                           ` Andreas Schwab
  0 siblings, 0 replies; 97+ messages in thread
From: Andreas Schwab @ 2010-09-17 16:04 UTC (permalink / raw)
  To: emacs-devel

Lars Magne Ingebrigtsen <larsi@gnus.org> writes:

> Andreas Schwab <schwab@linux-m68k.org> writes:
>
>> Which part of "int is a 32-bit type" do you not understand?
>
> The part where you said "int is a 32-bit type, but Lisp Integers
> aren't"?

DK> Yes, I know that most-positive-fixnum <= MAX_INT/2

Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."



^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Compiling Elisp to a native code with a GCC plugin
  2010-09-17 15:16                                     ` Lars Magne Ingebrigtsen
  2010-09-17 15:39                                       ` Andreas Schwab
@ 2010-09-17 16:14                                       ` Eli Zaretskii
  2010-09-17 19:22                                         ` James Cloos
  1 sibling, 1 reply; 97+ messages in thread
From: Eli Zaretskii @ 2010-09-17 16:14 UTC (permalink / raw)
  To: Lars Magne Ingebrigtsen; +Cc: emacs-devel

> From: Lars Magne Ingebrigtsen <larsi@gnus.org>
> Date: Fri, 17 Sep 2010 17:16:25 +0200
> 
> Andreas Schwab <schwab@linux-m68k.org> writes:
> 
> >>>> Don't the Lisp integers use a bit for the type tag?
> >>>
> >>> most-positive-fixnum is a variable defined in `data.c'.
> >>> Its value is 2305843009213693951
> >>
> >> And by that you mean "yes" or "no"?
> >>
> >> (format "%x" most-positive-fixnum)
> >> => "1fffffffffffffff"
> >>
> >> That's at least a few bits less than MAX_INT, isn't it?
> >
> > $ printf '#include <limits.h>\nINT_MAX\n' | gcc -E -xc - | tail -n1
> > 2147483647
> 
> You're being rather gnomic.

Psst, Lars: it's pointless to ask Andreas for human-readable
explanations.  You won't get them.  He enjoys to get you puzzled.

The issue here is that EMACS_INT can be a 64-bit type (on a 64-bit
host), but MAX_INT is always the maximum possible value of a 32-bit
int, even on a 64-bit machine.



^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Compiling Elisp to a native code with a GCC plugin
  2010-09-17 16:14                                       ` Eli Zaretskii
@ 2010-09-17 19:22                                         ` James Cloos
  0 siblings, 0 replies; 97+ messages in thread
From: James Cloos @ 2010-09-17 19:22 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Lars Magne Ingebrigtsen, emacs-devel

>>>>> "EZ" == Eli Zaretskii <eliz@gnu.org> writes:

EZ> The issue here is that EMACS_INT can be a 64-bit type (on a 64-bit
EZ> host), but MAX_INT is always the maximum possible value of a 32-bit
EZ> int, even on a 64-bit machine.

More precisely that many 64 bit systems have sizeof(int)*CHAR_BIT == 32.

Alpha is an obvious exception.

It would have been much easier had someone just wrote s/INT_MAX/LONG_MAX/g.
(Except on DOZE64, of course, where sizeof(long)==sizeof(int)==4.)

-JimC
-- 
James Cloos <cloos@jhcloos.com>         OpenPGP: 1024D/ED7DAEA6

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Compiling Elisp to a native code with a GCC plugin
  2010-09-17 13:09                       ` Lars Magne Ingebrigtsen
  2010-09-17 13:31                         ` David Kastrup
@ 2010-09-17 17:40                         ` Stephen J. Turnbull
  2010-09-17 19:40                           ` Lars Magne Ingebrigtsen
  1 sibling, 1 reply; 97+ messages in thread
From: Stephen J. Turnbull @ 2010-09-17 17:40 UTC (permalink / raw)
  To: Lars Magne Ingebrigtsen; +Cc: emacs-devel

Lars Magne Ingebrigtsen writes:

 > I think that's a rather ... stunning approach, but might explain
 > many things about XEmacs.

I rather suspect it does. ;-)

 > I've presented a use case, and I've demonstrated how all the
 > alternative implementations are 50-150% slower than my suggested
 > new implementation, and I've done the implementation, which turned
 > out to be totally trivial.

Nothing that deals with Emacs strings or buffers is totally trivial,
as a general principle.

Then, again, it looks like David has discovered at least one bug
(texts with different values of multibyteness), maybe more (bounds
checking and integral type confusion), in your "totally trivial"
implementation already.

 > What more do you need?

First, I'm curious which machine and what data (buffer) you're using
that took 9 seconds to run that benchmark.  When I ran the benchmark
(environment described below) using XEmacs on a 1.8MHz Opteron machine
(quad core, but XEmacs can't take advantage of it) I discovered that
10,000 iterations takes ~300ms uncompiled, ~200ms compiled, and ~150ms
compiled and with the call to `search-forward' in question replaced
with a call to `ignore'.  I don't see a win here unless you're really
calling that function 10,000 times or more, and in a very tight loop.

So, is it really Gnus's habit to execute that form 10,000 times in a
loop so that its execution time dominates the user's perceived lag
time?  I bet that most uses involve parsing 20-40 RFC 822-style
headers, and the rest parse lines lazily.  If so, even the reported 9
second benchmark really amounts to a total of 50-100ms, which is less
than the "just noticable difference" for a fast human.

***** You can stop reading unless you really want to know the details. *****

Specifically, profiling 10000 iterations in a 997-byte buffer
containing three instances of "^(defun " (none at BOB) returned almost
faster than I can detect (288ms), based on

(defun is-the-string-following-point-equal-to-this-string-p (s)
  (search-forward s (+ (point) (length s)) t))

(defun test ()
  (while (or (is-the-string-following-point-equal-to-this-string-p "(defun ")
             (search-forward "\n(defun " nil t))
    (forward-line 1)))

(profile-expression '(let ((i 10000))
                        (while (> i 0)
                          (goto-char (point-min))
                          (test)
                          (setq i (1- i)))))

giving the profiling results below (note, all functions defined above
are *uncompiled*).  (Note that the total in the Ticks column may not
be the sum of the Ticks; this apparently has to do with reentering the
profiler and Ben didn't think it was worth fixing it.)

Function Name      Ticks/Total %Usage Calls GC-Usage/  Total
========================/===== ====== ===== ========/=======
search-forward       121/  121 42.160 80000                 
(profile overhead)   101/  101 35.192                       
is-the-string-following-point-equal-to-this-string-p
                      24/  129  8.362 40000                 
while                 11/  288  3.833 10001                 
or                     8/  231  2.787 40000                 
+                      6/    6  2.091 40000                 
forward-line           5/    5  1.742 30000                 
setq                   5/    6  1.742 10000                 
point                  2/    2  0.697 40000                 
test                   2/  262  0.697 10000                 
length                 1/    1  0.348 40000                 
>                      1/    1  0.348 10001                 
let                    0/  288  0.000     1                 
point-min              0/    0  0.000 10000                 
1-                     0/    0  0.000 10000                 
goto-char              0/    0  0.000 10000                 
------------------------------------------------------------
Total                288      100.000 380005        0


Ticks/Total     = Ticks this function/this function and descendants
Calls           = Number of calls to this function
GC-Usage/Total  = Lisp allocation this function/this function and descendants
One tick        = 1 ms

Compiled (including the profiling expression, that's `test1'), the
result is

Function Name      Ticks/Total %Usage Calls GC-Usage/  Total
========================/===== ====== ===== ========/=======
search-forward       112/  112 59.574 80000                 
(profile overhead)    42/   42 22.340                       
is-the-string-following-point-equal-to-this-string-p
                      16/   85  8.511 40000                 
test                  13/  181  6.915 10000                 
test1                  5/  190  2.660     1                 
------------------------------------------------------------
Total                190      100.000 130003        0


Ticks/Total     = Ticks this function/this function and descendants
Calls           = Number of calls to this function
GC-Usage/Total  = Lisp allocation this function/this function and descendants
One tick        = 1 ms

And here's the result with the search-forward in the predicate
replaced with ignore (which apparently conses because of the &rest
argument, and I'm not sure why the number is so huge):

Function Name      Ticks/Total %Usage Calls GC-Usage/  Total
========================/===== ====== ===== ========/=======
search-forward        74/   74 44.578 40000                 
(profile overhead)    32/   32 19.277                       
test                  24/  158 14.458 10000        0/3840000
ignore                19/   22 11.446 40000  3840000/3840000
is-the-string-following-point-equal-to-this-string-p
                      11/   46  6.627 40000        0/3840000
test1                  6/  166  3.614     1        0/3840000
------------------------------------------------------------
Total                  0      100.000 130003  3840000


Ticks/Total     = Ticks this function/this function and descendants
Calls           = Number of calls to this function
GC-Usage/Total  = Lisp allocation this function/this function and descendants
One tick        = 1 ms



^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Compiling Elisp to a native code with a GCC plugin
  2010-09-17 17:40                         ` Stephen J. Turnbull
@ 2010-09-17 19:40                           ` Lars Magne Ingebrigtsen
  0 siblings, 0 replies; 97+ messages in thread
From: Lars Magne Ingebrigtsen @ 2010-09-17 19:40 UTC (permalink / raw)
  To: emacs-devel

"Stephen J. Turnbull" <stephen@xemacs.org> writes:

> Then, again, it looks like David has discovered at least one bug
> (texts with different values of multibyteness), maybe more (bounds
> checking and integral type confusion), in your "totally trivial"
> implementation already.

Well, I didn't say the trivial code was bug-free.  :-)

> First, I'm curious which machine and what data (buffer) you're using
> that took 9 seconds to run that benchmark.

I was running it over the gnus-sum.el file.

> So, is it really Gnus's habit to execute that form 10,000 times in a
> loop so that its execution time dominates the user's perceived lag
> time?  I bet that most uses involve parsing 20-40 RFC 822-style
> headers, and the rest parse lines lazily.  If so, even the reported 9
> second benchmark really amounts to a total of 50-100ms, which is less
> than the "just noticable difference" for a fast human.

The reason I thought of this again now is that I'm doing IMAP handling.
The only way in IMAP to get info on marks and stuff it to get one line
per message, so if you have a 100K mail box, it's going to take some
time to sync up your marks.

Your example of "20-40" is somewhat irrelevant.  Of course everything is
fast enough if you just have little enough data.  The problem is getting
things to work fast enough in the presence of a lot of data.

-- 
(domestic pets only, the antidote for overdose, milk.)
  larsi@gnus.org * Lars Magne Ingebrigtsen

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Compiling Elisp to a native code with a GCC plugin
  2010-09-15 14:59         ` Stefan Monnier
  2010-09-15 15:09           ` Lars Magne Ingebrigtsen
@ 2010-09-15 15:46           ` Helmut Eller
  2010-09-15 16:28             ` Thomas Lord
  1 sibling, 1 reply; 97+ messages in thread
From: Helmut Eller @ 2010-09-15 15:46 UTC (permalink / raw)
  To: emacs-devel

* Stefan Monnier [2010-09-15 14:59] writes:

>>> - The main problem with Emacs regexps right now is that they have
>>> pathological cases where the match-time is enormous (potentially
>>> exponential explosion in the size of the input string).  To be
>>> worthwhile a replacement should address this problem, which basically
>>> needs it should not be based on backtracking.
>> Is it possible (theoretically) to implement all of Emacs regexps without
>> backtracking?  In particular those with back-references (\N) seem
>> problematic.  Or is it necessary to recognize "optimizable" regexps
>> before using a different regexp engine?
>
> IIRC regexps without back-refs can be matched (and searched) in O(N)
> where N is the length of the input.  With back-refs, I think (not sure)
> the theoretical bound is O(N^2), which requires
> a non-backtracking algorithm.
>
> So yes, we'd need to handle back-refs specially.  Several regexp engines
> do that already (they have a few different inner engines and choose
> which one to use based on the particular regexp at hand).

After googleing a bit I found this page 
http://swtch.com/~rsc/regexp/regexp1.html
which again links to this
http://perl.plover.com/NPC/NPC-3SAT.html
which says that regexp matching with backreferences is NP-complete.

Cox (the first page) seems to say that backtracking-with-memoization is
linear time at the expense of O(N) space.

Helmut




^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Compiling Elisp to a native code with a GCC plugin
  2010-09-15 15:46           ` Helmut Eller
@ 2010-09-15 16:28             ` Thomas Lord
  0 siblings, 0 replies; 97+ messages in thread
From: Thomas Lord @ 2010-09-15 16:28 UTC (permalink / raw)
  To: Helmut Eller; +Cc: emacs-devel

On Wed, 2010-09-15 at 17:46 +0200, Helmut Eller wrote:
> * Stefan Monnier [2010-09-15 14:59] writes:

> > IIRC regexps without back-refs can be matched (and searched) in O(N)
> > where N is the length of the input.  

Not quite.   Let R be the length of the pattern
and L be the length of the string we seek to match.

Assume that the pattern is a true regular expression
(no back-references, no sub-expression position
reporting, etc.)

The problem itself is O(R * L):  no algorithm can
guarantee doing better than the product of the two
lengths.

There is an algorithm (compiling the pattern to a DFA
first) which is O(2^R + L). Formally it is suboptimal.
There is a catch though.   For many common patterns,
either R is very small or we don't actually need
anything like 2^R steps to compile the pattern,
and so the *expected* case (for these patterns)
is O(L).

The Thompson "DFA caching algorithm" described briefly
in the Dragon compiler book is quite attractive
because it offers (for true regular expressions) a 
worst case complexity of O(R * L) ... so it is optimal ...
yet also delivers much closer to O(L) for many
common patterns.   The algorithm is a bit 
unattractive because it is tricky to implement 
well, very hard to generalize beyond true regular
expressions, and can easily lose to more naive NFA
algorithms by small but annoying constant factors.

> With back-refs, I think (not sure)
> > the theoretical bound is O(N^2), which requires
> > a non-backtracking algorithm.

> > So yes, we'd need to handle back-refs specially.  Several regexp engines
> > do that already (they have a few different inner engines and choose
> > which one to use based on the particular regexp at hand).
> 
> After googleing a bit I found this page 
> http://swtch.com/~rsc/regexp/regexp1.html
> which again links to this
> http://perl.plover.com/NPC/NPC-3SAT.html
> which says that regexp matching with backreferences is NP-complete.

The best known algorithms are not O(N^2) but,
rather, O(2^N) in the worst case.   It's dismal
how such an innocent seeming feature can cause
such havoc.

> Cox (the first page) seems to say that backtracking-with-memoization is
> linear time at the expense of O(N) space.

If he said that regarding worst case performance
and without mentioning some specific subset of 
true regular expressions that he's talking about,
he must have made a mistake.

For true regular expressions, the simpler case,
you can not beat O(R * L) time as the worst case.

-t

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Compiling Elisp to a native code with a GCC plugin
  2010-09-15 14:07     ` Stefan Monnier
  2010-09-15 14:27       ` Helmut Eller
@ 2010-09-15 21:04       ` Leo
  1 sibling, 0 replies; 97+ messages in thread
From: Leo @ 2010-09-15 21:04 UTC (permalink / raw)
  To: emacs-devel

On 2010-09-15 15:07 +0100, Stefan Monnier wrote:
> - The main problem with Emacs regexps right now is that they have
> pathological cases where the match-time is enormous (potentially
> exponential explosion in the size of the input string). To be
> worthwhile a replacement should address this problem, which basically
> needs it should not be based on backtracking.

Any good free regexp libs that suffer no such problem?

I have found http://code.google.com/p/re2/ by google but it is written
in C++.

Leo




^ permalink raw reply	[flat|nested] 97+ messages in thread

end of thread, other threads:[~2010-09-18 15:58 UTC | newest]

Thread overview: 97+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2010-09-14 19:12 Compiling Elisp to a native code with a GCC plugin Wojciech Meyer
2010-09-14 19:32 ` Tom Tromey
2010-09-14 19:45   ` Wojciech Meyer
2010-09-14 20:17     ` Lars Magne Ingebrigtsen
2010-09-14 20:52       ` Wojciech Meyer
2010-09-14 20:55       ` Tom Tromey
2010-09-14 21:05         ` Wojciech Meyer
2010-09-14 20:44     ` Tom Tromey
2010-09-14 21:00       ` Wojciech Meyer
2010-09-14 21:16         ` Tom Tromey
2010-09-14 21:29           ` Wojciech Meyer
2010-09-14 21:59             ` Tom Tromey
2010-09-14 22:37               ` Wojciech Meyer
2010-09-14 22:55                 ` Tom Tromey
2010-09-14 23:33                   ` Wojciech Meyer
2010-09-15  1:38                     ` Tom Tromey
2010-09-14 22:49               ` Wojciech Meyer
2010-09-14 23:13           ` Thomas Lord
2010-09-14 23:42             ` Wojciech Meyer
2010-09-15 10:47   ` Leo
2010-09-15 11:41     ` Andreas Schwab
2010-09-15 12:10       ` Wojciech Meyer
2010-09-15 14:07     ` Stefan Monnier
2010-09-15 14:27       ` Helmut Eller
2010-09-15 14:59         ` Stefan Monnier
2010-09-15 15:09           ` Lars Magne Ingebrigtsen
2010-09-15 15:31             ` Andreas Schwab
2010-09-15 15:35               ` Lars Magne Ingebrigtsen
2010-09-15 16:28                 ` Andreas Schwab
2010-09-16 16:57                   ` Lars Magne Ingebrigtsen
2010-09-15 15:42             ` Stefan Monnier
2010-09-15 15:51               ` Lars Magne Ingebrigtsen
2010-09-15 15:57                 ` Leo
2010-09-15 16:01                   ` Lars Magne Ingebrigtsen
2010-09-15 16:05                   ` David Kastrup
2010-09-15 16:23                     ` Leo
2010-09-15 16:37                       ` David Kastrup
2010-09-16 16:58                         ` Lars Magne Ingebrigtsen
2010-09-16 21:11                           ` Andreas Schwab
2010-09-16 23:17                             ` Lars Magne Ingebrigtsen
2010-09-17  8:13                               ` Eli Zaretskii
2010-09-17 13:17                                 ` Lars Magne Ingebrigtsen
2010-09-17 13:30                                   ` Eli Zaretskii
2010-09-17 13:34                                     ` Lars Magne Ingebrigtsen
2010-09-16 17:35                         ` Lars Magne Ingebrigtsen
2010-09-16  2:57                 ` Stephen J. Turnbull
2010-09-16  6:54                   ` David Kastrup
2010-09-16  8:10                     ` Stephen J. Turnbull
2010-09-16  8:31                       ` David Kastrup
2010-09-16 17:01                   ` Lars Magne Ingebrigtsen
2010-09-17  6:52                     ` Stephen J. Turnbull
2010-09-17 13:09                       ` Lars Magne Ingebrigtsen
2010-09-17 13:31                         ` David Kastrup
2010-09-17 13:39                           ` Lars Magne Ingebrigtsen
2010-09-17 13:55                             ` David Kastrup
2010-09-17 14:18                               ` Lars Magne Ingebrigtsen
2010-09-17 14:57                                 ` David Kastrup
2010-09-17 15:06                                   ` Lars Magne Ingebrigtsen
2010-09-17 15:24                                     ` Lars Magne Ingebrigtsen
2010-09-17 16:11                                       ` Eli Zaretskii
2010-09-17 16:33                                       ` David Kastrup
2010-09-17 16:41                                         ` Andreas Schwab
2010-09-17 17:17                                           ` David Kastrup
2010-09-17 18:24                                             ` David Kastrup
2010-09-17 20:30                                               ` David Kastrup
2010-09-17 20:49                                                 ` Lars Magne Ingebrigtsen
2010-09-18  4:31                                                   ` David Kastrup
2010-09-17 18:53                                             ` Stephen J. Turnbull
2010-09-17 20:57                                               ` Eli Zaretskii
2010-09-18 14:19                                                 ` Stephen J. Turnbull
2010-09-18 15:46                                                   ` Eli Zaretskii
2010-09-18 15:58                                                   ` Stefan Monnier
2010-09-17 17:24                                         ` Lars Magne Ingebrigtsen
2010-09-17 16:11                                     ` David Kastrup
2010-09-17 16:18                                 ` Eli Zaretskii
2010-09-17 16:24                                   ` Lars Magne Ingebrigtsen
2010-09-17 16:39                                     ` Eli Zaretskii
2010-09-17 17:30                                       ` Lars Magne Ingebrigtsen
2010-09-17 18:49                                         ` Eli Zaretskii
2010-09-17 16:39                                     ` Eli Zaretskii
2010-09-17 13:49                           ` Andreas Schwab
2010-09-17 13:55                             ` Lars Magne Ingebrigtsen
2010-09-17 14:31                               ` Wojciech Meyer
2010-09-17 14:40                               ` Andreas Schwab
2010-09-17 14:47                                 ` Lars Magne Ingebrigtsen
2010-09-17 15:10                                   ` Andreas Schwab
2010-09-17 15:16                                     ` Lars Magne Ingebrigtsen
2010-09-17 15:39                                       ` Andreas Schwab
2010-09-17 15:42                                         ` Lars Magne Ingebrigtsen
2010-09-17 16:04                                           ` Andreas Schwab
2010-09-17 16:14                                       ` Eli Zaretskii
2010-09-17 19:22                                         ` James Cloos
2010-09-17 17:40                         ` Stephen J. Turnbull
2010-09-17 19:40                           ` Lars Magne Ingebrigtsen
2010-09-15 15:46           ` Helmut Eller
2010-09-15 16:28             ` Thomas Lord
2010-09-15 21:04       ` Leo

Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/emacs.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).