unofficial mirror of help-gnu-emacs@gnu.org
* search across linebreaks
@ 2013-02-17  7:43 Eric Abrahamsen
  2013-02-17 13:13 ` Jude DaShiell
                   ` (4 more replies)
  0 siblings, 5 replies; 10+ messages in thread
From: Eric Abrahamsen @ 2013-02-17  7:43 UTC (permalink / raw)
  To: help-gnu-emacs

I'm going to need to do a large scale search-and-replace on a series of
text files, using a sort of dictionary or hash-table of search terms and
their replacement. The text files are filled to the usual fill column.
The search terms may be broken across linebreaks, and I'm not sure of
the best way to handle this. If it was regular English words I could
probably manage a programmatic version of `isearch-toggle-word', but in
this case these are solid strings, and might be broken anywhere.

The two solutions I can think of are: 1) break up the characters in the
search string and insert "\n?" between each one to create regexps to
search on, and 2) unfill the whole file at the start of the procedure
and then refill it afterwards. Neither of these seems like a great
idea -- does anyone have any brighter ideas?
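
A minimal sketch of option 1, assuming the terms live in a hash table
with `equal' as its test and that only a bare newline (no indentation)
can interrupt a term. The names `my/linebreak-regexp' and
`my/replace-terms' are made up for illustration:

```elisp
(defun my/linebreak-regexp (term)
  "Return a regexp matching TERM with an optional newline between characters."
  ;; Quote each character separately so regexp specials in TERM stay literal.
  (mapconcat (lambda (ch) (regexp-quote (char-to-string ch)))
             term
             "\n?"))

(defun my/replace-terms (table)
  "Replace every key of hash table TABLE by its value in the current buffer."
  (let ((case-fold-search nil))         ; match the terms case-sensitively
    (maphash
     (lambda (term replacement)
       (goto-char (point-min))
       (while (re-search-forward (my/linebreak-regexp term) nil t)
         ;; Literal replacement; a newline inside the match is dropped.
         (replace-match replacement t t)))
     table)))
```

A replacement swallows any newline that fell inside the match, so the
affected paragraphs would still need refilling afterwards. `maphash'
visits keys in no particular order, which matters if one term is a
substring of another.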

Thanks,
Eric





* Re: search across linebreaks
  2013-02-17  7:43 search across linebreaks Eric Abrahamsen
@ 2013-02-17 13:13 ` Jude DaShiell
       [not found] ` <mailman.20189.1361106838.855.help-gnu-emacs@gnu.org>
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 10+ messages in thread
From: Jude DaShiell @ 2013-02-17 13:13 UTC (permalink / raw)
  To: Eric Abrahamsen; +Cc: help-gnu-emacs

On Sun, 17 Feb 2013, Eric Abrahamsen wrote:

> I'm going to need to do a large scale search-and-replace on a series of
> text files, using a sort of dictionary or hash-table of search terms and
> their replacement. The text files are filled to the usual fill column.
> The search terms may be broken across linebreaks, and I'm not sure of
> the best way to handle this. If it was regular English words I could
> probably manage a programmatic version of `isearch-toggle-word', but in
> this case these are solid strings, and might be broken anywhere.
> 
> The two solutions I can think of are: 1) break up the characters in the
> search string and insert "\n?" between each one to create regexps to
> search on, and 2) unfill the whole file at the start of the procedure
> and then refill it afterwards. Neither of these seems like a great
> idea -- does anyone have any brighter ideas?
> 
> Thanks,
> Eric
> 
Think sed, and do yourself a huge favor and find a really experienced sed 
user because with the complexity of what you want to do now, that will 
probably be your best insurance against leaving your project in a 
total mess.  If you must do all of this yourself, please consider sending 
a blank email to sed-users-subscribe@yahoogroups.com and asking some questions 
on that email list before you take definitive action on your project.  
Also, remember there are people who believe in the religion of backups and 
then there are people who are going to get religion.  The sed-users email 
list is usually high signal and low noise and usually low volume in terms 
of traffic passed.

--------------------------------------------------------------------------- 
jude <jdashiel@shellworld.net> Remember Microsoft didn't write Tiger 10.4 
or any of its successors.





* Re: search across linebreaks
       [not found] ` <mailman.20189.1361106838.855.help-gnu-emacs@gnu.org>
@ 2013-02-17 14:43   ` J G Miller
  0 siblings, 0 replies; 10+ messages in thread
From: J G Miller @ 2013-02-17 14:43 UTC (permalink / raw)
  To: help-gnu-emacs

On Sunday, February 17th, 2013, at 08:13:52h -0500,
Jude DaShiell suggested:

> On Sun, 17 Feb 2013, Eric Abrahamsen wrote:
> 
>> I'm going to need to do a large scale search-and-replace on a series of
>> text files, using a sort of dictionary or hash-table of search terms and
>> their replacement.
   ...
>> 
> Think sed ...

For what is required, sed is not going to be adequate.

awk at the very minimum, and preferably perl, or for those
who are object oriented, python, or maybe ruby.




* RE: search across linebreaks
  2013-02-17  7:43 search across linebreaks Eric Abrahamsen
  2013-02-17 13:13 ` Jude DaShiell
       [not found] ` <mailman.20189.1361106838.855.help-gnu-emacs@gnu.org>
@ 2013-02-17 15:52 ` Drew Adams
  2013-02-18  3:52   ` Eric Abrahamsen
  2013-02-17 17:05 ` Andreas Röhler
  2013-02-18 13:09 ` Nicolas Richard
  4 siblings, 1 reply; 10+ messages in thread
From: Drew Adams @ 2013-02-17 15:52 UTC (permalink / raw)
  To: 'Eric Abrahamsen', help-gnu-emacs

> I'm going to need to do a large scale search-and-replace on a 
> series of text files, using a sort of dictionary or hash-table of 
> search terms and their replacement. The text files are filled
> to the usual fill column.  The search terms may be broken across
> linebreaks, and I'm not sure of the best way to handle this.
> If it was regular English words I could probably manage a
> programmatic version of `isearch-toggle-word', but in
> this case these are solid strings, and might be broken anywhere.
> 
> The two solutions I can think of are: 1) break up the characters
> in the search string and insert "\n?" between each one to create
> regexps to search on, and 2) unfill the whole file at the start
> of the procedure and then refill it afterwards. Neither of these
> seems like a great idea -- does anyone have any brighter ideas?

What's not clear is whether any of the newline chars are significant.  From what
you wrote I'm guessing no: they can all be ignored or just removed.  But in that
case, filling would mean filling one big paragraph.

Or perhaps consecutive newlines (\n\n) are significant, separating paragraphs?
In that case, you could remove all newlines except one for each consecutive
group (i.e., paragraph separation).

Assuming no newlines are significant (or only one of consecutive ones is), the
two solutions you propose sound reasonable to me.  Which of them to use might
depend on size etc. - relative time to remove newlines and later refill vs the
\n? regexp match time.
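
As a rough sketch of the "remove lone newlines, keep paragraph breaks"
variant: the function below works on a string and treats any run of two
or more newlines as a paragraph separator. The name
`my/join-single-newlines' and the JOINER argument are assumptions made
for this example. The default removes the newline outright, which suits
terms that can be broken anywhere; pass " " for space-separated text.

```elisp
(defun my/join-single-newlines (text &optional joiner)
  "Return TEXT with each lone newline replaced by JOINER (default \"\").
Runs of two or more newlines, taken here as paragraph breaks, are kept."
  (replace-regexp-in-string
   "\n+"
   (lambda (run)
     ;; A run of length 1 is a fill break; longer runs are left alone.
     (if (= (length run) 1) (or joiner "") run))
   text t t))
```

Note that a single trailing newline at the end of TEXT counts as a lone
newline too, and that this does nothing about indentation after a break.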





* Re: search across linebreaks
  2013-02-17  7:43 search across linebreaks Eric Abrahamsen
                   ` (2 preceding siblings ...)
  2013-02-17 15:52 ` Drew Adams
@ 2013-02-17 17:05 ` Andreas Röhler
  2013-02-18 13:09 ` Nicolas Richard
  4 siblings, 0 replies; 10+ messages in thread
From: Andreas Röhler @ 2013-02-17 17:05 UTC (permalink / raw)
  To: help-gnu-emacs

Am 17.02.2013 08:43, schrieb Eric Abrahamsen:
> I'm going to need to do a large scale search-and-replace on a series of
> text files, using a sort of dictionary or hash-table of search terms and
> their replacement. The text files are filled to the usual fill column.
> The search terms may be broken across linebreaks, and I'm not sure of
> the best way to handle this. If it was regular English words I could
> probably manage a programmatic version of `isearch-toggle-word', but in
> this case these are solid strings, and might be broken anywhere.
>
> The two solutions I can think of are: 1) break up the characters in the
> search string and insert "\n?" between each one to create regexps to
> search on, and 2) unfill the whole file at the start of the procedure
> and then refill it afterwards. Neither of these seems like a great
> idea -- does anyone have any brighter ideas?
>
> Thanks,
> Eric
>
>
>

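IMO Emacs Lisp is much better suited to this kind of task than sed, awk, Perl, etc.
That's because you can chain matching conditions almost indefinitely while jumping around in the buffer,
and that way break complex regexps up or avoid them altogether.

Write a function which first takes your list of files,

then

;; files-list, first-condition, second-condition and do-what-is-needed
;; are placeholders
(dolist (file files-list)
  (with-current-buffer (find-file-noselect file)
    (goto-char (point-min))
    (while (re-search-forward first-condition nil t)
      (while (re-search-forward second-condition nil t)
        ;; ...
        (do-what-is-needed)))))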




* Re: search across linebreaks
  2013-02-17 15:52 ` Drew Adams
@ 2013-02-18  3:52   ` Eric Abrahamsen
  2013-02-18  4:01     ` Jambunathan K
  0 siblings, 1 reply; 10+ messages in thread
From: Eric Abrahamsen @ 2013-02-18  3:52 UTC (permalink / raw)
  To: help-gnu-emacs

"Drew Adams" <drew.adams@oracle.com> writes:

>> I'm going to need to do a large scale search-and-replace on a 
>> series of text files, using a sort of dictionary or hash-table of 
>> search terms and their replacement. The text files are filled
>> to the usual fill column.  The search terms may be broken across
>> linebreaks, and I'm not sure of the best way to handle this.
>> If it was regular English words I could probably manage a
>> programmatic version of `isearch-toggle-word', but in
>> this case these are solid strings, and might be broken anywhere.
>> 
>> The two solutions I can think of are: 1) break up the characters
>> in the search string and insert "\n?" between each one to create
>> regexps to search on, and 2) unfill the whole file at the start
>> of the procedure and then refill it afterwards. Neither of these
>> seems like a great idea -- does anyone have any brighter ideas?
>
> What's not clear is whether any of the newline chars are significant.  From what
> you wrote I'm guessing no: they can all be ignored or just removed.  But in that
> case, filling would mean filling one big paragraph.
>
> Or perhaps consecutive newlines (\n\n) are significant, separating paragraphs?
> In that case, you could remove all newlines except one for each consecutive
> group (i.e., paragraph separation).
>
> Assuming no newlines are significant (or only one of consecutive ones is), the
> two solutions you propose sound reasonable to me.  Which of them to use might
> depend on size etc. - relative time to remove newlines and later refill vs the
> \n? regexp match time.

Thanks to all! Sed is something I've considered learning, but given its
learning curve, and the time I've already put into elisp, (and the fact
that I'm not even "supposed" to be a programmer in the first place!)
I'll probably go with an in-emacs solution.

For the unfill solution, I was thinking of actually running through the
file with fill-paragraph and a giant fill-column value, rather than just
deleting newlines, but I'm hesitating. These are org-mode files, and
fill-paragraph ought not to wreck them, but still...
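
A minimal sketch of that unfill step, using `fill-region' with a huge
`fill-column' so that every paragraph becomes a single line. The name
`my/unfill-buffer' is made up here, and this has only been thought
through for plain text; how it behaves on org-mode structure (tables,
drawers, lists) is an open question, so try it on a copy first.

```elisp
(defun my/unfill-buffer ()
  "Unfill every paragraph of the current buffer into one long line."
  ;; With an effectively unlimited fill column, filling only joins lines.
  (let ((fill-column most-positive-fixnum))
    (fill-region (point-min) (point-max))))
```

Refilling afterwards would be the same `fill-region' call with the
normal `fill-column' in effect.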

This is a one-time bulk operation -- I'm translating a bunch of key
terms -- so the expense of the operation isn't that big a deal. The
consecutive newline question is a good one: definitely only one in a
row, but then there's potential indentation whitespace on the left...
I think I'll give this one a shot for now.

Thanks for the food for thought,

E





* Re: search across linebreaks
  2013-02-18  3:52   ` Eric Abrahamsen
@ 2013-02-18  4:01     ` Jambunathan K
  2013-02-18  6:09       ` Eric Abrahamsen
  0 siblings, 1 reply; 10+ messages in thread
From: Jambunathan K @ 2013-02-18  4:01 UTC (permalink / raw)
  To: Eric Abrahamsen; +Cc: help-gnu-emacs


Eric

Your use-case - I understand it in only a fragmented way, from various
posts on Orgmode list - is quite unique.  You are a professional(?)
translator between Chinese and English.

I would recommend that you talk about or document your use cases as a
bilingual(?) translator in a publicly accessible place - say Emacswiki.

I think you should gather and share all the various little snippets
that your .emacs is filled with as you go about your translation work.

Since you are a regular on the Orgmode list, I am not sure how you will
treat a suggestion from me.  My reputation is plainly questionable.
That said, do consider my suggestion, FWIW.
-- 




* Re: search across linebreaks
  2013-02-18  4:01     ` Jambunathan K
@ 2013-02-18  6:09       ` Eric Abrahamsen
  0 siblings, 0 replies; 10+ messages in thread
From: Eric Abrahamsen @ 2013-02-18  6:09 UTC (permalink / raw)
  To: help-gnu-emacs; +Cc: kjambunathan

Jambunathan K <kjambunathan@gmail.com> writes:

> Eric
>
> Your use-case - I understand it in only a fragmented way, from various
> posts on Orgmode list - is quite unique.  You are a professional(?)
> translator between Chinese and English.

Yup, translation mixed with various publishing-related things.

> I would recommend that you talk about or document your use cases as a
> bilingual(?) translator in a publicly accessible place - say Emacswiki.
>
> I think you should gather and share all the various little snippets
> that your .emacs is filled with as you go about your translation work.

Oh I'll definitely post anything I get working properly -- I'm not shy!
Just hampered by the fact that programming is a hobby only, and my elisp
skills are barely passing from "beginner" into "intermediate".

The two larger things I'm working on are work-related. One is a
translation environment built on top of Orgmode, that stores
translations in a TMX XML format, and provides a limited follow mode and
automatic translation. My question in this thread isn't about that,
exactly, but I'm hoping to learn some useful lessons from it. I wouldn't
be surprised if I'm the only one who ever uses this package.

The other is a major mode for creating and editing Epub ebooks. There
aren't many good free Epub tools out there, and I think emacs could be a
great environment for that.

Both of these will take me a very long time! Both, incidentally, are
currently bogged down in emacs' limited XML parsing/manipulation
abilities. But I'll definitely post something on the wiki once I have
some bits and pieces that work. I see my wiki profile is out of date,
maybe I'll start with that.

> Since you are a regular on the Orgmode list, I am not sure how you will
> treat a suggestion from me.  My reputation is plainly questionable.
> That said, do consider my suggestion, FWIW.

Of course! It's a good suggestion, and you've been very helpful to me on
several occasions in the past. It seems that, thankfully, the curtain is
now drawn on the orgmode drama.

Eric





* Re: search across linebreaks
  2013-02-17  7:43 search across linebreaks Eric Abrahamsen
                   ` (3 preceding siblings ...)
  2013-02-17 17:05 ` Andreas Röhler
@ 2013-02-18 13:09 ` Nicolas Richard
  2013-02-19  1:22   ` Eric Abrahamsen
  4 siblings, 1 reply; 10+ messages in thread
From: Nicolas Richard @ 2013-02-18 13:09 UTC (permalink / raw)
  To: help-gnu-emacs

Eric Abrahamsen <eric@ericabrahamsen.net> writes:
> The two solutions I can think of are: 1) break up the characters in the
> search string and insert "\n?" between each one to create regexps to
> search on, and 2) unfill the whole file at the start of the procedure
> and then refill it afterwards. Neither of these seems like a great
> idea -- does anyone have any brighter ideas?

Not bright by any means, but slightly different from your solutions. The idea
is: turn single newlines into spaces while recording a marker at each one
(runs of two or more newlines are left alone), and restore them afterwards.

(defun yf/test ()
  "Replace dictionary terms after point, even when split by a single newline."
  (let* (lom marker
         (case-fold-search nil) ; matches must equal the dict keys exactly
         (dict '(("foo bar" "foo barred")
                 ("foo baz" "foo bazzed")
                 ("foo foo" "foo fooed")))
         (regexp (regexp-opt (mapcar #'car dict))))
    ;; replace single newlines by spaces, recording a marker at each one
    (while (search-forward "\n" nil t)
      (if (looking-at "\n")
          (skip-chars-forward "\n")
        (replace-match " ")
        ;; put the marker before the space so `looking-at' finds it below
        (push (copy-marker (1- (point))) lom)))
    (goto-char (point-min))
    ;; replace matches according to dict
    (while (re-search-forward regexp nil t)
      (replace-match (cadr (assoc (match-string 0) dict)) t t))
    ;; transform markers into newlines again
    (while (setq marker (pop lom))
      (goto-char marker)
      (when (looking-at " ")
        (replace-match ""))
      (insert "\n"))))

There are many "areas for improvement" (aka bugs), e.g. it might be
necessary to allow more than just "\n" to be deleted/restored (I
imagine you could make `lom' into an alist of (marker . deleted-text)
and restore deleted-text instead of just inserting \n).
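
A sketch of that alist variant, under a few assumptions: a "break" is a
single newline plus any indentation that follows it, breaks are deleted
outright rather than turned into spaces, and the actual replacing is
done by a function passed in by the caller. The name
`my/call-with-breaks-hidden' is invented for this example.

```elisp
(defun my/call-with-breaks-hidden (fn)
  "Call FN with single line breaks removed from the current buffer.
Each removed break (newline plus following indentation) is recorded as
\(MARKER . DELETED-TEXT) and re-inserted at its marker afterwards."
  (let (breaks)
    (save-excursion
      (goto-char (point-min))
      (while (re-search-forward "\n+" nil t)
        (let ((beg (match-beginning 0)))
          ;; Only a lone newline is a fill break; longer runs stay put.
          (when (= (- (point) beg) 1)
            (skip-chars-forward " \t")
            ;; Leave a trailing newline and whitespace-only lines alone.
            (unless (or (eobp) (eq (char-after) ?\n))
              (push (cons (copy-marker beg)
                          (delete-and-extract-region beg (point)))
                    breaks)))))
      (goto-char (point-min))
      (funcall fn)
      ;; Put each break back wherever its marker has ended up.
      (dolist (break breaks)
        (goto-char (car break))
        (insert (cdr break))
        (set-marker (car break) nil)))))
```

If FN changes nothing, the buffer comes back exactly as it was. When a
replacement spans a hidden break, the break is re-inserted next to the
replacement rather than inside it, so some lines may end up longer or
shorter than the fill column.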

Test it with something like:
(progn
  (insert
   "One two foo
bar three do
bar baz foo
baz for foo

bar baz foo
bar foo bar
foo bar foo
bar foo bar")
  (goto-char (point-min))
  (yf/test))

-- 
N.





* Re: search across linebreaks
  2013-02-18 13:09 ` Nicolas Richard
@ 2013-02-19  1:22   ` Eric Abrahamsen
  0 siblings, 0 replies; 10+ messages in thread
From: Eric Abrahamsen @ 2013-02-19  1:22 UTC (permalink / raw)
  To: help-gnu-emacs

"Nicolas Richard" <theonewiththeevillook@yahoo.fr> writes:

> Eric Abrahamsen <eric@ericabrahamsen.net> writes:
>> The two solutions I can think of are: 1) break up the characters in the
>> search string and insert "\n?" between each one to create regexps to
>> search on, and 2) unfill the whole file at the start of the procedure
>> and then refill it afterwards. Neither of these seems like a great
>> idea -- does anyone have any brighter ideas?
>
> Not bright by any means, but slightly different from your solutions. The idea
> is : save newlines as markers (except if two or more consecutive), and
> restore afterwards.
>
> (defun yf/test nil ""
>   (let* (lom marker
> 	 (dict '(("foo bar" "foo barred")
> 		 ("foo baz" "foo bazzed")
> 		 ("foo foo" "foo fooed")))
> 	 (regexp (regexp-opt (mapcar 'car dict))))
>     ;; replace single newlines by markers (recorded in a list of markers)
>     (while (search-forward "\n" nil t)
>       (if (looking-at "\n")
> 	  (skip-chars-forward "\n")
> 	(replace-match " ")
> 	(add-to-list 'lom (set-marker (make-marker) (point)))))
>     (goto-char (point-min))
>     ;; replace matches according to dict
>     (while (re-search-forward regexp nil t)
>       (replace-match (cadr (assoc (match-string 0) dict)) t t))
>     ;; transform markers into newline again
>     (while (setq marker (pop lom))
>       (goto-char marker)
>       (when (looking-at " ")
> 	(replace-match ""))
>       (insert "\n"))))

That's a pretty interesting solution, thank you! I don't use markers
much, but the basic idea of the marker -- a spot that remains immobile
relative to the text around it -- seems pretty applicable to my problem.

Thanks!
Eric




