all messages for Emacs-related lists mirrored at yhetil.org
* Q on call-process and grep
@ 2005-12-22 18:05 Drew Adams
From: Drew Adams @ 2005-12-22 18:05 UTC


I'm using native Emacs on Windows, and I use Cygwin for commands like
`grep'. The (large) file "myfile" has only comma-delimited lines of words
and phrases, like this:

  word1,word2,word phrase3,word4,word phrase4,word5

Some of the lines are quite long. Lines are of varying length (different
words and phrases). There are no TABs. There are no spaces either, except
between words in phrases (none before or after commas or at eol).

If I do this from the command line or using Emacs command `grep', it works
fine:

  grep -i ",someword\\($\\|,\\)" "myfile"

If, however, I do this, then some words are found and others (which are also
present in myfile) are not found:

  (call-process "grep" nil buf nil "-i" ",someword\\($\\|,\\)" "myfile")

The same words are systematically found or not found. I haven't been able to
figure out why this doesn't work for some words (only). I visited the file
with `find-file-literally' and removed all ^M's at eol. I checked that the
commas are normal commas, thinking that some might have a different encoding
or something, making the regexp miss. All characters are ASCII, I believe
(how to check that?). The problem also doesn't seem to be related to line
lengths or the positions of the target words in the lines.

When I do `C-h C RET' it says that the buffer (for the grepped file) has no
conversion (binary), that the defaults for subprocess I/O are
`undecided-dos' for decoding and `undecided-unix' for encoding, and that
process I/O with target-pattern "bash" uses coding systems `(raw-text-dos .
raw-text-unix)'. I don't know if some of that might be a problem (or how to
change it, if it is).

I've tried having the error output sent to a file (e.g. "errors") - but to
no avail:

  (call-process "grep" nil (list (get-buffer "*scratch*") "errors")
     nil "-i" ",someword\\($\\|,\\)" "myfile")

Grep apparently does not report an error: the return code is 1, which grep
uses for "no lines matched" (2 would signal an error). The output buffer is
always empty for some words, as is the error file.

I get the same behavior in different versions of Emacs, so I'm no doubt
missing something (i.e. there is no bug, except in my understanding). Any
ideas? Any suggestions on how to debug this? Thx.
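
One way to narrow this down is to confirm that both invocations hand grep
the same pattern. A minimal sketch, assuming a POSIX-style shell such as
Cygwin bash and GNU grep; the helper name show_args and the sample data are
made up for illustration and are not part of the setup described above:

```shell
#!/bin/sh
# Hypothetical helper: print each argument on its own line, in brackets,
# so the exact text the shell passes to a program is visible.
show_args() {
    for arg in "$@"; do
        printf '[%s]\n' "$arg"
    done
}

# Inside double quotes the shell reduces each \\ to a single backslash,
# so grep receives the pattern with single backslashes.
pattern=",someword\\($\\|,\\)"
show_args "-i" "$pattern" "myfile"

# Made-up sample data in the same comma-delimited format.
sample=$(mktemp)
printf '%s\n' 'alpha,someword' 'beta,SomeWord,gamma' 'delta,somewordy,eps' > "$sample"

# Count the lines where the word is followed by a comma or end of line.
matches=$(grep -i -c "$pattern" "$sample")
echo "matching lines: $matches"
rm -f "$sample"
```

The Emacs Lisp reader applies the same reduction of \\ to \ in the string
passed to `call-process', so grep should receive an identical pattern there.
If that holds, the difference presumably lies elsewhere, for example in which
grep binary is found or in how the file is read.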


* Re: Q on call-process and grep
       [not found] <mailman.20189.1135274808.20277.help-gnu-emacs@gnu.org>
@ 2005-12-22 19:29 ` Eric Pement
  2005-12-22 21:11   ` Drew Adams
From: Eric Pement @ 2005-12-22 19:29 UTC


Drew Adams wrote:

> I'm using native Emacs on Windows, and I use Cygwin [ ... ]

> If I do this from the command line or using Emacs command `grep', it works
> fine:
>
>   grep -i ",someword\\($\\|,\\)" "myfile"
>
> If, however, I do this, then some words are found and others (which are also
> present in myfile) are not found:
>
>   (call-process "grep" nil buf nil "-i" ",someword\\($\\|,\\)" "myfile")
>
> The same words are systematically found or not found. I haven't been able to
> figure out why this doesn't work for some words (only).

If it were me, I'd make a copy of the file, and then chop it into
smaller pieces where I can illustrate the problem in a manageable
length (say, 10 or 20 lines, but the fewer the better). The sed command

   sed -n 17,35p bigfile >smallfile

will print lines 17 to 35, inclusive, so you can do your testing. But
you say that some of the lines are quite long. So try this:

   awk '{print length($0)}' smallfile

to see how long is too long. If the lines are under 4000 chars, I'd
feel safe in guessing that line length isn't a problem. If you have
lines 20,000 chars or more, then I'd start thinking about the input.

Does each line in the problem set end in a CR/LF? I've had datafiles
that gave me bad data because somehow some lines ended with CR/LF,
others with CR/CR/LF, and others with CR only. How I got the problem
isn't relevant. But to normalize the input, try

   tr -d '\r' <smallfile | sed -n p >clean_smallfile

which should remove any extraneous CRs which might be causing
corruption and restore the line endings to your Cygwin default (Unix or
DOS, whichever you picked).
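
To see whether the normalization changed anything, the lines ending in a CR
can be counted before and after. A rough sketch, assuming a POSIX shell and
a grep that passes CRs through unchanged; count_cr_lines and the sample data
are hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: count the lines of a file that end in a carriage return.
count_cr_lines() {
    cr=$(printf '\r')
    # grep -c prints 0 and exits with status 1 when nothing matches,
    # so the exit status is deliberately ignored.
    LC_ALL=C grep -c "${cr}\$" "$1" || true
}

# Made-up sample that mixes DOS and Unix line endings.
sample=$(mktemp)
clean=$(mktemp)
printf 'one,two\r\nthree,four\nfive,six\r\n' > "$sample"

# Same CR removal as in the tr command above.
tr -d '\r' < "$sample" > "$clean"

before=$(count_cr_lines "$sample")
after=$(count_cr_lines "$clean")
echo "lines ending in CR: before=$before after=$after"
rm -f "$sample" "$clean"
```

A nonzero count before and zero after would mean the file did contain stray
CRs; zero in both cases suggests line endings are not the cause.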

[ ... ]
> All characters are ASCII, I believe (how to check that?).

   Use tr to delete all the characters that are permissible or
expected, and whatever is left must be an unexpected character. Examine
the output with cat -A or od or your tool of choice. E.g.,

   tr -d '\n\r\t\40-\176' <infile >outfile

If it were me, I might wonder about embedded backspaces or carriage
returns in the text. Just a thought. Good luck on your hunting!
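
The tr check can be wrapped so that it reports a count and the affected line
numbers instead of a file to inspect by hand. A minimal sketch, assuming a
POSIX shell and a grep that supports -a; the helper name and the sample data
are made up:

```shell
#!/bin/sh
# Hypothetical helper: count bytes other than newline, CR, tab and
# printable ASCII (octal 40 through 176), using the same tr set as above.
count_unexpected_bytes() {
    LC_ALL=C tr -d '\n\r\t\40-\176' < "$1" | wc -c | tr -d ' '
}

sample=$(mktemp)
# \010 is a backspace and \351 is a Latin-1 e-acute: two unexpected bytes.
printf 'word1,wo\010rd2\ncaf\351,word3\nword4,word5\n' > "$sample"

unexpected=$(count_unexpected_bytes "$sample")
echo "unexpected bytes: $unexpected"

# List the numbers of the lines that contain such bytes.
tab=$(printf '\t')
cr=$(printf '\r')
bad_lines=$(LC_ALL=C grep -a -n "[^${tab}${cr} -~]" "$sample" \
    | cut -d: -f1 | paste -s -d ' ' -)
echo "lines with unexpected bytes: $bad_lines"
rm -f "$sample"
```

A count of 0 corresponds to an empty outfile from the tr command above.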

--
Eric Pement


* RE: Q on call-process and grep
  2005-12-22 19:29 ` Q on call-process and grep Eric Pement
@ 2005-12-22 21:11   ` Drew Adams
From: Drew Adams @ 2005-12-22 21:11 UTC


    If it were me, I'd make a copy of the file, and then chop it into
    smaller pieces where I can illustrate the problem in a manageable
    length (say, 10 or 20 lines, but the fewer the better). The sed command

       sed -n 17,35p bigfile >smallfile

There are over 30,000 lines.

    will print lines 17 to 35, inclusive, so you can do your testing. But
    you say that some of the lines are quite long. So try this:

       awk '{print length($0)}' smallfile

    to see how long is too long. If the lines are under 4000 chars, I'd
    feel safe in guessing that line length isn't a problem. If you have
    lines 20,000 chars or more, then I'd start thinking about the input.

I was hoping that I was missing something simple. You seem to be confirming
that I didn't miss anything obvious (to you) ;-).

The longest line is over 12,000 characters.

    Does each line in the problem set end in a CR/LF? I've had datafiles
    that gave me bad data because somehow some lines ended with CR/LF,
    others with CR/CR/LF, and others with CR only. How I got the problem
    isn't relevant. But to normalize the input, try

       tr -d '\r' <smallfile | sed -n p >clean_smallfile

    which should remove any extraneous CRs which might be causing
    corruption and restore the line endings to your Cygwin default (Unix or
    DOS, whichever you picked).

Did that on the complete original file. `ediff' shows no difference from the
original.

I tried using a small file - just a few lines of the original - no change.
Terms that can't be found still aren't; those that can be found still are.

    Use tr to delete all the characters that are permissible or
    expected, and whatever is left must be an unexpected character. Examine
    the output with cat -A or od or your tool of choice. E.g.,

       tr -d '\n\r\t\40-\176' <infile >outfile

Did that. outfile is empty, so I guess everything was ASCII.

    If it were me, I might wonder about embedded backspaces or carriage
    returns in the text. Just a thought. Good luck on your hunting!

My guess is that the line lengths and number of lines don't matter here,
because it works fine for other words, including 1) words in the longest
line and 2) words in the last line of the file. It's a mystery to me why it
doesn't work for certain words.

Thanks for your suggestions, though - they were good things to try, even if
I haven't yet solved the problem.
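
One further thing that could be tried is a byte dump of a line containing a
word that is not found, to see exactly what sits next to it. A rough sketch,
assuming a POSIX shell with grep and od available; dump_first_hit and the
sample line (with an embedded backspace) are invented for illustration:

```shell
#!/bin/sh
# Hypothetical helper: print the first line containing a word (ignoring
# case) as a byte dump, so hidden characters around the word are visible.
dump_first_hit() {
    # $1 = word, $2 = file; -F treats the word as a fixed string.
    LC_ALL=C grep -a -i -F -- "$1" "$2" | head -n 1 | od -c
}

# Made-up sample: a backspace (\010) hides between the word and the comma,
# so a pattern requiring a comma or end of line right after the word fails.
sample=$(mktemp)
printf 'alpha,someword\010,beta\ngamma,delta\n' > "$sample"

dump=$(dump_first_hit someword "$sample")
printf '%s\n' "$dump"
rm -f "$sample"
```

In the dump, od -c shows the backspace as \b; any such escape next to the
word would explain why the stricter pattern misses it.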

