From: "Eric Pement" <pemente@northpark.edu>
Subject: Re: Q on call-process and grep
Date: 22 Dec 2005 11:29:02 -0800
Message-ID: <1135279742.488106.326830@g43g2000cwa.googlegroups.com>
In-Reply-To: mailman.20189.1135274808.20277.help-gnu-emacs@gnu.org

Drew Adams wrote:

> I'm using native Emacs on Windows, and I use Cygwin [ ... ]

> If I do this from the command line or using Emacs command `grep', it works
> fine:
>
>   grep -i ",someword\\($\\|,\\)" "myfile"
>
> If, however, I do this, then some words are found and others (which are also
> present in myfile) are not found:
>
>   (call-process "grep" nil buf nil "-i" ",someword\\($\\|,\\)" "myfile")
>
> The same words are systematically found or not found. I haven't been able to
> figure out why this doesn't work for some words (only).

If it were me, I'd make a copy of the file, and then chop it into
smaller pieces where I can illustrate the problem in a manageable
length (say, 10 or 20 lines, but the fewer the better). The sed command

   sed -n 17,35p bigfile >smallfile

will copy lines 17 to 35, inclusive, into smallfile so you can do your
testing. But you say that some of the lines are quite long. So try this:

   awk '{print length($0)}' smallfile

to see how long each line is. If the lines are under 4000 chars, I'd
feel safe in guessing that line length isn't a problem. If you have
lines 20,000 chars or more, then I'd start thinking about the input.
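
If the column of numbers is too long to eyeball, the same awk idea can
report only the longest line and where it is. A minimal sketch; the
function name longest_line is made up here for illustration:

```shell
# Hypothetical helper: print the length and line number of the longest line.
# Note: awk's length() counts characters or bytes depending on the locale;
# for plain ASCII input the two are the same.
longest_line() {
    awk 'length($0) > max { max = length($0); n = NR }
         END { printf "%d chars at line %d\n", max, n }' "$1"
}
```

Something like `longest_line smallfile` would then give you one line of
output to compare against the 4000 and 20,000 figures above.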

Does each line in the problem set end in a CR/LF? I've had datafiles
that gave me bad data because somehow some lines ended with CR/LF,
others with CR/CR/LF, and others with CR only. How I got the problem
isn't relevant. But to normalize the input, try

   tr -d '\r' <smallfile | sed -n p >clean_smallfile

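which should remove any extraneous CRs which might be causing
corruption and restore the line endings to your Cygwin default (Unix or
DOS, whichever you picked). One caveat: a line that ends in a bare CR
has no LF left once the CR is deleted, so it gets joined to the line
that follows it.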
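
Before cleaning anything, it may help to find out whether stray CRs are
there at all. A rough sketch, assuming an awk that leaves the CR in $0
when it splits records on LF (true on a Unix-style setup; a text-mode
Cygwin mount might strip it first). The name count_cr_lines is
hypothetical:

```shell
# Hypothetical helper: count how many lines still end in a carriage return.
# A line ending in CR/CR/LF also counts, since its last character is a CR.
count_cr_lines() {
    awk '/\r$/ { cr++ }
         END { printf "%d of %d lines end in CR\n", cr, NR }' "$1"
}
```

If `count_cr_lines smallfile` reports some but not all lines, the file
has mixed line endings, which would fit the "some words found, others
not" symptom when the pattern is anchored with $.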

[ ... ]
> All characters are ASCII, I believe (how to check that?).

   Use tr to delete all the characters that are permissible or
expected, and whatever is left must be an unexpected character. Examine
the output with cat -A or od or your tool of choice. E.g.,

   tr -d '\n\r\t\40-\176' <infile >outfile
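
To get a quick yes/no answer first, the same tr filter can be piped into
wc to count whatever survives. This is only a sketch, and the function
name unexpected_bytes is invented for the example:

```shell
# Hypothetical helper: count the bytes that are NOT newline, CR, tab,
# or printable ASCII (octal 40 through 176).
unexpected_bytes() {
    # The final tr strips the padding some wc implementations add.
    tr -d '\n\r\t\40-\176' <"$1" | wc -c | tr -d ' '
}
```

A result of 0 from `unexpected_bytes infile` would mean the file holds
nothing but the expected characters; anything else is worth dumping
with od -c to see exactly which bytes they are.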

If it were me, I might wonder about embedded backspaces or carriage
returns in the text. Just a thought. Good luck on your hunting!

--
Eric Pement
