unofficial mirror of help-gnu-emacs@gnu.org
* Checking if a file is binary (non-textual)
@ 2009-09-28 13:29 Nordlöw
  2009-09-28 14:34 ` Jeff Clough
From: Nordlöw @ 2009-09-28 13:29 UTC
  To: help-gnu-emacs

What characters (bytes) should *not* be present in a text file that
may contain variable-length Unicode characters?
What does the Unicode standard say about this?

The reason for asking:
I am working on a tool that unifies grep, tags-query-replace, occur,
etc.
And I really would like this tool to have some clever default
behaviour for determining how to present the search (grep) hit-context
for different file-types:
- textual files: show the whole line (as grep and occur do)
- binary files: either no context, just a match notification (like
grep), or maybe all [a-zA-Z0-9_]* characters directly before or after
the hit
- ...

/Nordlöw



* Re: Checking if a file is binary (non-textual)
  2009-09-28 13:29 Checking if a file is binary (non-textual) Nordlöw
@ 2009-09-28 14:34 ` Jeff Clough
  2009-09-28 15:49   ` Eli Zaretskii
From: Jeff Clough @ 2009-09-28 14:34 UTC
  To: help-gnu-emacs

From: Nordlöw <per.nordlow@gmail.com>
Date: Mon, 28 Sep 2009 06:29:28 -0700 (PDT)

> What characters (bytes) should *not* be present in a text file that
> may contain variable-length Unicode characters?
> What does the Unicode standard say about this?

If all you care about is UTF-8 and you trust Wikipedia:

http://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences

If you want more detail, that article has links to the relevant
standards.

For what it's worth though, following these standards isn't really
going to help you sort out the "binary" files from the "text" files.
Sure, if software wants to encode "text" in a way that is readable to
the rest of the world, it needs to follow these standards.  But if the
software wants to store data in the file in a non-"text" format, it
can do whatever it wants, including popping out a "binary" file that
looks like perfectly valid UTF-8.
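
To make that concrete, here is a rough Python sketch (an illustration,
not anything from the standards). The names is_valid_utf8 and
looks_binary are made up for this example, and the "NUL byte or invalid
UTF-8" rule is only an assumed heuristic, not a definition of "binary".

```python
def is_valid_utf8(data: bytes) -> bool:
    """True if `data` decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def looks_binary(data: bytes) -> bool:
    """Assumed heuristic: a NUL byte or invalid UTF-8 suggests binary."""
    return b"\x00" in data or not is_valid_utf8(data)


if __name__ == "__main__":
    samples = {
        "UTF-8 text": "Nordlöw".encode("utf-8"),
        "invalid lead bytes": b"\xff\xfe",
        "control bytes only": b"\x01\x02\x03\x04",
        "text with a NUL": b"abc\x00def",
    }
    for label, data in samples.items():
        print(label, is_valid_utf8(data), looks_binary(data))
```

Note the "control bytes only" sample: it is valid UTF-8 and has no NUL,
so the heuristic calls it text even though no human would. Any check of
this kind is a guess about the file, not a proof.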

MS systems tried to solve this by having a bit in the filesystem entry
as a binary/text flag, but even that can't be trusted.

Anyway, hope I helped! :)

Jeff



----------
Author of the Genesys System
A "free" universal role-playing game.
http://www.chaosphere.com/genesys/ 





* Re: Checking if a file is binary (non-textual)
  2009-09-28 14:34 ` Jeff Clough
@ 2009-09-28 15:49   ` Eli Zaretskii
  2009-09-28 16:50     ` Jeff Clough
From: Eli Zaretskii @ 2009-09-28 15:49 UTC
  To: help-gnu-emacs

> Date: Mon, 28 Sep 2009 14:34:44 +0000
> From: Jeff Clough <jeff@chaosphere.com>
> 
> MS systems tried to solve this by having a bit in the filesystem entry
> as a binary/text flag, but even that can't be trusted.

Which bit is that?  AFAIK, the text/binary distinction is made entirely
at the run-time library level; the filesystem and the low-level APIs to
the filesystem do not distinguish between text and binary.  But I'll
be glad to learn something new.





* Re: Checking if a file is binary (non-textual)
  2009-09-28 15:49   ` Eli Zaretskii
@ 2009-09-28 16:50     ` Jeff Clough
From: Jeff Clough @ 2009-09-28 16:50 UTC
  To: help-gnu-emacs

From: Eli Zaretskii <eliz@gnu.org>
Date: Mon, 28 Sep 2009 17:49:43 +0200

>> Date: Mon, 28 Sep 2009 14:34:44 +0000
>> From: Jeff Clough <jeff@chaosphere.com>
>> 
>> MS systems tried to solve this by having a bit in the filesystem entry
>> as a binary/text flag, but even that can't be trusted.
> 
> Which bit is that?  AFAIK, the text/binary distinction is made entirely
> at the run-time library level; the filesystem and the low-level APIs to
> the filesystem do not distinguish between text and binary.  But I'll
> be glad to learn something new.

Oops!  This is what I get for not hacking this shit under Windows for
so long.  You are right.  As far as I can tell, this is a library
thing that simply translates certain bytes differently depending on
the mode you opened the file in.  I mis-remembered it as something at
the filesystem level.

I *knew* something was off because I remembered it couldn't be
trusted, and now I remember why.
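
For the curious, the same library-level effect can be seen in Python,
whose text layer is used here only as an analogy to the C run-time's
text mode. The helper written_bytes is hypothetical; it passes
newline="\r\n" explicitly so the translation shows up on any platform.

```python
import os
import tempfile


def written_bytes(text: str, newline: str) -> bytes:
    """Write `text` in text mode with the given newline convention,
    then read the file back in binary mode to see the raw bytes."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        # Text mode: the library may translate "\n" on the way out.
        with open(path, "w", encoding="utf-8", newline=newline) as f:
            f.write(text)
        # Binary mode: bytes are returned exactly as stored.
        with open(path, "rb") as f:
            return f.read()
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(written_bytes("a\nb\n", "\r\n"))  # newline translated on write
    print(written_bytes("a\nb\n", "\n"))    # no translation
```

Nothing is recorded in the file or the filesystem about which mode was
used; the difference exists only in how the library wrote the bytes.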

Jeff







