Perhaps people would enjoy reading about how one could inject trojan semantics into a compiler in such a way that the trojan has some resilience to all but quite major structural changes to the target source code? This is not just for the purposes of frightening the birds. It will also show how useful tools which automatically analyse and transform program syntax could be. Imagine this in a 'whole program' IDE, so that one could query certain classes of variable assignment in an Operating System kernel, say, and program global re-writes based on pattern matching of whole syntactic structures, rather than just the sub-lexical matching which regular expressions offer.

Here is an implementation of a Prolog interpreter. The interpreter itself is only 88 lines of Standard ML. Most of it is written using list functionals such as map, exists, find and partition, so it should be clear that it could quite easily be translated into a fairly efficient C program.

   (* (Apart from the != predicate) This is a transliteration of the
      Prolog interpreter in John Harrison's Cambridge lecture notes. *)

   datatype term =
      Fn of string * term list
    | Var of string

   datatype outcome =
      Yes of (string * term) list
    | No

   local
      open GrammarSyntax
      fun parser startsymbol =
         let fun parse file stream lexbuf =
                let val expr = startsymbol PrologLexer.Token lexbuf
                in Parsing.clearParser(); expr
                end
                handle exn => (Parsing.clearParser(); raise exn)
         in parse
         end
      fun processString pfn =
             fn s => pfn "string" s (Lexing.createLexerString s)
      fun qlToString l =
         let fun iter r [] = r
               | iter r ((QUOTE s)::fs) = iter (r^s) fs
               | iter r ((ANTIQUOTE s)::fs) = iter (r^s) fs
         in iter "" l
         end
      val parserulesq = processString (parser PrologParser.File) o qlToString
      val parsetermq = processString (parser PrologParser.OneTerm) o qlToString
      fun fails s t = (printTree t; raise Fail (s^": no case."))
      fun mkTerm (NonTerm("atom",[NonTerm("constant",[Term(name)]),
                                  l as NonTerm("term-list",_)]))
             = Fn(name, mkArgs l)
        | mkTerm (NonTerm("atom",[NonTerm("constant",[Term(name)])]))
             = Fn(name, [])
        | mkTerm (NonTerm("atom",[NonTerm("variable",[Term(name)])]))
             = Var(name)
        | mkTerm (NonTerm("atom",[l as NonTerm("list",_)]))
             = mkList(l)
        | mkTerm t = fails "mkTerm" t
      and mkList (NonTerm("list",[a as NonTerm("atom",_),
                                  l as NonTerm("list",_)]))
             = Fn(".", [mkTerm a, mkList l])
        | mkList (NonTerm("list",[a as NonTerm("atom",_)]))
             = Fn(".", [mkTerm a, Fn("[]",[])])
        | mkList (NonTerm("list",[a as NonTerm("atom",_),
                                  a' as NonTerm("atom",_)]))
             = Fn(".", [mkTerm a, mkTerm a'])
        | mkList (NonTerm("list",[]))
             = Fn("[]",[])
        | mkList t = fails "mkList" t
      and mkArgs (NonTerm("term-list",[a as NonTerm("atom",_),
                                       l as NonTerm("term-list",_)]))
             = (mkTerm a)::(mkArgs l)
        | mkArgs (NonTerm("term-list",[a as NonTerm("atom",_)]))
             = [mkTerm a]
        | mkArgs t = fails "mkArgs" t
      fun mkTerms (NonTerm("terms",[a as NonTerm("atom",_),
                                    ts as NonTerm("terms",_)]))
             = (mkTerm a)::(mkTerms ts)
        | mkTerms (NonTerm("terms",[a as NonTerm("atom",_)]))
             = [mkTerm a]
        | mkTerms t = fails "mkTerms" t
      and mkRule (NonTerm("rule",[a as NonTerm("atom",_),
                                  ts as NonTerm("terms",_)]))
             = (mkTerm a, mkTerms ts)
        | mkRule (NonTerm("rule",[a as NonTerm("atom",_)]))
             = (mkTerm a, [])
        | mkRule t = fails "mkRule" t
      and mkRules (NonTerm("rule-list",[l as NonTerm("rule-list",_),
                                        r as NonTerm("rule",_)]))
             = (mkRule r)::(mkRules l)
        | mkRules (NonTerm("rule-list",[r as NonTerm("rule",_)]))
             = [mkRule r]
        | mkRules t = fails "mkRules" t
   in
      val rules = List.rev o mkRules o parserulesq
      val goal = mkTerm o parsetermq
   end
   local
      fun occurs_in x =
             fn (Var y) => x = y
              | (Fn(_,l)) => List.exists (occurs_in x) l
      fun assoc i =
             (Option.map (fn (_,v) => v)) o (List.find (fn (k,_) => k = i))
      fun subs insts =
             fn (tm as (Var y)) =>
                   (case assoc y insts of NONE => tm | SOME v => v)
              | (Fn(s,l)) => Fn(s, List.map (subs insts) l)
      fun augment1 theta (x,s) =
         let val s' = subs theta s
         in if occurs_in x s andalso not (s = Var(x))
               then raise Fail "Occurs check."
               else (x,s')
         end
      fun raw_augment p insts = p::(List.map (augment1 [p]) insts)
      fun augment (v,t) insts =
         let val t' = subs insts t
         in case t' of
               Var (w) =>
                  if w <= v
                     then if w = v then insts
                          else raw_augment (v,t') insts
                     else raw_augment (w, Var(v)) insts
             | _ => if occurs_in v t'
                       then raise Fail "Occurs check."
                       else raw_augment (v,t') insts
         end
      fun itlist2 f =
         let fun iter [] [] = (fn b => b)
               | iter (h::t) (h'::t') = (fn b => f h h' (itlist2 f t t' b))
               | iter _ _ = raise Fail "Arity."
         in iter
         end
      fun unify tm1 tm2 insts =
         case tm1 of
            Var(x) =>
               (case assoc x insts of
                   NONE => augment (x,tm2) insts
                 | SOME tm1' => unify tm1' tm2 insts)
          | Fn(f1,args1) =>
               (case tm2 of
                   (Var(y)) =>
                      (case assoc y insts of
                          NONE => augment (y,tm1) insts
                        | SOME tm2' => unify tm1 tm2' insts)
                 | Fn(f2,args2) =>
                      if f1 = f2
                         then itlist2 unify args1 args2 insts
                         else raise Fail ("Constants: mismatch: "^f1^"<>"^f2))
      fun rename s =
             fn (Var v) => Var("~"^v^s)
              | (Fn(f,args)) => Fn(f, List.map (rename s) args)
      fun rename_rule s (conc,assums) =
             (rename s conc, List.map (rename s) assums)
      fun expand n rules insts goals =
         let fun first f =
                (fn [] => raise Fail "No rules apply."
                  | (h::t) => (f h handle Fail _ => first f t))
             fun search rule =
                if goals = [] then insts
                else
                   let fun eqgoal (Fn("!=",[a,b])) = not (a = b)
                         | eqgoal _ = false
                       val (conc,assums) = rename_rule (Int.toString n) rule
                       val goal = hd goals
                       val insts' =
                              if eqgoal goal
                                 then insts (* CHECK! This is probably wrong. *)
                                 else unify conc goal insts
                       fun occurs (v,_) =
                              occurs_in v conc orelse
                              List.exists (occurs_in v) assums
                       val (loc,glob) = List.partition occurs insts'
                       val goals' = (List.map (subs loc) assums) @ (tl goals)
                   in expand (n+1) rules glob goals'
                   end
         in first search rules
         end
   in
      fun prolog rules goal =
         let val insts = expand 0 rules [] [goal]
         in Yes (List.filter (fn (v,_) => occurs_in v goal) insts)
         end
         handle Fail _ => No
   end

Here is an example of what it can do (a rather tired old one, but bear with us, because we will shortly reinterpret it as something more interesting):

   val () = load "Prolog";
   val () = Meta.quotation := true;

   val rs = Prolog.rules `
   male(albert).
   male(edward).
   female(alice).
   female(victoria).
   father_of(albert,edward).
   father_of(albert,alice).
   mother_of(victoria,edward).
   mother_of(victoria,alice).
   parents(X,M,F):-mother_of(M,X),father_of(F,X).
   sister_of(X,Y):- female(X), parents(X,M,F), parents(Y,M,F).
   brother_of(X,Y):- !=(X,Y), male(X), parents(X,M,F), parents(Y,M,F).
   `;

This defines a database of facts, the so-called ground terms, which are those which don't contain any logical variables (in Prolog these are typically the all-upper-case identifiers). Then there are _rules_ which define relations (in the set-theoretical sense of the term, but here also in the genealogical sense). Note that the two-place predicate sister_of is indirectly defined in terms of another rule, which defines the parents relation.
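It is worth seeing what such a quotation actually becomes: the parser turns each fact and goal into a value of the term datatype above. Here is a small illustration, assuming the session just shown (the names g0 and g0' are invented here):

   val g0 = Prolog.goal `male(albert)`

   (* g0 is precisely this term value: *)
   val g0' = Prolog.Fn("male", [Prolog.Fn("albert", [])])

   val same = (g0 = g0')   (* true; term admits structural equality *)

The interpreter never sees 'Prolog source' at all, only trees of Fn and Var nodes.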
But these 88 lines of code will resolve the indirection in sister_of, and so they can correctly deduce, for example:

   val g = Prolog.goal `sister_of(alice,edward)`
   val r = Prolog.prolog rs g

giving

   val r = Prolog.Yes [] : outcome

Now look at the database of facts, and note that it is really just an abstract syntax representation of a collection of statements in an arbitrary language (i.e. a language with undefined syntax). Therefore one can apply Prolog deductions, quite trivially, to sentences in a defined programming language such as C. Now the C compiler, which we imagine we are attacking, already has all the code needed to analyse and manipulate abstract syntax representations of C programs, because that is its job! So I hope people can see, without my having to install it into GCC 4.9, that one can easily implement, in very little code, a Prolog interpreter which can decide a quite general class of predicates concerning a program whose structure is represented in the compiler as abstract syntax.

Now note that the predicates need not have a fixed 'depth'. Here are two rules defining a path in an arbitrary directed graph, which is a non-wellfounded structure capable of representing almost anything, including the sentences of any language defined by a context-free grammar. The ground terms here represent a graph in three disjoint parts, with two cycles:

   val rs'' = Prolog.rules `
   edge(g,h).
   edge(e,f).
   edge(f,e).
   edge(a,b).
   edge(b,c).
   edge(c,d).
   edge(d,a).
   path(A,B):-edge(A,B).
   path(A,B):-edge(A,C),path(C,B).
   `;

Prolog can determine whether or not there is a path from one vertex to another, no matter how far apart they may be:

   val g2 = Prolog.goal `path(a,d)`
   val r2 = Prolog.prolog rs'' g2

It prints

   val r2 = Prolog.Yes [] : outcome

And it can tell which vertices are reachable from a given vertex:

   val g3 = Prolog.goal `path(b,X)`
   val r3 = Prolog.prolog rs'' g3

The result is

   val r3 = Prolog.Yes [("X", Prolog.Fn("c", []))] : outcome

In this implementation it only prints the first such vertex, but it could find them all. So it should not require any great stretching of people's credulity to see that a very few lines of intelligently written code can recognise large-scale structure in abstract syntax, even when quite extensive changes may have been made to the leaves.
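To make the connection with program syntax concrete, here is the same trick applied to a hypothetical encoding of a C parse tree. Suppose the compromised compiler emitted a ground term assign(v,e) for each assignment it encounters; the facts and the taints predicate below are invented for illustration, but nothing in the interpreter itself needs to change:

   val rs' = Prolog.rules `
   assign(uid,call_getuid).
   assign(euid,uid).
   taints(E,V):-assign(V,E).
   taints(E,V):-assign(W,E),taints(W,V).
   `;

   val g4 = Prolog.goal `taints(call_getuid,euid)`
   val r4 = Prolog.prolog rs' g4

which should print

   val r4 = Prolog.Yes [] : outcome

The two taints rules are just the two path rules in disguise: the result of getuid() flows into euid exactly as a path runs from a to d, however many assignments lie in between.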
Now we need to show it could re-write those programs. This is easy, because in fact a C compiler already does this. It has to, because the syntax of C alone is not sufficient to determine what is a well-formed C program and what is not. Here is an example of programmed re-writing of arbitrary fragments of a C program. What this does is split a compound typedef declaration (any compound typedef declaration whatsoever) into separate parts:

   val rewrite_typedef = rewriteq `
   ⟦declaration
      ⌜w:declaration-specifiers⌝
      (init-declarator-list
         ⌜x:init-declarator-list⌝
         ⌜y:init-declarator⌝)⟧
    = (declaration-list
         (declaration-list ⟦declaration ⌜w⌝ ⌜x⌝⟧)
         (declaration ⌜w⌝ (init-declarator-list ⌜y⌝)))`

So taking this typedef:

   val td = absyntq `typedef unsigned int ui, *pui, (*funcp) (int i);`

we produce this rewritten abstract syntax, with three separate typedef declarations:

   external-declarations:
     external-declaration:
       declaration-list:
         declaration-list:
           declaration-list:
             declaration-list:
               declaration:
                 declaration-specifiers:
                   storage-class-specifier: typedef
                   declaration-specifiers:
                     type-specifier: unsigned
                     declaration-specifiers:
                       type-specifier: int
                 init-declarator-list:
                   init-declarator:
                     declarator:
                       direct-declarator:
                         identifier: ui
             declaration:
               declaration-specifiers:
                 storage-class-specifier: typedef
                 declaration-specifiers:
                   type-specifier: unsigned
                   declaration-specifiers:
                     type-specifier: int
               init-declarator-list:
                 init-declarator:
                   declarator:
                     pointer:
                     direct-declarator:
                       identifier: pui
         declaration:
           declaration-specifiers:
             storage-class-specifier: typedef
             declaration-specifiers:
               type-specifier: unsigned
               declaration-specifiers:
                 type-specifier: int
           init-declarator-list:
             init-declarator:
               declarator:
                 direct-declarator:
                   direct-declarator:
                     declarator:
                       pointer:
                       direct-declarator:
                         identifier: funcp
                   parameter-type-list:
                     parameter-list:
                       parameter-declaration:
                         declaration-specifiers:
                           type-specifier: int
                         declarator:
                           direct-declarator:
                             identifier: i

The general system, capable of handling extremely complex rewrites like this, which can be nested and called (conditionally and/or recursively) one from another, with parameters, is implemented in about 400 lines of rather badly written Standard ML. It is not Prolog, but it uses a similar unification algorithm to identify certain large-scale structures independently of many details of the particular instances, which can vary quite significantly.

These two programs, the Prolog interpreter and the rewrite engine, compile to 15,533 and 24,535+28,944 bytes respectively of CAML bytecode. I would not expect a C implementation of similar algorithms to be much bigger. If anyone doubts someone's chances of effectively hiding this much code in a C compiler binary, they should consider that on my system, between GCC 4.5 and GCC 4.9, the combined compiler binaries gcc, cpp, cc1, as and ld grew from 10 MB to over 70 MB. So I think it would not be too difficult to hide an extra 80k of object code in such a binary; that represents an increase of around one tenth of one percent of the total 70 MB (which I consider quite ridiculous, by the way; one possible explanation is that GCHQ and the NSA are playing core-wars in the GCC binaries).

Now those still reading this will appreciate that the rewrite engine and the Prolog interpreter are quite abstract, in the sense that they interpret _arbitrary_ rules, and so they could easily be implemented in such a way that the rules can be changed after the fact, by coding new rules into the compiler's source code. The coding could be in comments, or it could be in the choice of variable names, or it could be in terms of the number of spaces appearing at the ends of lines which are certain successive prime multiples of lines apart.
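As a sketch of how little machinery the last of these channels would need at the receiving end, here is a hypothetical decoder (the encoding is invented for illustration: one covert bit per line, carried by the presence or absence of a trailing space):

   (* One covert bit per source line: 1 if the line ends in a space,
      0 otherwise.  A real trigger would use a sparser, keyed schedule
      of lines, but the principle is the same. *)
   fun endsInSpace line =
          size line > 0 andalso String.sub (line, size line - 1) = #" "

   fun covertBits source =
          List.map (fn line => if endsInSpace line then 1 else 0)
                   (String.fields (fn c => c = #"\n") source)

Eight such bits to the byte, and a modest source file carries a new rule or two straight past review, because most diff and review tools treat trailing whitespace as noise.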
Then, by applying a publicly scrutinised patch to the compiler source, the penetrators could reprogram the trap door already present in all the binaries of the previous versions of the compiler, so that they recognise and effectively compromise the new compiler's source structure. And while they're at it, they may as well change the coding system used to transmit new rules, so that no more than one instance of any encrypted message ever exists.

This is only the tip of the iceberg; and the less evidence I see that people are properly addressing this 'issue', the more of these 'suggestions' I will publish over the next few days. And I think that after not so very long, people will be less inclined to make unfounded claims about the relative improbability of such an attack. I'm looking forward to it: some of them are really, really whacky! And needless to say, none of them is original. This last one is just a standard instance of the teletype string-trigger trap door that Karger and Schell described 40 years ago in the Multics Security Evaluation: Vulnerability Analysis. That document is well worth a careful read.

The First World Information War started decades ago, we think, but some people apparently haven't cottoned on yet.

Happy hacking, everyone!

Ian