Perhaps people would enjoy reading about how one could inject trojan semantics into a compiler in such a way that the trojan has some resilience to all but quite major structural changes to the target source code? This is not just for the purposes of frightening the birds. It will also show how useful tools which automatically analyse and transform program syntax could be. Imagine this in a 'whole program' IDE, so that one could query certain classes of variable assignment in an Operating System kernel, say, and program global re-writes based on pattern matching of whole syntactic structures, rather than just the sub-lexical matching which regular expressions offer.

Here is an implementation of a Prolog interpreter. The interpreter itself is only 88 lines of Standard ML. Most of it is written using list functionals such as map, exists, find and partition, so it should be clear that it could quite easily be translated into a fairly efficient C program.

   (* (Apart from the != predicate) This is a transliteration of the
      Prolog interpreter in John Harrison's Cambridge lecture notes. *)

   datatype term =
      Fn of string * term list
    | Var of string

   datatype outcome =
      Yes of (string * term) list
    | No

   local
      open GrammarSyntax
      fun parser startsymbol =
         let fun parse file stream lexbuf =
                let val expr = startsymbol PrologLexer.Token lexbuf
                in Parsing.clearParser(); expr
                end
                handle exn => (Parsing.clearParser(); raise exn)
         in parse
         end
      fun processString pfn =
             fn s => pfn "string" s (Lexing.createLexerString s)
      fun qlToString l =
         let fun iter r [] = r
               | iter r ((QUOTE s)::fs) = iter (r^s) fs
               | iter r ((ANTIQUOTE s)::fs) = iter (r^s) fs
         in iter "" l
         end
      val parserulesq = processString (parser PrologParser.File) o qlToString
      val parsetermq = processString (parser PrologParser.OneTerm) o qlToString
      fun fails s t = (printTree t; raise Fail (s^": no case."))
      fun mkTerm (NonTerm("atom",[NonTerm("constant",[Term(name)]),
                                  l as NonTerm("term-list",_)]))
             = Fn(name, mkArgs l)
        | mkTerm (NonTerm("atom",[NonTerm("constant",[Term(name)])]))
             = Fn(name, [])
        | mkTerm (NonTerm("atom",[NonTerm("variable",[Term(name)])]))
             = Var(name)
        | mkTerm (NonTerm("atom",[l as NonTerm("list",_)]))
             = mkList(l)
        | mkTerm t = fails "mkTerm" t
      and mkList (NonTerm("list",[a as NonTerm("atom",_),
                                  l as NonTerm("list",_)]))
             = Fn(".", [mkTerm a, mkList l])
        | mkList (NonTerm("list",[a as NonTerm("atom",_)]))
             = Fn(".", [mkTerm a, Fn("[]",[])])
        | mkList (NonTerm("list",[a as NonTerm("atom",_),
                                  a' as NonTerm("atom",_)]))
             = Fn(".", [mkTerm a, mkTerm a'])
        | mkList (NonTerm("list",[]))
             = Fn("[]",[])
        | mkList t = fails "mkList" t
      and mkArgs (NonTerm("term-list",[a as NonTerm("atom",_),
                                       l as NonTerm("term-list",_)]))
             = (mkTerm a)::(mkArgs l)
        | mkArgs (NonTerm("term-list",[a as NonTerm("atom",_)]))
             = [mkTerm a]
        | mkArgs t = fails "mkArgs" t
      fun mkTerms (NonTerm("terms",[a as NonTerm("atom",_),
                                    ts as NonTerm("terms",_)]))
             = (mkTerm a)::(mkTerms ts)
        | mkTerms (NonTerm("terms",[a as NonTerm("atom",_)]))
             = [mkTerm a]
        | mkTerms t = fails "mkTerms" t
      and mkRule (NonTerm("rule",[a as NonTerm("atom",_),
                                  ts as NonTerm("terms",_)]))
             = (mkTerm a, mkTerms ts)
        | mkRule (NonTerm("rule",[a as NonTerm("atom",_)]))
             = (mkTerm a, [])
        | mkRule t = fails "mkRule" t
      and mkRules (NonTerm("rule-list",[l as NonTerm("rule-list",_),
                                        r as NonTerm("rule",_)]))
             = (mkRule r)::(mkRules l)
        | mkRules (NonTerm("rule-list",[r as NonTerm("rule",_)]))
             = [mkRule r]
        | mkRules t = fails "mkRules" t
   in
      val rules = List.rev o mkRules o parserulesq
      val goal = mkTerm o parsetermq
   end
   local
      fun occurs_in x =
             fn (Var y) => x = y
              | (Fn(_,l)) => List.exists (occurs_in x) l
      fun assoc i =
             (Option.map (fn (_,v) => v)) o (List.find (fn (k,_) => k = i))
      fun subs insts =
             fn (tm as (Var y)) =>
                   (case assoc y insts of NONE => tm | SOME v => v)
              | (Fn(s,l)) => Fn(s, List.map (subs insts) l)
      fun augment1 theta (x,s) =
         let val s' = subs theta s
         in if occurs_in x s andalso not (s = Var(x))
               then raise Fail "Occurs check."
               else (x,s')
         end
      fun raw_augment p insts = p::(List.map (augment1 [p]) insts)
      fun augment (v,t) insts =
         let val t' = subs insts t
         in case t' of
               Var (w) =>
                  if w <= v
                     then if w = v then insts
                          else raw_augment (v,t') insts
                     else raw_augment (w, Var(v)) insts
             | _ => if occurs_in v t'
                       then raise Fail "Occurs check."
                       else raw_augment (v,t') insts
         end
      fun itlist2 f =
         let fun iter [] [] = (fn b => b)
               | iter (h::t) (h'::t') = (fn b => f h h' (itlist2 f t t' b))
               | iter _ _ = raise Fail "Arity."
         in iter
         end
      fun unify tm1 tm2 insts =
         case tm1 of
            Var(x) =>
               (case assoc x insts of
                   NONE => augment (x,tm2) insts
                 | SOME tm1' => unify tm1' tm2 insts)
          | Fn(f1,args1) =>
               (case tm2 of
                   (Var(y)) =>
                      (case assoc y insts of
                          NONE => augment (y,tm1) insts
                        | SOME tm2' => unify tm1 tm2' insts)
                 | Fn(f2,args2) =>
                      if f1 = f2
                         then itlist2 unify args1 args2 insts
                         else raise Fail ("Constants: mismatch: "^f1^"<>"^f2))
      fun rename s =
             fn (Var v) => Var("~"^v^s)
              | (Fn(f,args)) => Fn(f, List.map (rename s) args)
      fun rename_rule s (conc,assums) =
             (rename s conc, List.map (rename s) assums)
      fun expand n rules insts goals =
         let fun first f =
                (fn [] => raise Fail "No rules apply."
                  | (h::t) => (f h handle Fail _ => first f t))
             fun search rule =
                if goals = [] then insts
                else
                   let fun eqgoal (Fn("!=",[a,b])) = not (a = b)
                         | eqgoal _ = false
                       val (conc,assums) = rename_rule (Int.toString n) rule
                       val goal = hd goals
                       val insts' =
                              if eqgoal goal
                                 then insts (* CHECK! This is probably wrong. *)
                                 else unify conc goal insts
                       fun occurs (v,_) =
                              occurs_in v conc orelse
                              List.exists (occurs_in v) assums
                       val (loc,glob) = List.partition occurs insts'
                       val goals' = (List.map (subs loc) assums) @ (tl goals)
                   in expand (n+1) rules glob goals'
                   end
         in first search rules
         end
   in
      fun prolog rules goal =
         let val insts = expand 0 rules [] [goal]
         in Yes (List.filter (fn (v,_) => occurs_in v goal) insts)
         end
         handle Fail _ => No
   end

Here is an example of what it can do (a rather tired old one, but bear with us, because we will shortly reinterpret it as something more interesting):

   val () = load "Prolog";
   val () = Meta.quotation := true;

   val rs = Prolog.rules `
   male(albert).
   male(edward).
   female(alice).
   female(victoria).
   father_of(albert,edward).
   father_of(albert,alice).
   mother_of(victoria,edward).
   mother_of(victoria,alice).
   parents(X,M,F):-mother_of(M,X),father_of(F,X).
   sister_of(X,Y):- female(X), parents(X,M,F), parents(Y,M,F).
   brother_of(X,Y):- !=(X,Y), male(X), parents(X,M,F), parents(Y,M,F).
   `;

This defines a database of facts, the so-called ground terms, which are those which don't contain any logical variables (in Prolog these are typically the all-upper-case identifiers). Then there are _rules_ which define relations (in the set-theoretical sense of the term, but here also in the genealogical sense). Note that the two-place predicate sister_of is indirectly defined in terms of another rule, which defines the parents relation.
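It is worth seeing what such a quotation actually becomes: the parser turns each fact and goal into a value of the term datatype above. Here is a small illustration, assuming the session just shown (the names g0 and g0' are invented here):

   val g0 = Prolog.goal `male(albert)`

   (* g0 is precisely this term value: *)
   val g0' = Prolog.Fn("male", [Prolog.Fn("albert", [])])

   val same = (g0 = g0')   (* true; term admits structural equality *)

The interpreter never sees 'Prolog source' at all, only trees of Fn and Var nodes.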
But these 88 lines of code will resolve the indirection in sister_of, and so they can correctly deduce, for example:

   val g = Prolog.goal `sister_of(alice,edward)`
   val r = Prolog.prolog rs g

giving

   val r = Prolog.Yes [] : outcome

Now look at the database of facts, and note that it is really just an abstract syntax representation of a collection of statements in an arbitrary language (i.e. a language with undefined syntax). Therefore one can apply Prolog deductions, quite trivially, to sentences in a defined programming language such as C. Now the C compiler, which we imagine we are attacking, already has all the code needed to analyse and manipulate abstract syntax representations of C programs, because that is its job! So I hope people can see, without my having to install it into GCC 4.9, that one can easily implement, in very little code, a Prolog interpreter which can decide a quite general class of predicates concerning a program whose structure is represented in the compiler as abstract syntax.

Now note that the predicates need not have a fixed 'depth'. Here are two rules defining a path in an arbitrary directed graph, which is a non-wellfounded structure capable of representing almost anything, including the sentences of any language defined by a context-free grammar. The ground terms here represent a graph in three disjoint parts, with two cycles:

   val rs'' = Prolog.rules `
   edge(g,h).
   edge(e,f).
   edge(f,e).
   edge(a,b).
   edge(b,c).
   edge(c,d).
   edge(d,a).
   path(A,B):-edge(A,B).
   path(A,B):-edge(A,C),path(C,B).
   `;

Prolog can determine whether or not there is a path from one vertex to another, no matter how far apart they may be:

   val g2 = Prolog.goal `path(a,d)`
   val r2 = Prolog.prolog rs'' g2

It prints

   val r2 = Prolog.Yes [] : outcome

And it can tell which vertices are reachable from a given vertex:

   val g3 = Prolog.goal `path(b,X)`
   val r3 = Prolog.prolog rs'' g3

The result is

   val r3 = Prolog.Yes [("X", Prolog.Fn("c", []))] : outcome

In this implementation it only prints the first such vertex, but it could find them all. So it should not require any great stretching of people's credulity to see that a very few lines of intelligently written code can recognise large-scale structure in abstract syntax, even when quite extensive changes may have been made to the leaves.
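To make the connection with program syntax concrete, here is the same trick applied to a hypothetical encoding of a C parse tree. Suppose the compromised compiler emitted a ground term assign(v,e) for each assignment it encounters; the facts and the taints predicate below are invented for illustration, but nothing in the interpreter itself needs to change:

   val rs' = Prolog.rules `
   assign(uid,call_getuid).
   assign(euid,uid).
   taints(E,V):-assign(V,E).
   taints(E,V):-assign(W,E),taints(W,V).
   `;

   val g4 = Prolog.goal `taints(call_getuid,euid)`
   val r4 = Prolog.prolog rs' g4

which should print

   val r4 = Prolog.Yes [] : outcome

The two taints rules are just the two path rules in disguise: the result of getuid() flows into euid exactly as a path runs from a to d, however many assignments lie in between.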
Now we need to show it could re-write those programs. This is easy, because in fact a C compiler already does this. It has to, because the syntax of C alone is not sufficient to determine what is a well-formed C program and what is not. Here is an example of programmed re-writing of arbitrary fragments of a C program. What this does is split a compound typedef declaration (any compound typedef declaration whatsoever) into separate parts:

   val rewrite_typedef = rewriteq `
   ⟦declaration
      ⌜w:declaration-specifiers⌝
      (init-declarator-list
         ⌜x:init-declarator-list⌝
         ⌜y:init-declarator⌝)⟧
    = (declaration-list
         (declaration-list ⟦declaration ⌜w⌝ ⌜x⌝⟧)
         (declaration ⌜w⌝ (init-declarator-list ⌜y⌝)))`

So taking this typedef:

   val td = absyntq `typedef unsigned int ui, *pui, (*funcp) (int i);`

we produce this rewritten abstract syntax, with three separate typedef declarations:

   external-declarations:
     external-declaration:
       declaration-list:
         declaration-list:
           declaration-list:
             declaration-list:
               declaration:
                 declaration-specifiers:
                   storage-class-specifier: typedef
                   declaration-specifiers:
                     type-specifier: unsigned
                     declaration-specifiers:
                       type-specifier: int
                 init-declarator-list:
                   init-declarator:
                     declarator:
                       direct-declarator:
                         identifier: ui
             declaration:
               declaration-specifiers:
                 storage-class-specifier: typedef
                 declaration-specifiers:
                   type-specifier: unsigned
                   declaration-specifiers:
                     type-specifier: int
               init-declarator-list:
                 init-declarator:
                   declarator:
                     pointer:
                     direct-declarator:
                       identifier: pui
         declaration:
           declaration-specifiers:
             storage-class-specifier: typedef
             declaration-specifiers:
               type-specifier: unsigned
               declaration-specifiers:
                 type-specifier: int
           init-declarator-list:
             init-declarator:
               declarator:
                 direct-declarator:
                   direct-declarator:
                     declarator:
                       pointer:
                       direct-declarator:
                         identifier: funcp
                   parameter-type-list:
                     parameter-list:
                       parameter-declaration:
                         declaration-specifiers:
                           type-specifier: int
                         declarator:
                           direct-declarator:
                             identifier: i

The general system, capable of handling extremely complex rewrites like this, which can be nested and called (conditionally and/or recursively) one from another, with parameters, is implemented in about 400 lines of rather badly written Standard ML. It is not Prolog, but it uses a similar unification algorithm to identify certain large-scale structures independently of many details of the particular instances, which can vary quite significantly.

These two programs, the Prolog interpreter and the rewrite engine, compile to 15,533 and 24,535+28,944 bytes respectively of CAML bytecode. I would not expect a C implementation of similar algorithms to be much bigger. If anyone doubts someone's chances of effectively hiding this much code in a C compiler binary, they should consider that on my system, between GCC 4.5 and GCC 4.9, the combined compiler binaries gcc, cpp, cc1, as and ld grew from 10 MB to over 70 MB. So I think it would not be too difficult to hide an extra 80k of object code in such a binary; that represents an increase of around one tenth of one percent of the total 70 MB (which I consider quite ridiculous, by the way; one possible explanation is that GCHQ and the NSA are playing core-wars in the GCC binaries).

Now those still reading this will appreciate that the rewrite engine and the Prolog interpreter are quite abstract, in the sense that they interpret _arbitrary_ rules, and so they could easily be implemented in such a way that the rules can be changed after the fact, by coding new rules into the compiler's source code. The coding could be in comments, or it could be in the choice of variable names, or it could be in terms of the number of spaces appearing at the ends of lines which are certain successive prime multiples of lines apart.
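As a sketch of how little machinery the last of these channels would need at the receiving end, here is a hypothetical decoder (the encoding is invented for illustration: one covert bit per line, carried by the presence or absence of a trailing space):

   (* One covert bit per source line: 1 if the line ends in a space,
      0 otherwise.  A real trigger would use a sparser, keyed schedule
      of lines, but the principle is the same. *)
   fun endsInSpace line =
          size line > 0 andalso String.sub (line, size line - 1) = #" "

   fun covertBits source =
          List.map (fn line => if endsInSpace line then 1 else 0)
                   (String.fields (fn c => c = #"\n") source)

Eight such bits to the byte, and a modest source file carries a new rule or two straight past review, because most diff and review tools treat trailing whitespace as noise.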
Then, by applying a publicly scrutinised patch to the compiler source, the penetrators could reprogram the trap door already present in all the binaries of the previous versions of the compiler, so that they recognise and effectively compromise the new compiler's source structure. And while they're at it, they may as well change the coding system used to transmit new rules, so that no more than one instance of any encrypted message ever exists.

This is only the tip of the iceberg; and the less evidence I see that people are properly addressing this 'issue', the more of these 'suggestions' I will publish over the next few days. And I think that after not so very long, people will be less inclined to make unfounded claims about the relative improbability of such an attack. I'm looking forward to it: some of them are really, really whacky! And needless to say, none of them is original. This last one is just a standard instance of the teletype string-trigger trap door that Karger and Schell described 40 years ago in the Multics Security Evaluation: Vulnerability Analysis. That document is well worth a careful read.

The First World Information War started decades ago, we think, but some people apparently haven't cottoned on yet.

Happy hacking, everyone!

Ian