Re: Reinterpreting the compiler source code

From: Bruno Loff <bruno.loff-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
To: Ian Grant <ian.a.n.grant-gM/Ye1E23mwN+BqQ9rBEUg@public.gmane.org>
Cc: Richard Stallman <rms-mXXj517/zsQ@public.gmane.org>,
	guile-devel-mXXj517/zsQ@public.gmane.org,
	Theo deRaadt <deraadt-T7FYYhErWq4AvxtiuMwx3w@public.gmane.org>,
	Markus Kuhn <Markus.Kuhn-kDbDZe0LBGWFxr2TtlUqVg@public.gmane.org>,
	Linus Torvalds <torvalds-3NddpPZAyC0@public.gmane.org>,
	lightning <lightning-mXXj517/zsQ@public.gmane.org>,
	schellr-EkmVulN54Sk@public.gmane.org
Subject: Re: Reinterpreting the compiler source code
Date: Thu, 4 Sep 2014 21:57:03 +0100	[thread overview]
Message-ID: <CAGOfsMgZE==zCC2OmEPtbH_ph1g1-AhWikioSdZJFep8crP3vQ@mail.gmail.com> (raw)
In-Reply-To: <CAKFjmdx4Zm2_HVEUre-PYvcZrm41gKv-2z_EsGehjv6NbVxBAw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

[-- Attachment #1.1: Type: text/plain, Size: 19677 bytes --]

Even if the threat you describe is real, why couldn't it be resolved in a
simpler way?

I'm thinking certificates and signatures...

I mean, for each distribution in each architecture, there is basically a
single build configuration which needs to be certified as "trojan-free".
And this is basically already implemented, provided that I get my compiler
package from a clean source which signed the package, all that is necessary
is that the clean source took special care in ensuring that their version
is in fact bug-free...

That solves 99% of the use-cases, when you use a standard compiler provided
by a clean source. For the remaining .99%, when you use a custom-built
compiler, say, you could start with a compiler from a clean source and use
that to compile your custom-build... No bug injection would happen then.

Wouldn't that be enough? Formal certification of the cleanliness of the
compiler (if I understood correctly and this is what you propose) seems
over-the-top to me.

Happy hacking :-)

On 4 September 2014 18:33, Ian Grant <ian.a.n.grant-gM/Ye1E23mwN+BqQ9rBEUg@public.gmane.org> wrote:

> Perhaps people would enjoy reading about how one could inject trojan
> semantics into a compiler in such a way that they have some resilience
> to all but quite major structural changes to the target source code?
>
> This is not just for the purposes of frightening the birds. It will
> show how useful tools which automatically analyse and transform
> program syntax could be. Imagine this in a 'whole program' IDE, so
> that one could query certain classes of variable assignment in an
> Operating System kernel, say, and program global re-writes based on
> pattern matching of whole syntactic structures, rather than just the
> sub-lexical matching which regular expressions offer.
>
> Here is an implementation of a Prolog interpreter. The interpreter
> itself is only 88 lines of Standard ML. Most of it is written using
> list functionals like map, exists, find, partition etc. So it should
> be clear that it could be quite easily translated into a fairly
> efficient C program.
>
> (* (Apart from the != predicate) This is a transliteration of the
>    Prolog interpreter in John Harrison's Cambridge lecture notes. *)
>
> datatype term =
>    Fn of string * term list
>  | Var of string
>
> datatype outcome =
>    Yes of (string * term) list
>  | No
>
> local open GrammarSyntax
>    fun parser startsymbol =
>        let fun parse file stream lexbuf =
>           let val expr = startsymbol PrologLexer.Token lexbuf
>           in Parsing.clearParser();
>             expr
>           end handle exn => (Parsing.clearParser(); raise exn)
>        in parse end
>    fun processString pfn = fn s =>
>       pfn "string" s (Lexing.createLexerString s)
>    fun qlToString l =
>       let fun iter r [] = r
>             | iter r ((QUOTE s)::fs) = iter (r^s) fs
>             | iter r ((ANTIQUOTE s)::fs) = iter (r^s) fs
>       in iter "" l
>       end
>    val parserulesq = processString (parser PrologParser.File) o qlToString
>    val parsetermq = processString (parser PrologParser.OneTerm) o
> qlToString
>    fun fails s t = (printTree t;raise Fail (s^": no case."))
>    fun mkTerm (NonTerm("atom",
>                        [NonTerm("constant",
>                                 [Term(name)]),
>                         l as NonTerm("term-list",_)]))
>            = Fn(name,mkArgs l)
>      | mkTerm (NonTerm("atom",[NonTerm("constant",[Term(name)])]))
>            = Fn(name,[])
>      | mkTerm (NonTerm("atom",[NonTerm("variable",[Term(name)])]))
>            = Var(name)
>      | mkTerm (NonTerm("atom",[l as NonTerm("list",_)]))
>            = mkList(l)
>      | mkTerm t = fails "mkTerm" t
>    and mkList (NonTerm("list",[a as NonTerm("atom",_),
>                                l as NonTerm("list",_)]))
>            = Fn(".",[mkTerm a,mkList l])
>      | mkList (NonTerm("list",[a as NonTerm("atom",_)]))
>            = Fn(".",[mkTerm a,Fn("[]",[])])
>      | mkList (NonTerm("list",[a as NonTerm("atom",_),
>                                a' as NonTerm("atom",_)]))
>            = Fn(".",[mkTerm a,mkTerm a'])
>      | mkList (NonTerm("list",[]))
>            = Fn("[]",[])
>      | mkList t =  fails "mkList" t
>    and mkArgs (NonTerm("term-list",[a as NonTerm("atom",_),
>                                     l as NonTerm("term-list",_)]))
>            = (mkTerm a)::(mkArgs l)
>      | mkArgs (NonTerm("term-list",[a as NonTerm("atom",_)]))
>            = [mkTerm a]
>      | mkArgs t = fails "mkArgs" t
>    fun mkTerms (NonTerm("terms",[a as NonTerm("atom",_),
>                                  ts as NonTerm("terms",_)]))
>                  = (mkTerm a)::(mkTerms ts)
>      | mkTerms (NonTerm("terms",[a as NonTerm("atom",_)]))
>                  = [mkTerm a]
>      | mkTerms t =  fails "mkTerms" t
>    and mkRule (NonTerm("rule",[a as NonTerm("atom",_),
>                                ts as NonTerm("terms",_)]))
>                  = (mkTerm a, mkTerms ts)
>      | mkRule (NonTerm("rule",[a as NonTerm("atom",_)]))
>                  = (mkTerm a,[])
>      | mkRule t = fails "mkRule" t
>    and mkRules (NonTerm("rule-list",[l as NonTerm("rule-list",_),
>                                      r as NonTerm("rule",_)]))
>          = (mkRule(r))::(mkRules l)
>      | mkRules (NonTerm("rule-list",[r as NonTerm("rule",_)]))
>          = [mkRule(r)]
>      | mkRules t = fails "mkRules" t
> in
>    val rules = List.rev o mkRules o parserulesq
>    val goal = mkTerm o parsetermq
> end
>
> local
>    fun occurs_in x =
>       fn (Var y) => x = y
>        | (Fn(_,l)) => List.exists (occurs_in x) l
>    fun assoc i =
>       (Option.map (fn (_,v) => v)) o (List.find (fn (k,_) => k = i))
>    fun subs insts =
>      fn (tm as (Var y)) =>
>            (case assoc y insts of NONE => tm | SOME v => v)
>       | (Fn(s,l)) =>
>            Fn(s,List.map (subs insts) l)
>    fun augment1 theta (x,s) =
>       let val s' = subs theta s
>       in if occurs_in x s andalso not (s = Var(x))
>             then raise Fail "Occurs check."
>             else (x,s')
>       end
>    fun raw_augment p insts = p::(List.map (augment1 [p]) insts)
>    fun augment (v,t) insts =
>       let val t' = subs insts t
>       in case t'
>            of Var (w) =>
>                 if w <= v
>                   then if w = v
>                           then insts
>                           else raw_augment (v,t') insts
>                   else raw_augment (w,Var(v)) insts
>             | _ => if occurs_in v t'
>                       then raise Fail "Occurs check."
>                       else raw_augment (v,t') insts
>       end
>     fun itlist2 f =
>       let fun iter [] [] = (fn b => b)
>             | iter (h::t) (h'::t') = (fn b => f h h' (itlist2 f t t' b))
>             | iter _ _ = raise Fail "Arity."
>       in iter
>       end
>    fun unify tm1 tm2 insts =
>       case tm1
>         of Var(x) =>
>            (case assoc x insts
>               of NONE => augment (x,tm2) insts
>                | SOME tm1' => unify tm1' tm2 insts)
>          | Fn(f1,args1) =>
>            (case tm2
>               of (Var(y)) =>
>                  (case assoc y insts
>                     of NONE => augment (y,tm1) insts
>                      | SOME tm2' => unify tm1 tm2' insts)
>                | Fn(f2,args2) =>
>                    if f1 = f2 then itlist2 unify args1 args2 insts
>                               else raise Fail ("Constants: mismatch:
> "^f1^"<>"^f2))
>    fun rename s =
>       fn (Var v) => Var("~"^v^s)
>        | (Fn(f,args)) => Fn(f,List.map (rename s) args)
>    fun rename_rule s (conc,assums) =
>        (rename s conc,List.map (rename s) assums)
>    fun expand n rules insts goals =
>       let fun first f =
>                 (fn [] => raise Fail "No rules apply."
>                   | (h::t) => (f h handle Fail _ => first f t))
>           fun search rule =
>              if goals = []
>                 then insts
>                 else
>                    let fun eqgoal (Fn("!=",[a,b])) = not (a = b)
>                          | eqgoal _ = false
>                        val (conc,assums) = rename_rule (Int.toString n)
> rule
>                        val goal = hd goals
>                        val insts' = if eqgoal goal
>                                        then insts (* CHECK! This is
> probably wrong. *)
>                                        else unify conc goal insts
>                        fun occurs (v,_) = occurs_in v conc
>                                    orelse List.exists (occurs_in v) assums
>                        val (loc,glob) = List.partition occurs insts'
>                        val goals' = (List.map (subs loc) assums) @ (tl
> goals)
>                    in expand (n+1) rules glob goals'
>                    end
>       in
>          first search rules
>       end
> in
>    fun prolog rules goal =
>       let val insts = expand 0 rules [] [goal]
>       in Yes (List.filter (fn (v,_) => occurs_in v goal) insts)
>       end handle Fail _ => No
> end
>
> Here is an example of what it can do (a rather tired old one, but bear
> with us, because we will shortly reinterpret this as something
> more interesting):
>
> val () = load "Prolog";
>
> val () = Meta.quotation := true;
>
> val rs = Prolog.rules `
>    male(albert).
>    male(edward).
>
>    female(alice).
>    female(victoria).
>
>    father_of(albert,edward).
>    father_of(albert,alice).
>
>    mother_of(victoria,edward).
>    mother_of(victoria,alice).
>
>    parents(X,M,F):-mother_of(M,X),father_of(F,X).
>
>    sister_of(X,Y):-
>       female(X),
>       parents(X,M,F),
>       parents(Y,M,F).
>
>    brother_of(X,Y):-
>       !=(X,Y),
>       male(X),
>       parents(X,M,F),
>       parents(Y,M,F).
> `;
>
> This defines a database of facts, the so-called ground terms, which
> are those which don't contain any logical variables (which in Prolog
> are typically all-upper-case identifiers). Then there are _rules_
> which define relations (in the set-theoretical sense of the term, but
> here also in the genealogical sense).
>
> Note that the two-place predicate sister_of is indirectly defined in
> terms another rule, which defines the parents relation. But these 88
> lines of code will resolve the indirection and so they can correctly
> deduce, for example:
>
>    val g = Prolog.goal `sister_of(alice,edward)`
>    val r = Prolog.prolog rs g
>
> giving
>
>   val r = Prolog.Yes [] : outcome
>
> Now look at the database of facts, and note that it is really just an
> abstract syntax representation of a collection of statements in an
> arbitrary language (i.e. a language with undefined syntax). Therefore
> one can apply Prolog deductions, quite trivially, to sentences in a
> defined programming language such as C. Now the C compiler, which we
> imagine we are attacking, already has all the code to analyse and
> manipulate abstract syntax representations of C programs, because that
> is its job! So I hope people can see, without my having to install it
> into GCC 4.9, that one can easily implement a Prolog interpreter in
> very little code, which can decide a quite general class of predicates
> concerning a program, the structure of which is represented in the
> compiler as abstract syntax.
>
> Now note that the predicates need not have a fixed 'depth.' Here are
> two rules defining a path in an arbitrary directed graph, which is a
> non-wellfounded structure capable of representing almost anything,
> including the sentences of any language defined by a context-free
> grammar.
>
> Here the ground terms represent a graph in three disjoint parts, with
> two cycles.
>
> val rs'' = Prolog.rules `
>    edge(g,h).
>    edge(e,f).
>    edge(f,e).
>    edge(a,b).
>    edge(b,c).
>    edge(c,d).
>    edge(d,a).
>    path(A,B):-edge(A,B).
>    path(A,B):-edge(A,C),path(C,B).
> `;
>
> Prolog can determine whether or not there is a path from one vertex to
> another, no matter how far apart they may be:
>
>    val g2 = Prolog.goal `path(a,d)`
>    val r2 = Prolog.prolog rs'' g2
>
> It prints
>
>   val r2 = Prolog.Yes [] : outcome
>
> And it can tell which vertices are reachable from a given vertex:
>
>    val g3 = Prolog.goal `path(b,X)`
>    val r3 = Prolog.prolog rs'' g3
>
> The result is
>
>   val r3 = Prolog.Yes [("X", Prolog.Fn("c", []))] : outcome
>
> In this implementation it only prints the first such edge, but it
> could find them all.
>
> So it should not require any great stretching of people's credulity to
> see that a very few lines of intelligently written code can recognise
> large-scale structure in abstract syntax, even when quite extensive
> changes may have been made to the leaves.
>
> Now we need to show it could re-write those programs. This is easy,
> because in fact a C compiler already does this. It has to, because the
> syntax of C is not sufficient to determine what is a well-formed C
> program and what is not. Here is an example of programmed re-writing
> of arbitrary fragments of a C program. What this does is split a
> compound typedef statement (any compound typedef statement whatsoever)
> into separate parts:
>
>    val rewrite_typedef =
>          rewriteq `
>             ⟦declaration
>                ⌜w:declaration-specifiers⌝
>                (init-declarator-list
>                   ⌜x:init-declarator-list⌝
>                   ⌜y:init-declarator⌝)⟧ =
>                (declaration-list
>                   (declaration-list
>                      ⟦declaration ⌜w⌝ ⌜x⌝⟧)
>                   (declaration ⌜w⌝ (init-declarator-list ⌜y⌝)))`
>
> So taking this typedef:
>
>    val td = absyntq `typedef unsigned int ui, *pui, (*funcp) (int i);`
>
> we produce this rewritten abstract syntax with three separate
> typedef declarations:
>
> external-declarations:
>   external-declaration:
>     declaration-list:
>       declaration-list:
>         declaration-list:
>           declaration-list:
>             declaration:
>               declaration-specifiers:
>                 storage-class-specifier: typedef
>                 declaration-specifiers:
>                   type-specifier: unsigned
>                   declaration-specifiers:
>                     type-specifier: int
>               init-declarator-list:
>                 init-declarator:
>                   declarator:
>                     direct-declarator:
>                       identifier: ui
>           declaration:
>             declaration-specifiers:
>               storage-class-specifier: typedef
>               declaration-specifiers:
>                 type-specifier: unsigned
>                 declaration-specifiers:
>                   type-specifier: int
>             init-declarator-list:
>               init-declarator:
>                 declarator:
>                   pointer:
>                   direct-declarator:
>                     identifier: pui
>       declaration:
>         declaration-specifiers:
>           storage-class-specifier: typedef
>           declaration-specifiers:
>             type-specifier: unsigned
>             declaration-specifiers:
>               type-specifier: int
>         init-declarator-list:
>           init-declarator:
>             declarator:
>               direct-declarator:
>                 direct-declarator:
>                   declarator:
>                     pointer:
>                     direct-declarator:
>                       identifier: funcp
>                 parameter-type-list:
>                   parameter-list:
>                     parameter-declaration:
>                       declaration-specifiers:
>                         type-specifier: int
>                       declarator:
>                         direct-declarator
>                           identifier i
>
> The general system, capable of handling extremely complex rewrites
> like this, which can be nested and called (conditionally and/or
> recursively) one from another, with parameters, is implemented in
> about 400 lines of rather badly written Standard ML. It is not Prolog,
> but it uses a similar unification algorithm to identify certain
> large-scale structures independently of many details of particular
> instances, which can vary quite significantly.
>
> These programs, the Prolog interpreter and the Rewrite engine, compile
> to 15,533 and 24,535+28,944 bytes respectively of CAML bytecode. I
> would not expect a C implementation of similar algorithms to be much
> more than this. If anyone doubts someone's chances of effectively
> hiding this much code in a C compiler binary, then they should
> consider that on my system, between GCC 4.5 and GCC 4.9, the combined
> compiler binaries: gcc, cpp, cc1, as and ld grew from 10 MB to over 70
> MB. So I think it would not be too difficult to hide an extra 80k of
> object code in a binary. That represents an increase of around one
> tenth of one percent over the total 70MB (which I consider quite
> ridiculous, by the way. One possible explanation is that GCHQ and the
> NSA are playing core-wars in the GCC binaries.)
>
> Now those still reading this will appreciate that the rewrite engine,
> and the Prolog interpreter, are quite abstract in the sense that they
> interpret _arbitrary_ rules, and so they could easily be implemented
> in such a way that the rules can be changed after the fact, by coding
> new rules into the compiler's source code. The coding could be in
> comments, or it could be in the choice of variable names, or it could
> be in terms of the number of spaces appearing at the ends of lines
> which are certain successive prime multiples of lines apart. Then, by
> applying a publicly scrutinised patch to the compiler source, the
> penetrators could reprogram the trap door already in all the binaries
> of the previous versions of the new compiler, so that they recognise
> and effectively compromise the new compiler's source structure. And
> while they're at it, they may as well change the coding system used to
> transmit new rules, so that no more than one instance of any encrypted
> message ever exists.
>
> This is only the tip of the ice-burg; and the less evidence I see that
> people are properly addressing this 'issue', the more of these
> 'suggestions' I will publish over the next few days. And I think that
> after not so very long, people will be less inclined to make unfounded
> claims about the relative improbability of such an attack. I'm looking
> forward to it: some of them are really, really whacky! And needless to
> say, none of them are original. This last is just a standard instance
> of the teletype string trigger trap door that Karg and Schell
> described 40 years ago in the Multics Security Evaluation:
> Vulnerability Analysis. That document is well worth a careful
> read. The First World Information War started decades ago, we think,
> but some people apparently haven't cottoned-on yet.
>
> Happy hacking, everyone!
>
> Ian
>
>
> _______________________________________________
> Lightning mailing list
> Lightning-mXXj517/zsQ@public.gmane.org
> https://lists.gnu.org/mailman/listinfo/lightning
>
>

[-- Attachment #1.2: Type: text/html, Size: 24162 bytes --]

[-- Attachment #2: Type: text/plain, Size: 159 bytes --]

_______________________________________________
Lightning mailing list
Lightning-mXXj517/zsQ@public.gmane.org
https://lists.gnu.org/mailman/listinfo/lightning