Overview of the SML# compiler implementation.

author
YAMATODANI Kiyoshi
version
$Id: OverviewByYamatodani.html,v 1.1 2006/03/05 07:49:41 ohori Exp $

1, Software development

Software development is a process of translation between descriptions of the world, for the purpose of sharing a precise common understanding of the world among people and computers. Descriptions include formal ones, such as mathematical logic, and informal ones, such as natural language and diagrams. We call these descriptions, as a whole, "software".

Each description is based on some particular principle of recognition. In other words, it assumes that this principle governs the world in both of its aspects, space and time.

For example, we can perceive the world as filled with objects that change their state in response to the messages they receive. Alternatively, at the other extreme, everything in the world is a function, and functions are combined by function application to produce new entities.

Here, the "world" does not necessarily mean the whole of the world, the universe. Each development project targets a particular domain of the world.

In sum, software is a set of descriptions about the application domain in the world based on several recognition principles.


2, Programming language

A programming language is a description method that balances practical applicability with formal accuracy. It also embodies a particular recognition principle as its foundation. For example, Java assumes a world in which "everything is an object", and ML assumes one in which "everything is a function".


3, Compiler

Compilation is the last part of the translation process. Through compilation, a description from the human's point of view is translated into another description from the computer's point of view.

A compiler performs this translation. It accepts a source description and generates a destination description. It converts the recognition principles and representations of these descriptions while preserving the image they denote.

The source description is written in a programming language that is based on some characteristic principle, as exemplified above. At the other end, the characteristic of a description based on the computer's recognition is that it is "sequential". Atomic operations are arranged in a row according to their execution order. Entities are represented as sequences of bits, and each is denoted by its position within a sequence of memory slots. The compiler must convert between these principles.

Conversion of representations is also a task of the compiler. At the representation level, most compilers are translators from sequence to sequence. They accept a sequence of characters as input and generate a sequence of instructions as output.


4, Core modules of SML# compiler

Compilation is performed by a series of three essential operations.

  1. Accept a source description, usually in a sequential form.
  2. Convert the recognition principle.
  3. Generate a destination description, usually in a sequential form.

The SML# compiler implementation is composed of many modules. Among them, the core modules that perform the above operations are Parser, ANormalTransformer, Assembler and InstructionSerializer. These modules are invoked in the following sequence.

  1. Parser
  2. ANormalTransformer
  3. Assembler
  4. InstructionSerializer

Principle conversion is done by ANormalTransformer for the time aspect and by Assembler for the space aspect.

ANormalTransformer decomposes a source expression into a sequence of atomic computations arranged in their evaluation order. There are several ways to represent a sequence of atomic computations explicitly, such as CPS (continuation-passing style) and SSA (static single assignment). Among them, the SML# compiler adopts A-normal form.
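
As a rough illustration of what sequencing into A-normal form means, the following Python sketch flattens a nested arithmetic expression into a list of atomic bindings in evaluation order. The tuple-based expression encoding, the function name to_anf and the temporary names t1, t2, ... are assumptions made for this sketch; they are not the data structures of ANormalTransformer.

```python
import itertools

# Hypothetical mini expression language (not the SML# intermediate language):
# an expression is an int literal, a variable name (str),
# or a tuple (operator, left, right).

def to_anf(expr):
    """Return (bindings, result); each binding is (name, op, atom, atom)."""
    counter = itertools.count(1)
    bindings = []

    def visit(e):
        # Literals and variables are already atomic.
        if isinstance(e, (int, str)):
            return e
        op, left, right = e
        # Operands are visited left to right, so the bindings end up
        # in evaluation order.
        a = visit(left)
        b = visit(right)
        name = f"t{next(counter)}"
        bindings.append((name, op, a, b))
        return name

    result = visit(expr)
    return bindings, result

# (a + b) * (c - 1)
bindings, result = to_anf(("*", ("+", "a", "b"), ("-", "c", 1)))
for name, op, x, y in bindings:
    print(f"let {name} = {x} {op} {y}")
print(f"in {result}")
```

Every operand of every binding is a variable or a constant, which is the property that makes the evaluation order explicit.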

The source language ML assumes a kind of discrete world in which entities can be referred to only by symbolic names. Assembler maps these entities to relative positions and regions in a sequence that models the memory. Concretely, Assembler determines the memory locations where evaluation results are stored and where code is placed.
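
The mapping from symbolic names to positions can be sketched as follows. This is a simplified Python illustration that assumes a flat frame with one slot per name; the helper names assign_slots and resolve and the binding format are hypothetical and do not reflect the actual layout strategy of Assembler.

```python
def assign_slots(bindings, params):
    """Map every symbolic name to a slot index in a hypothetical frame."""
    slots = {}
    # Parameters take the first slots, in order.
    for name in params:
        slots[name] = len(slots)
    # Each bound temporary takes the next free slot.
    for name, _op, _x, _y in bindings:
        slots.setdefault(name, len(slots))
    return slots

def resolve(atom, slots):
    # Variables become slot references; integer constants stay constants.
    if isinstance(atom, str):
        return ("slot", slots[atom])
    return ("const", atom)

# Atomic bindings of the form (name, operator, operand, operand).
bindings = [("t1", "+", "a", "b"), ("t2", "-", "c", 1), ("t3", "*", "t1", "t2")]
slots = assign_slots(bindings, params=["a", "b", "c"])

# After resolution no symbolic name remains, only positions.
code = [
    (op, slots[name], resolve(x, slots), resolve(y, slots))
    for name, op, x, y in bindings
]
```

A real allocator would also reuse slots whose values are no longer needed; this sketch omits that to keep the name-to-position idea visible.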

Parser and InstructionSerializer perform representation conversion. Parser deserializes a sequence of characters into an abstract syntax tree. InstructionSerializer serializes a sequence of abstract instructions into a sequence of bytes.
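
The deserializing direction can be illustrated with a minimal recursive-descent parser in Python. The grammar (integers, identifiers, +, -, * and parentheses) and the tuple representation of the tree are assumptions chosen for brevity; the real Parser handles the full ML syntax.

```python
import re

# One token: an integer, an identifier, or a single operator/parenthesis.
TOKEN = re.compile(r"\s*(\d+|[A-Za-z_]\w*|[-+*()])")

def tokenize(text):
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character at offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens

def parse(text):
    """Turn a character sequence into a nested-tuple syntax tree."""
    tokens = tokenize(text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        if pos >= len(tokens):
            raise SyntaxError("unexpected end of input")
        token = tokens[pos]
        pos += 1
        return token

    def atom():
        token = advance()
        if token == "(":
            node = additive()
            if advance() != ")":
                raise SyntaxError("expected ')'")
            return node
        if token in {"+", "-", "*", ")"}:
            raise SyntaxError(f"unexpected token {token!r}")
        return int(token) if token.isdigit() else token

    def multiplicative():
        node = atom()
        while peek() == "*":
            advance()
            node = ("*", node, atom())
        return node

    def additive():
        node = multiplicative()
        while peek() in ("+", "-"):
            op = advance()
            node = (op, node, multiplicative())
        return node

    tree = additive()
    if peek() is not None:
        raise SyntaxError(f"unexpected token {peek()!r}")
    return tree

tree = parse("(a + b) * (c - 1)")
```

The flat character sequence carries the tree structure only implicitly, through parentheses and operator precedence; parsing makes that structure explicit.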

The following diagram shows the sequence of core modules and the forms of description passed between them.

   character sequence
           |
        (Parser)
           |
   abstract syntax tree
           |
   (ANormalTransformer)
           |
      A-normal form
           |
      (Assembler)
           |
   abstract instruction
           |
  (InstructionSerializer)
           |
      byte sequence


5, Compilation of high-level language

Compilation is just one part of the software development process. Manual translations precede this automatic translation. The initial informal image of the world is gradually formalized into a description in a formal language through manual translation phases such as requirements analysis, basic design, detailed design and so on.

Extending the range covered by automatic compilation, and thereby eliminating manual work, is one of the main directions of programming language research. A principal approach is to design programming languages that are closer to natural human recognition. This may require the invention of a novel programming principle. Alternatively, more moderate and immediate approaches aim to reduce the necessary volume of manual description by extending an existing language. The syntactic sugar of most languages and type inference in ML are examples.

Both approaches yield what we call "high-level languages", which can have a positive effect on the development process. But this also increases the distance between the programming language and the destination language of compilation. Therefore, a compiler for a high-level language must perform a complicated task. One approach to handling this complexity is to break the compilation process down into tractable phases.

Those phases can be divided into two groups: front-end phases that precede principle conversion and back-end phases that follow it. Front-end phases operate under the principle of the source language. Back-end phases operate under the principle of the destination language, or runtime.


6, Front-end phases

The transformation from a character sequence to A-normal form is divided into several phases. In the SML# compiler implementation, these phases are implemented by the modules from Parser to ANormalTransformer.

Parser generates an abstract syntax tree from a character sequence. ANormalTransformer performs the last part of the transformation from a syntax tree to A-normal form. The phases between them perform preliminary transformations toward A-normal form. Roughly speaking, these phases perform two kinds of transformations.

augmentation
These phases augment the syntax tree by annotating it with semantic information. Semantic information includes reference information and type information. Reference information links each reference to the binding it denotes. For example, a variable expression is annotated with information about the value binding to which the variable's name refers. Type information is attached to each expression and is used heavily by both front-end and back-end phases. Source code that is syntactically correct but semantically incorrect is rejected in this process.
granulation
These phases make the constituent terms of the syntax tree finer-grained by replacing them with combinations of more "atomic" terms. Which terms count as atomic depends on the definition of "atomicity" in A-normal form, which in turn depends on the target language of compilation. For example, a "case" expression is transformed into a combination of "compare-branch" expressions if "compare-branch" is atomic but "case" is not atomic in the A-normal form. (The current SML# compiler implementation transforms "case" to "switch".)
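
The "case" to "compare-branch" example of granulation can be sketched in Python as follows. The term encoding, the tag "if_eq" and the function names are hypothetical. As noted above, the current SML# implementation produces "switch" instead, so this illustrates the general idea rather than the actual transformation.

```python
def lower_case(scrutinee, branches, default):
    """Rewrite a case over constants as nested compare-branch terms.

    branches is a list of (constant, body) pairs tried in order.
    The result uses the hypothetical term
    ("if_eq", variable, constant, then_term, else_term).
    """
    result = default
    # Build from the last branch outwards so the first branch is tested first.
    for constant, body in reversed(branches):
        result = ("if_eq", scrutinee, constant, body, result)
    return result

def evaluate(term, env):
    # Tiny interpreter used only to show that the lowering preserves meaning.
    if isinstance(term, tuple) and term[0] == "if_eq":
        _, variable, constant, then_term, else_term = term
        chosen = then_term if env[variable] == constant else else_term
        return evaluate(chosen, env)
    return term

lowered = lower_case("x", [(0, "zero"), (1, "one")], "many")
```

Each resulting term performs a single comparison, so it can be treated as atomic by a target that offers only compare-and-branch.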

The front-end phases are implemented by the following modules.

They are invoked in this sequence.

Elaborator performs a part of granulation which does not require semantic information.

TypeInferencer performs augmentation.

MatchCompiler and RecordCompiler perform granulation of particular kinds of expressions. MatchCompiler targets pattern matching expressions, and RecordCompiler targets block manipulation expressions.

Bitmap compilation is a feature of the SML# compiler and runtime. It augments the source code with bitmap information. Bitmap information indicates the runtime types of variables and of the fields of blocks. It is calculated from the type information that annotates terms.
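
How a bitmap might be derived from type information can be sketched as follows. This Python fragment assumes, purely for illustration, that one bit per field records whether the field holds a pointer to a heap block. The set BOXED_TYPES, the bit order and the function name are assumptions, not the encoding used by the SML# runtime.

```python
# Assumed set of types represented as pointers in this sketch.
BOXED_TYPES = {"string", "record", "list"}

def field_bitmap(field_types):
    """Return an integer whose bit i is 1 when field i is assumed boxed."""
    bitmap = 0
    for index, type_name in enumerate(field_types):
        if type_name in BOXED_TYPES:
            bitmap |= 1 << index
    return bitmap

# A block with four fields; fields 1 and 3 are assumed to hold pointers.
bitmap = field_bitmap(["int", "string", "real", "list"])
print(format(bitmap, "04b"))
```

When a field has a polymorphic type, the bit cannot be computed at compile time from a fixed table like this one. It has to be supplied at run time, which is why table A.1 speaks of a bitmap-passing calculus.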

Finally, ANormalTransformer sequences the half-granulated syntax tree into A-normal form.


7, Back-end phases

Back-end phases follow front-end phases.

They calculate the locations of variables and code in the program. Depending on the target architecture, a location is encoded as, for example, the offset from the stack top of the slot allocated to a variable, or the distance between the source and destination of a jump. For the following reasons, this conversion cannot be performed until the front-end phases finish the temporal aspect conversion.

Other transformations that depend on details of the target machine are also performed in the back-end phases.

The back-end phases in the SML# compiler perform the following translations.

The assemble translation maps atomic operations in the A-normal form to machine instructions and performs the layout calculation. The serialize translation generates a binary sequence that encodes the machine instructions.
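
The two translations can be sketched together in Python. The instruction names, the opcode numbers and the fixed five-byte encoding below are hypothetical and chosen only to keep the example short; they are not the SML# instruction set. The sketch shows symbolic labels being replaced by relative jump distances and the result being packed into bytes.

```python
import struct

# Hypothetical opcode numbering for this sketch.
OPCODES = {"load_const": 1, "jump_if_zero": 2, "jump": 3, "halt": 4}

def assemble(symbolic):
    """Replace label operands by relative distances and drop label markers."""
    # First pass: record the instruction index that each label refers to.
    addresses, index = {}, 0
    for item in symbolic:
        if item[0] == "label":
            addresses[item[1]] = index
        else:
            index += 1
    # Second pass: emit instructions with jump targets as relative distances.
    instructions = []
    for item in symbolic:
        if item[0] == "label":
            continue
        op, operand = item
        if op in ("jump", "jump_if_zero"):
            operand = addresses[operand] - len(instructions)
        instructions.append((op, operand))
    return instructions

def serialize(instructions):
    # One unsigned byte for the opcode, then a little-endian signed
    # 32-bit operand, with no padding between them.
    return b"".join(
        struct.pack("<Bi", OPCODES[op], operand) for op, operand in instructions
    )

program = [
    ("label", "loop"),
    ("load_const", 0),
    ("jump_if_zero", "done"),
    ("jump", "loop"),
    ("label", "done"),
    ("halt", 0),
]
instructions = assemble(program)
image = serialize(instructions)
```

Two passes are used because a forward jump such as the one to "done" refers to a position that is not yet known when the jump is first encountered.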

For implementation reasons, these translations are divided into the following modules.


8, Auxiliary phases

The phases mentioned above are indispensable for implementing compilation. Other phases perform additional functions that give the compiler practical utility.

8.1, Optimization

As mentioned earlier, programming language research has worked to increase the distance between the source language and the destination language, that is, the machine language.

On the other hand, the requirement for efficient compilation and execution imposes a constraint that keeps them close to each other. A practical programming language should strike a balance between abstraction that matches its target domain and efficiency in compilation and execution. Another direction of programming language research is therefore to develop compilation algorithms efficient enough to relax that constraint.

                | <------- language design -----> |
source language |                                 | machine language
                | --> language implementation <-- |

In the SML# compiler implementation, the phases mentioned above are designed with efficiency in mind. Furthermore, there are phases dedicated specifically to optimization.

More optimization phases are expected to be implemented in the future.

8.2, Other auxiliary phases

The PrinterGenerator inserts into the program code that prints the evaluation result in a human-readable format.

The TypeCheck checks the correctness of the intermediate representations generated by internal phases.


9, Interface modules

A compiler has to interact with external data stores to perform its functionality. It reads a character sequence from a data store, and emits a byte sequence into another store.

For a traditional compiler, it was sufficient to read source code from a file and write machine code to another file. But modern compilers have to be invoked in various situations and must access various data stores.

Compilation is just one part of the whole development process. There is a trend toward seamless intermixing of the phases of the software life cycle. Compilation may occur during coding, testing, deployment, and execution. The compiler may be invoked by a terminal user, by an IDE, by an application server and so on. Source code may come from a file, from memory, from another process and so on. Compiled code may go to similar destinations.

The implementation of a compiler should be designed to deal with such requirements.

The SML# compiler implementation contains interface modules. They mediate between the compiler modules and the external stores.

Channel modules are abstractions of I/O operations on various data stores: memory, file, socket, terminal.
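
The idea of a channel abstraction can be sketched in Python as two interchangeable classes. The class names, the receive and send methods and the placeholder fake_compile are all hypothetical; they illustrate the design rather than the interface of the SML# Channel modules.

```python
import os
import tempfile

class MemoryChannel:
    """Channel backed by an in-memory buffer."""
    def __init__(self, data=b""):
        self._buffer = bytearray(data)

    def receive(self):
        data = bytes(self._buffer)
        self._buffer.clear()
        return data

    def send(self, data):
        self._buffer.extend(data)

class FileChannel:
    """Channel backed by a file on disk."""
    def __init__(self, path):
        self._path = path

    def receive(self):
        with open(self._path, "rb") as handle:
            return handle.read()

    def send(self, data):
        with open(self._path, "ab") as handle:
            handle.write(data)

def run_compiler(source, destination, compile_fn):
    # The compiler core sees only receive() and send(), never the data store.
    destination.send(compile_fn(source.receive()))

def fake_compile(data):
    # Placeholder for the real compilation pipeline: reverses the bytes.
    return data[::-1]

# Same compiler core, two different destinations.
with tempfile.TemporaryDirectory() as directory:
    out_path = os.path.join(directory, "out.bin")
    run_compiler(MemoryChannel(b"val x = 1"), FileChannel(out_path), fake_compile)
    file_result = FileChannel(out_path).receive()

memory_out = MemoryChannel()
run_compiler(MemoryChannel(b"val x = 1"), memory_out, fake_compile)
memory_result = memory_out.receive()
```

Because run_compiler depends only on the two methods, a socket or terminal channel could be added without changing the compilation code.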

Session modules are abstractions of interaction with the execution runtime.

The Top module coordinates them and the compilation modules described above into a complete compiler, as shown in the following diagram. It also implements interaction with the user on a terminal.

  user <--(Channel)--> Top <--> Session <--(Channel)--> runtime
                        |
                        |
               Parser, ..., InstructionSerializer


A.1, Table of intermediate languages

Language                         Structure Name                      Directory
Abstract Syntax Trees            Absyn                               compiler/absyn
Untyped Pattern Expressions (1)  PatternCalc                         compile/patterncalc
Untyped Pattern Expressions (2)  PatternCalcWithTvars                compile/patterncalcWithTvars
Typed Pattern Expressions (1)    TypedCalc                           compile/typedcalc
Typed Pattern Expressions (2)    TypedFlatCalc                       compile/typedflatcalc
Polymorphic Record Calculus      RecordCalc                          compile/recordcalc
Typed Lambda Calculus            TypedLambda                         compile/typedlambda
Typed Bitmap-Passing Calculus    BUCCalc                             compile/buccalc
Typed A-Normal Form              ANormal                             compile/anormal
SML# Symbolic Instruction Set    SymbolicInstructions                compile/symbolicInstructions
SML# Instruction                 Instructions                        instructions/compile/instructions
SML# Bytecode Language           Instructions.cc (definitions in C)  instructions/runtime/instructions

A.2, Table of compilation steps

Step  Description                       Input                 Output
1.    Parsing                           the source program    Absyn
2.    Elaboration                       Absyn                 PatternCalc
3.    function definition optimization  PatternCalc           PatternCalc
4.    user type variable evaluation     PatternCalc           PatternCalcWithTvars
5.    type inference                    PatternCalcWithTvars  TypedCalc
6.    module compilation                TypedCalc             TypedFlatCalc
7.    match compilation                 TypedFlatCalc         RecordCalc
8.    record compilation                RecordCalc            TypedLambda
9.    bitmap-passing compilation        TypedLambda           BUCCalc
10.   anormalization                    BUCCalc               ANormal
11.   linearizer                        ANormal               SymbolicInstructions
12.   assemble                          SymbolicInstructions  Instructions
13.   serialize                         Instructions          Instructions (in C)
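
The compilation steps form a chain in which the output of each step is the input of the next. The following Python sketch records the steps as data and checks that property. The structure names are spelled as in table A.1; the list STEPS and the function broken_links are constructs of this sketch, not part of the compiler.

```python
# (step description, input language, output language)
STEPS = [
    ("parsing", "source program", "Absyn"),
    ("elaboration", "Absyn", "PatternCalc"),
    ("function definition optimization", "PatternCalc", "PatternCalc"),
    ("user type variable evaluation", "PatternCalc", "PatternCalcWithTvars"),
    ("type inference", "PatternCalcWithTvars", "TypedCalc"),
    ("module compilation", "TypedCalc", "TypedFlatCalc"),
    ("match compilation", "TypedFlatCalc", "RecordCalc"),
    ("record compilation", "RecordCalc", "TypedLambda"),
    ("bitmap-passing compilation", "TypedLambda", "BUCCalc"),
    ("anormalization", "BUCCalc", "ANormal"),
    ("linearizer", "ANormal", "SymbolicInstructions"),
    ("assemble", "SymbolicInstructions", "Instructions"),
    ("serialize", "Instructions", "Instructions (in C)"),
]

def broken_links(steps):
    """Return pairs of adjacent steps whose output and input do not match."""
    return [
        (current[0], following[0])
        for current, following in zip(steps, steps[1:])
        if current[2] != following[1]
    ]

for name, source, target in STEPS:
    print(f"{source} --({name})--> {target}")
print("broken links:", broken_links(STEPS))
```

A check of this kind is a cheap way to notice a misspelled or missing intermediate language when a step is added or reordered.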