Compiler Architecture: Efficient Module Resolution for Context-Sensitive Grammars with Cyclic Dependencies

The JS++ programming language and compiler present unique engineering challenges. In my previous “Under the Hood” post, I discussed GLR parsing and disambiguation. Today, I will be discussing JS++ importing and the compiler architecture.

The JS++ import system efficiently supports features such as being decoupled from the file system and circular imports on an ambiguous context-sensitive grammar (CSG).

An example of tight coupling with the file system would be Java or ActionScript. If you define a class named utils.myapp.myclass, the file and directory structure would need to reflect this. This tight coupling presents problems during refactoring. If you change the name of a class or package/namespace, you would have to rename everything in the file system too. CommonJS, originally used by Node.js and RequireJS, suffered from a similar weakness.

When the import system is tightly coupled with the file system, resolving imports, modules, and files are an O(1) operation: just open the file based on its fully-qualified name. In JS++, resolving an import risks being an O(N2) operation where, for each import on each file, every other input file needs to be opened, parsed, and searched to find if a module was declared in the input file (or if the module was declared at all… in any input files).

In addition, JS++ faced an additional engineering challenge by enabling circular imports on an ambiguous grammar. Consider the following code:

Foo < bar > baz;

Is the above code a variable declaration to declare a variable named baz with type Foo<bar> or is it a comparison expression statement? Java and C# solve this ambiguity by simply restricting such “expression statements” from being statements to enable generic programming. However, JS++ inherited the JavaScript syntax, and it would not be ideal to have to break existing code and valid syntax that users already know.

The above code can be disambiguated if we know whether Foo is a generic class or not. If Foo is a generic class, the code is a variable declaration; otherwise, it is a comparison expression. This sounds simple in theory, but – in practice – what if Foo is declared in another module? Sure, you can import all modules first (and that comes with deeper questions such as what level of analysis or code generation do you want to perform prior to importing). However, the reality is not so simple. What if Foo is declared in another module, and we want to import it, but the other module has a circular dependency to the current file? Which one do we import first? How do we do all of this efficiently so that we don’t basically process a file more than once?

The JS++ compiler does not require you to specify the import and linking order during compilation. You can store all your *.jspp files in one directory and compile like so:

$ js++ .

There was no existing literature available. No existing language’s import system faced all the challenges that we faced (such as inheriting unorthodox semantics from JavaScript like function hoisting) or had the same design as we did. Internally, from an engineering perspective, we also wanted to reduce or eliminate “branching logic” in the compiler logic.

Thus, without the user specifying the import order, with circular imports on an ambiguous grammar being allowed, with no dependency on the file/folder structure, having to be compatible with JavaScript, and having to be as fast as possible (preferably without caching), how does the JS++ compiler do it?

Preface: The JS++ Import System

Users of JS++ can admire the ease of use. Unlike JavaScript, the JS++ import system is very simple:

A.jspp

module Utils.Strings
{
    bool isEmpty(string s) {
        return s == "";
    }
}

B.jspp

import Utils.Strings;

isEmpty("");    // true
isEmpty("abc"); // false

The JS++ import system has several advantages to JavaScript’s:

Syntax and Brevity

First of all, you may immediately notice the conciseness. Prior to ES6, JavaScript had no modules and you had to use ad-hoc import systems. In ES6, when you define a module, you must explicitly declare the items you want to export. In contrast, JS++ will automatically “export” all your modules. The next thing you will notice is that JS++ also automatically imports all module members. In JavaScript, you would need to manually specify the specific members you want imported (or use a wildcard).

Efficiency

While JS++ may automatically “export” everything and subsequently “import” everything, it is very efficient. In a process known as dead code elimination, JS++ will “eliminate” unused code from the final output. For example, if your module defines three functions A, B, and C and you only use function A, then B and C will not be compiled in the final output. In addition, if you import a module but never use anything from the module, the entire module will simply not be generated.

Simplicity

Another benefit of the JS++ module and import system is simplicity. Consider all of the “overloads” of the ECMAScript 6 import keyword:

import * as myModule from 'my-module';
import {myMember} from 'my-module';
import {foo, bar} from 'my-module';
import {reallyReallyLongModuleMemberName as shortName}
  from 'my-module';
import {
  reallyReallyLongModuleMemberName as shortName,
  anotherLongModuleName as short
} from 'my-module';
import 'my-module';
import myDefault from 'my-module';
import myDefault, * as myModule from 'my-module';
// myModule used as a namespace
import myDefault, {foo, bar} from 'my-module';
// specific, named imports

Source: MDN

All of the above syntaxes do something different in JavaScript. In contrast, JS++ has only one import statement syntax:

import moduleName;

Decoupling

I once received a user question about why we don’t specify file/directory structure and naming conventions. The reason is because we give you absolute liberty here, and this actually comes from the software architecture. If you want your code to reflect the file/folder structure (a la Java), you’re free to do that. If you want all your code in one folder, you’re free to do that. However, we do specify a naming convention for modules in our documentation here.

Step 1. Parsing and Symbol Table Construction

The JS++ project is composed of several projects: the compiler, the parser, and so on. Prior to adding the import and module keywords in JS++ 0.4.2, the only project within JS++ using a symbol table was the compiler (for type checking). We began by decoupling the symbol table and refactoring it into a separate project as both the parser and compiler would depend on the symbol table.

Beginning from JS++ 0.4.2, the symbol table construction starts at parse time. Since the JS++ compiler allows you to use an entire directory as input (and it will recursively find all *.jspp and *.jpp files), we start by reading and parsing all input files. While we are parsing the input files, we build the symbol table and mark the symbols that cannot be currently resolved using a temporary meta-symbol such as <RELOCATE> or <UNRESOLVED>.

This process especially applies to “partially-qualified” names in JS++. This is what partial qualification looks like:

import Utils.Strings;

// Partial Qualification:
isEmpty("");    // true
isEmpty("abc"); // false

// Full Qualification:
Utils.Strings.isEmpty("");    // true
Utils.Strings.isEmpty("abc"); // false

In the above code, the identifier isEmpty can come from anywhere. It can even come from a module requiring circular imports. However, rather than processing the imports at this time, we simply mark the isEmpty identifier as unresolved for now. In addition, this allows us to deal with strange semantics that we inherited from JavaScript such as hoisting. In the case of hoisting, we might mark an identifier as unresolved even if it comes from the same file but gets declared later.

At this stage, the types of operations that may cause a symbol table insertion are “declarations” such as:

  1. Variable Declaration
  2. Function Declaration
  3. Function Parameter
  4. Module Declaration
  5. Label Statements (e.g. when labelling ‘for’ loops)
  6. Catch parameter (from try-catch blocks)
  7. External Statement (external foo;)
  8. Class and Interface Declarations (but this was not done for v.0.4.2)

At this point, no type checking or analysis has been performed. We only identified whether a symbol has been resolved or unresolved.

Step 2. Symbol Resolution

Once we have the basic symbol table built, we need to find all the symbols that need to be resolved and perform symbol resolution. However, we cannot haphazardly resolve symbols based on some “global” symbol table. This will lead to branching logic.

Instead, each resolution is file- and import-sensitive. In other words, we don’t process a “global” symbol table and resolve symbols. We process at the file level. We go through a file, and start by importing all modules based on import statements. Since JS++ imports by identifiers, you can see how finding an associated module or file can now be efficiently done without forcing the user into a Java-like import system that is tightly coupled to the file system.

Keep in mind that the same module “identifier” can be used across multiple files in JS++ because modules are not overwritten – they are “extended.” Nevertheless, it’s still just a symbol table lookup. (While building the symbol table, we can just “extend” the module’s sub-members each time a member is declared in a different file and keep information about which file each individual member is located to use for error reporting.)

Once the symbol is “found”, we replace the type in the symbol table from some unknown or <UNRESOLVED> type to the type of the symbol, and – in JS++ 0.4.2 – we record the pointer to its AST node for code generation (since we did not have an object code or intermediate language yet). If a symbol could not be resolved or if it could be found in one of the files but it wasn’t explicitly imported, we raise an error for undefined symbol.

Other errors that we can discover at this stage might be ambiguous partial qualification or cross-file, cross-module duplicate member declarations such as:

// A.jspp:
module Foo { void bar() {} } // Error, duplicate in B.jspp

// B.jspp:
module Foo { void bar() {} } // Error, duplicate in A.jspp

Step 3. Type Checking and Semantic Analysis

Since all symbols have now been resolved, we can perform type checking and semantic analysis.

At this stage, we don’t need to perform analysis in any particular file order. Thus, we do not need a topological sort (plus there is the chance for circular dependencies) or anything fancy. The JS++ compiler just uses the user-inputted file order here, but any order is fine.

Step 4. Dead Code Elimination

Once all analysis has completed, we also have information on which functions were called, which modules were used, which classes were instantiated. This allows us to perform dead code elimination and other optimizations.

Conclusion

While the problem itself was difficult and entailed many complexities, I’m happy that we were able to create a simple and efficient software architecture.

It might be possible that Java was tightly-coupled with the directory structure because of the engineering and architectural challenges involved in a more complex import system. For example, it should be clear that a custom parser is required, and a parser generator cannot be used. In a more complex grammar, there may be grammatical ambiguities involved. For example, the JS++ grammar cannot be described with a CFG or one-token lookahead. However, all these complexities were essentially covered and considered in the architecture.

In addition, circular imports need to be handled in a manner that is both time and space efficient. This article demonstrated how JS++ handles this without caching, without running a file through the full compile (or even parse) process more than once, and so on.

The JS++ import system is only one aspect of the JS++ compiler architecture. Even in languages that support circular imports, such as C#, they did not have to deal with JavaScript’s history. If you are interested in the challenges we faced in building on top of the JavaScript language syntax, consider reading my first “Under the Hood” series article about grammatical disambiguation.

Compiler Architecture: GLR Parsing and Disambiguation

Today I want to talk about GLR parsing and the internals of the JS++ parser.

The Problem

In JS++, there is the potential for code to be “ambiguous”. For instance, consider the following example:

Foo<bar> baz;

There are two interpretations for the above statement:

1. A comparison operation: Foo is less than bar is greater than baz.
2. A variable declaration with type Foo<bar> (where Foo is a generic type with type argument bar)

Since JS++ is a superset of the JavaScript programming language, we would naturally expect the first case since JS++ inherited this from JavaScript. However, in order to achieve a concise syntax for generic types, we also need to consider how we can enable the second case.

Continue reading “Compiler Architecture: GLR Parsing and Disambiguation”