Today I want to talk about GLR parsing and the internals of the JS++ parser.
The Problem
In JS++, there is the potential for code to be “ambiguous”. For instance, consider the following example:
Foo<bar> baz;
There are two interpretations for the above statement:
1. A comparison operation: Foo
is less than bar
is greater than baz
.
2. A variable declaration with type Foo<bar>
(where Foo
is a generic type with type argument bar
)
Since JS++ is a superset of the JavaScript programming language, we would naturally expect the first case since JS++ inherited this from JavaScript. However, in order to achieve a concise syntax for generic types, we also need to consider how we can enable the second case.
The Language Design and Why It Matters
Other JavaScript supersets use an ActionScript-like syntax:
var baz : Foo<bar>;
I have mentioned in the past that JavaScript supersets have a preference for this syntax because it is easier to parse. However, I never specified why. I also have a certain disdain for this syntax because it involves more typing. var
is unnecessary and redundant. You also have to unnecessarily reach for the shift key to type a colon. In C-style languages (where JavaScript inherits its syntactic style), variables are traditionally declared with their type. Compare the “colon syntax” with JS++ which uses a more traditional C-style syntax:
var foo : string = "abc"; // Other languages string foo = "abc"; // JS++
In JS++, you type less and there’s no need to reach for the shift key to declare variables. It might seem like a trivial matter, but variable declarations will be used so many times throughout a program that it’s critical to get this detail right. It shows we cared when we designed JS++. The “colon syntax” has the benefit of not creating ambiguity; the drawback is that it degrades the developer experience.
In a nutshell: the colon syntax is easier for the compiler writers and harder for the developers, and the traditional C syntax is harder for the compiler writers and easier for the developers.
Basic Disambiguation
We can achieve basic disambiguation with the following rule: If Foo
is declared as a generic class, we have a variable declaration; otherwise, we have a comparison operation.
Variable Declaration
class Foo<T> {} Foo<bar> baz;
Comparison Expressions
Foo<bar> baz;
Notice the only difference between the two snippets is the presence of a generic class declaration. We use a symbol table (built during parse time) to determine whether Foo
is a generic class. Furthermore, we use “unlimited lookahead” to determine if there is a need for disambiguation. Given these complexities, it should naturally follow that such a grammar cannot be achieved with parser generators like yacc/bison, which only support LL(1) and LALR(1) (one-token lookahead) context-free grammars.
However, this strategy has a drawback: it only works if the class was declared locally in the same file.
Imports and Cycles
JS++ is a multi-file, modular language. Therefore, a generic class can be declared in another file. It might seem like the solution would be obvious: build a symbol table for each individual file and import the symbol tables.
However, JS++ is tricky — not least because its intricate import system but because it allows cyclic imports. If a file A.jspp depends on B.jspp, and B.jspp in turn depends on A.jspp, how will we determine which to import first? Furthermore, we are faced with a troubling dilemma: if a generic class was declared in one of the dependencies, how do we parse? It means Foo<bar> baz;
would be completely ambiguous.
In JS++, the user is not required to specify an import order; the compiler will figure out the order to process the input files by itself; unfortunately, this import order can only be figured out after the input files have been parsed (because JS++ imports based on identifiers and not file paths).
This was almost the straw that broke the camel’s back.
Disambiguation with GLR Parsing
To solve our problem, we introduced GLR parsing. When faced with an ambiguous parse (is it generic or is it a comparison?), we parse both… but only if it’s ambiguous.
No, we do not parse the file twice with both interpretations. We simply parse the specific expression or statement under both interpretations. Additionally, we only parse both interpretations if it’s ambiguous. It’s not always ambiguous because the symbol might be defined locally and we’ll know we have a generic rather than a comparison right away.
In a post-processing step (after all files have been parsed and all symbol tables have been built), we can disambiguate whether it’s a variable declaration or a comparison expression. For parsers that build the symbol table during parsing, it may not be possible to “complete” the symbol table until the post-processing step. This is because it’s not possible to determine if c
in a<b> c
should be added to the symbol table until disambiguation occurs.
Advantages and Disadvantages
To ensure we design correctly before beginning implementation, we have consultants that review our architecture and designs. The consultants we hire typically have a PhD in computer science, and the particular consultant that we discussed this with is a contributor to the gcc compiler. He described several pros and cons for GLR parsing:
Advantages
- All valid syntax will be parsed eventually
- No additional restrictions on the language syntax
Disadvantages
- Performance penalties. Not only will you be required to parse both interpretations of the initial statement but also all expressions that depend on that statement. This can lead to “a whole wood of parse trees”.
- Certain early optimizations in the parse tree require a second run because they are not decidable for both alternatives.
Fortunately, in the case of JS++, the disadvantages could not occur and we were able to take full advantage of GLR parsing.
Optimization
In JS++’s case, we were able to optimize the GLR parsing so that it would not need to be performed if it was unnecessary. By leveraging unlimited lookahead, we can determine quite a bit:
Foo < bar && baz; // No GLR necessary Foo < bar > baz; // Yes Foo < int > foo; // No // etc.
Furthermore, in JS++, unlimited lookahead and GLR parsing would only need to be performed until the semicolon.
Bonus Section: Type Casts, Instantiations, and Function Calls
If you are only interested in GLR parsing in general, this section may be redundant and you can safely skip it; however, if you want to better understand JS++ under the hood, this can be useful. There is another case in JS++ where GLR parsing may be required. I’m a traditionalist, and I like the C syntax. As a consequence, we use the C-style syntax for explicit type casts (rather than an as
keyword and the like):
int foo = (int) bar;
Once again, this introduces ambiguities:
(foo)(bar);
In the above expression, are we calling a function foo
with argument bar
, or are we casting bar
to type foo
? Yet again, inheriting JavaScript’s syntax introduces complexity.
Fortunately, we are able to resolve the ambiguity in a similar way to variable declarations vs. comparison expressions. If foo
is a type, it’s a type cast. In all other cases, it’s a function call.
This should have been satisfactory. However, we were concerned with instantiation too. Since JavaScript allows instantiation without arguments, JS++ inherited this:
new Foo; // Instantiate 'Foo' with no arguments
This can create an entire array of complexities:
new a; // Instantiation new (a); // Instantiation new a(); // Instantiation new a(b); // Instantiation new a(b, c); // Instantiation new (a)(); // Instantiation new (a)(b); // Ambiguous new (a)(b, c); // Ambiguous new (a)()(); // Instantiation. Valid code: function a(){ return function() {}; } new a()(); new (a)(b)(); // Ambiguous new (a)(b, c)(); // Ambiguous new (a)(b)(c); // Ambiguous new (a)(b, c)(d, e); // Ambiguous new (a)()(b); // Instantiation new (); // Parse error new ()a; // Parse error new ()(a); // Parse error new ()(a)(b); // Parse error
The ambiguities in the above examples were much more difficult to resolve. We couldn’t disambiguate based on whether or not a “type” was present because instantiation (a ‘new’ expression) requires a type to be specified just as much as a type cast expression would. How we parse also depends on the order of operations.
The simple cop out would be to require all instantiations to specify explicit arguments. However, I felt this was too restrictive. Since JavaScript’s “native global objects” can be instantiated, code like this is all too common:
var d = new Date;
This introduces a tricky predicament. We didn’t want to prevent developers with a JavaScript code base from using JS++ without running a converter. There are corner cases where conversion is required, but if the syntax is in popular usage, we couldn’t (and shouldn’t) make a breaking change. However, I noticed a strange quirk in JavaScript:
function a(){ } new +a; // Uncaught SyntaxError: Unexpected token +
This happened for all the unary operators. Our lead engineer promptly looked up the ECMAScript specification, and it appears unary expressions are not allowed inside ‘new’ expressions. We were in luck. Problem solved. In JS++, as specified in the documentation, a type cast expression is a unary expression. Since we already disambiguated function calls vs. type casts, we had nothing else to worry about.
… And that’s how the type cast syntax came to JS++.
Conclusion
If done right, the GLR parsing strategy can be effective for introducing cleaner syntax. GLR parsing has its drawbacks and needs to be carefully considered with the rest of the language design and parser/compiler architecture. We hope this in-depth analysis provides greater insight to the internals of JS++ and enables you to better understand the nuances of the language.