Blog
A reference project implementation for using S-Expression as DSL with a scala backend
If you are still curious and want to read a lot then proceed…
As developers our job is use to write code to solve business problems. General purpose programming languages like C++, Java, C#, Scala etc are our usual goto tools for achieving these solutions. As the name suggests these languages are general purpose hence too powerful with complex syntax and semantics for general business people to understand and hence they use us coders to get the dirty job done. In some cases the business users may want to drive system/solution behavior with something less complex than a programming language but more flexible than UI.
In such cases mostly we resort to exposing a subset of programming language with simpler syntax and semantics to give them appropriate flexibility and control without giving them the ability to shoot themselves on the foot as we developers have been doing for ages. Since this subset of a PL will be more specific to the problem domain these kind of languages are called Domain Specific languages.
Many times we write systems for people who are intermediate users for the end consumers e.g. business workflow engines will have implementors who understand the actual customer’s business processes and configure appropriate workflows while the developers of the workflow engine will expose DSLs/UI based tools to configure workflows. In such a system the implementors are the intermediate users between the actual consumer of the workflows and developers of the workflow engine.
There are two type of DSLs:
The prime motivation of this post is to illustrate/guide the reader on how to create a simple external DSL which can be exposed to the intermediate users depending on the problem domain.
External DSLs are fun to build but with time they tend to become complex and we will end up having a poorly implemented runtime of a stupidly designed programming language. This article says it better than anyone. We need to be sure to think through how the DSL we are writing will evolve and what would be the demands of users in the future and how good are they with restricted functionalities.
Also don’t use JSON/YAML/XML etc for writing DSLs they suck big time! Try one and let me know how it goes…
For simplicity sake (yes that’s how i escape from writing a mini-book, avoiding problems and prevent from exposing myself!) let us take the problem domain of a scientific calculator. Let us expose a DSL which can be used to calculate mathematical equations/formula that operate on numbers. e.g. Take the arithmetic expression below:
100 + (1000 - (200 * (1/20)))
Designing a good DSL is as hard as naming. There are multiple things to consider:
If your DSL is going to have more forms and variations in syntax and semantics then the parser is going to get complex eventually and we will step into the area of computer scientists from computer programmers which will be hard for many of us. Based on my experiences with multiple programming languages and DSLs over a period of time I find S-Expression having a good tradeoff between intuitive, less typing and maintaining a simple parser/grammar. So we will use S-Expresssions as the DSL for this post. The above example expression might transfer into
(+ 100 (- 1000 (* 200 (/ 1 20))))
I can hear you yelling "How the F*** is this simple?"
please wait until I show you the simplicity when it comes to writing a parser and extending this DSL to other domains.
Writing a parser is hard for mortals like us. So there are some good people in the world who have written parser generators that write parsers for us and all we need to do is write programmatic hooks that can handle the parsed tokens. Some of the commonly used parser generators are:
All are great, we will use ANTLR for this post. ANTLR is used by well established frameworks like Hibernate to parse H(S)QL. Yes! SQL is one the best DSLs out there!
Let us look at the grammar of S-Expressions. Every expression follows the uniform syntax of
(operator operand1 operand2 ... operandN)
the operands can again be nested expressions like
(operator1 operand1 (operator2 operand1 operand2...) operand3 ...)
There is a clear advantage of simplicity in following such a notation. For example take unary vs binary operators. In normal programming languages we will have + which is an arithmetic binary operator which adds two numbers and we will have a ++ which is an unary operator which operates on a single number
a + b
a++
a++ + b
Both expressions have different syntax hence different grammars and rules to handle. If parser encounters a + then previous and next token are operands, if it encounters a ++ symbol then previous token is the operand and so on. Consider the same in S-Expression
(+ a b)
(++ a)
(+ (++ a) b)
In this case we have two operators + and ++ with 2 and 1 arguments respectively but the syntax and rules for parsing is the same. Whether it is ++ or + just treat the next token as the operand. Of course parentheses are there, but as I said it is a good trade off between having a simpler parser and people get used to those )))
s with time. Our parser logic can be uniform in this case. Rules are simple.
(
is an expression to be evaluated(
is the operand which is a symbol)
indicates end of an expressionWe will use ANTLR version 4 for this post. ANTLR has the concept of GRAMMAR files which have their own syntax :-) to specify how to parse tokens of your language. So depending on the rules we saw earlier we can write a grammar file after learning the ANTLR specification syntax or being the lazy devs we are we can download the S-Expression ANTLR grammar file.
grammar sexpression;
sexpr
: item* EOF
;
item
: atom
| list
| LPAREN item DOT item RPAREN
;
list
: LPAREN item* RPAREN
;
atom
: STRING
| SYMBOL
| NUMBER
| DOT
;
STRING
: '"' ('\\' . | ~ ('\\' | '"'))* '"'
;
WHITESPACE
: (' ' | '\n' | '\t' | '\r') + -> skip
;
NUMBER
: ('+' | '-')? (DIGIT) + ('.' (DIGIT) +)?
;
SYMBOL
: SYMBOL_START (SYMBOL_START | DIGIT)*
;
LPAREN
: '('
;
RPAREN
: ')'
;
DOT
: '.'
;
fragment SYMBOL_START
: ('a' .. 'z') | ('A' .. 'Z') | '+' | '-' | '*' | '/' | '.'
;
fragment DIGIT
: ('0' .. '9')
;
I will try my best to explain what is done here.
Install ANTLR4 using the instructions 1 or instructions 2 or use homebrew
brew install antlr@4
Download the grammar file for S-Expressions.
In the downloaded folder run below command
antlr4 sexpression.g4
This will generate a list of files required for the parser. Default target is Java we an generate for multiple languages.
sexpression.interp
sexpression.tokens
sexpressionBaseListener.java
sexpressionLexer.interp
sexpressionLexer.java
sexpressionLexer.tokens
sexpressionListener.java
sexpressionParser.java
Those interested can go through all these files I will just touch upon the ones we will have to interact with.
Let us take a look at sexpressionBaseListener.java. This is the base class which we have to extend to build our Abstract Syntax Tree (AST)
We can see the class has empty methods for entry an exit of each node or token type we have defined in the grammar file.
/**
* This class provides an empty implementation of {@link sexpressionListener},
* which can be extended to create a listener which only needs to handle a subset
* of the available methods.
*/
public class sexpressionBaseListener implements sexpressionListener {
@Override public void enterSexpr(sexpressionParser.SexprContext ctx) { }
@Override public void exitSexpr(sexpressionParser.SexprContext ctx) { }
@Override public void enterItem(sexpressionParser.ItemContext ctx) { }
@Override public void exitItem(sexpressionParser.ItemContext ctx) { }
@Override public void enterList(sexpressionParser.ListContext ctx) { }
@Override public void exitList(sexpressionParser.ListContext ctx) { }
@Override public void enterAtom(sexpressionParser.AtomContext ctx) { }
@Override public void exitAtom(sexpressionParser.AtomContext ctx) { }
@Override public void enterEveryRule(ParserRuleContext ctx) { }
@Override public void exitEveryRule(ParserRuleContext ctx) { }
@Override public void visitTerminal(TerminalNode node) { }
@Override public void visitErrorNode(ErrorNode node) { }
}
So we need to have a handler context of some sort may be a stack of operators and operands and then handle each entry an exit to convert the operators into appropriate AST nodes and keep pushing and popping the stack.
There is a simpler alternative for this listener based logic using the famed visitor pattern. Since visitor takes care of handling the context it makes lives easier for us to handle the token to AST node conversion alone.
ANTLR generates visitor classes as well. We can force visitor based parser generation using command below:
antlr4 -visitor sexpression.g4
This will generate additional visitor classes as shown below:
sexpressionVisitor.java
sexpressionBaseVisitor.java
Lets take a look at sexpressionBaseVisitor.java
public class sexpressionBaseVisitor<T> extends AbstractParseTreeVisitor<T> implements sexpressionVisitor<T> {
@Override public T visitSexpr(sexpressionParser.SexprContext ctx) { return visitChildren(ctx); }
@Override public T visitItem(sexpressionParser.ItemContext ctx) { return visitChildren(ctx); }
@Override public T visitList(sexpressionParser.ListContext ctx) { return visitChildren(ctx); }
@Override public T visitAtom(sexpressionParser.AtomContext ctx) { return visitChildren(ctx); }
}
So now we are all good and can start implementing the visiting logic for each type of node.
We will build the backend for this DSL using Scala 2.12
Download and install SBT
Then run the below command:
sbt new scala/scala-seed.g8
This will create a simple scala project after providing relevant details.
Project will have below structure
.
├── README.md
├── build.sbt
├── project
│ ├── Dependencies.scala
│ └── build.properties
└── src
├── main
│ └── scala
│ └── example
│ └── Hello.scala
└── test
└── scala
└── example
└── HelloSpec.scala
Let us cleanup the default code and create packages for our own use
rm -Rf src/main/scala/example
rm -Rf src/test/scala/example
mkdir -p src/main/scala/in/simpledsl
mkdir -p src/main/java/in/simpledsl
Create a folder named antlr4 in our project root and copy the sexpression.g4 file downloaded earlier
The inside the antlr4 folder execute below command to generate parser classes
antlr4 -package in.simpledsl.parser -visitor *.g4
Add ANTLR dependencies to the build file. Edit the Dependencies.scala
to look like below:
import sbt._
object Dependencies {
lazy val scalaTest = "org.scalatest" %% "scalatest" % "3.0.5"
lazy val antlr4Runtime = "org.antlr" % "antlr4-runtime" % "4.7"
lazy val stringTemplate = "org.antlr" % "stringtemplate" % "3.2"
val projectDeps = Seq(antlr4Runtime, stringTemplate, scalaTest % Test)
}
Then add the projectDeps in build.sbt as show below:
import Dependencies._
ThisBuild / scalaVersion := "2.12.8"
ThisBuild / version := "0.1.0-SNAPSHOT"
ThisBuild / organization := "in.simple-dsl"
ThisBuild / organizationName := "simple-dsl"
lazy val root = (project in file("."))
.settings(
name := "simple-dsl",
libraryDependencies ++= projectDeps
)
Then copy all the generated files into java folder
mkdir -p ../src/main/java/in/simpledsl/parser
cp *.* ../src/main/java/in/simpledsl/parser
Invoke below command from project root to build and ensure all dependencies are good
sbt clean compile
Now that we know what type of expressions we are going to support let us start defining the AST nodes
SExpression is going to be the root of all our node types. So let us define a trait SExpression inside the package in.simpledsl.ast
package in.simpledsl.ast
trait SExpression {
def eval(): BigDecimal
}
Since the output of an expression evaluation is always a number in our domain of a scientific calculator we are using BigDecimal as the return type here.
Now let us define the primitive expression which is a number in our domain.
A Number expression just wraps a number i.e. a BigDecimal
within
case class NumExpr(value: BigDecimal) extends SExpression {
override def eval(): BigDecimal = value
}
Now let us define a slightly complex expression that is addition
case class PlusExpr(first: SExpression, second: SExpression) extends SExpression {
override def eval(): BigDecimal = first.eval() + second.eval()
}
Note the recursive nature of the plus expression where it accepts two arguments which are again 2 different s-expressions
Let us proceed to implement the basic arithmetic operators now.
case class MinusExpr(first: SExpression, second: SExpression) extends SExpression {
override def eval(): BigDecimal = first.eval() - second.eval()
}
case class MultiplyExpr(first: SExpression, second: SExpression) extends SExpression {
override def eval(): BigDecimal = first.eval() * second.eval()
}
case class DivideExpr(first: SExpression, second: SExpression) extends SExpression {
override def eval(): BigDecimal = first.eval() / second.eval()
}
Now we have all the basic expressions done let us look at how to implement a custom visitor class to convert the parsed tokens to these SExpression nodes.
package in.simpledsl
import in.simpledsl.ast.SExpression
import in.simpledsl.parser.sexpressionBaseVisitor
class SExpressionVisitor extends sexpressionBaseVisitor[SExpression] {
}
You can notice that we have specified SExpression as the type parameter for the base visitor class which implies the output of the parser is going to be of type SExpression which we can eval() and get the output for.
Now let us start overriding visitor methods, we will start from visitAtom
that it is inner most leaf node.
override def visitAtom(ctx: sexpressionParser.AtomContext): SExpression = {
if(ctx.NUMBER() != null) {
NumExpr(BigDecimal(ctx.NUMBER().getText))
} else if(ctx.SYMBOL() != null) {
val symbol = ctx.SYMBOL().getText
symbol match {
case "+" => PlusExpr(null, null)
case "-" => MinusExpr(null, null)
case "*" => MultiplyExpr(null, null)
case "/" => DivideExpr(null, null)
}
} else if(ctx.STRING() != null) {
NumExpr(BigDecimal(ctx.NUMBER().getText))
} else null
}
If it is a number or a string we convert the same to a simple NumExpr
node, if the operator is a symbol then we return a dummy or blank object of the correspnding complex node from here. We will use this dummy/blank object when we visit a list.
Note: There are ways to avoid using this dummy object but I am not choosing to dicusss them to keep this simple
Now let us implement the next critical operation visitList
the
ctx: sexpressionParser.ListContext
is an array of parsed nodes from the input string excluding the left and right nodes.
override def visitList(ctx: sexpressionParser.ListContext): SExpression = {
val op = visit(ctx.item(0))
op match {
case numExpr: NumExpr => numExpr
case _: PlusExpr => {
val first = visit(ctx.item(1))
val second = visit(ctx.item(2))
PlusExpr(first, second)
}
case _: MinusExpr => {
val first = visit(ctx.item(1))
val second = visit(ctx.item(2))
MinusExpr(first, second)
}
case _: MultiplyExpr => {
val first = visit(ctx.item(1))
val second = visit(ctx.item(2))
MultiplyExpr(first, second)
}
case _: DivideExpr => {
val first = visit(ctx.item(1))
val second = visit(ctx.item(2))
DivideExpr(first, second)
}
}
}
The line val op = visit(ctx.item(0))
will invoke the visitAtom
method we implemented earlier on the first argument of the list expression returning a suitable blank instance of the corresponding SExpression trait implementation. Instead of switching on the string value we are switching on the actual SExpression implementation type to recursively visit the remaining elements as per the specification.
There is one more method that needs to be overridden in the visitor to make it work
override def aggregateResult(aggregate: SExpression, nextResult: SExpression): SExpression = Option(nextResult).getOrElse(aggregate)
Aggregates the results of visiting multiple children of a node, the default implementation returns result of the last child visited. In our case it will be null since LPAREN is the last visited child so we are overriding this to return the non-null value of the two in given order.
Now let us see this in action. Let us create a Application class to run this.
package in.simpledsl
import in.simpledsl.parser.{sexpressionLexer, sexpressionParser}
import org.antlr.v4.runtime.{ANTLRInputStream, CommonTokenStream}
object Application {
def main(args: Array[String]): Unit = {
val expressionString =
"(+ 10 (- 200 2))"
val charStream = new ANTLRInputStream(expressionString)
val lexer = new sexpressionLexer(charStream)
val tokens = new CommonTokenStream(lexer)
val parser = new sexpressionParser(tokens)
val visitor = new SExpressionVisitor()
val expression = visitor.visit(parser.sexpr())
println(s"Expression:: $expression")
val result = expression.eval()
println(s"Result:: $result")
}
}
The above code prints:
Expression:: PlusExpr(NumExpr(10),MinusExpr(NumExpr(200),NumExpr(2)))
Result:: 208
Which is inline with our expectations, the expression can support any level of nesting now.
Now let us take a case of volume of a sphere with radius 10 the formula is 4/3 * pi * (10 ^ 3)
We can very well write this expression using the primitives we have as below:
(* (/ 4 3) (* 3.141 (* (* 10 10) 10)))
Running our code Application class with expression gives below output:
Expression:: MultiplyExpr(DivideExpr(NumExpr(4),NumExpr(3)),MultiplyExpr(NumExpr(3.141),MultiplyExpr(MultiplyExpr(NumExpr(10),NumExpr(10)),NumExpr(10))))
Result:: 4187.999999999999999999999999999999
Which is well and good but a normal user who has to use this formula frequently would find it hard. So if we decide to support sphere volume as a first class construct e.g.
(sphere-volume 10)
We can implement this using below changes:
Let us start with an expression
case class SphereVolumeExpr(radius: SExpression) extends SExpression {
override def eval(): BigDecimal = (BigDecimal(4) / BigDecimal(3)) * BigDecimal(3.141) * radius.eval().pow(3)
}
Next we need to plug this expression implementation in the visitor
override def visitAtom(ctx: sexpressionParser.AtomContext): SExpression = {
if(ctx.NUMBER() != null) {
NumExpr(BigDecimal(ctx.NUMBER().getText))
} else if(ctx.SYMBOL() != null) {
val symbol = ctx.SYMBOL().getText
symbol match {
case "+" => PlusExpr(null, null)
case "-" => MinusExpr(null, null)
case "*" => MultiplyExpr(null, null)
case "/" => DivideExpr(null, null)
case "sphere-volume" => SphereVolumeExpr(null)
case x => throw new UnsupportedOperationException(s"Operator $x is not supported")
}
} else if(ctx.STRING() != null) {
NumExpr(BigDecimal(ctx.NUMBER().getText))
} else null
}
override def visitList(ctx: sexpressionParser.ListContext): SExpression = {
val op = visit(ctx.item(0))
op match {
case numExpr: NumExpr => numExpr
case _: PlusExpr => {
val first = visit(ctx.item(1))
val second = visit(ctx.item(2))
PlusExpr(first, second)
}
case _: MinusExpr => {
val first = visit(ctx.item(1))
val second = visit(ctx.item(2))
MinusExpr(first, second)
}
case _: MultiplyExpr => {
val first = visit(ctx.item(1))
val second = visit(ctx.item(2))
MultiplyExpr(first, second)
}
case _: DivideExpr => {
val first = visit(ctx.item(1))
val second = visit(ctx.item(2))
DivideExpr(first, second)
}
case _: SphereVolumeExpr => {
val radius = visit(ctx.item(1))
SphereVolumeExpr(radius)
}
}
}
Now testing our Application class with (sphere-volume 10)
prints below
Expression:: SphereVolumeExpr(NumExpr(10))
Result:: 4187.999999999999999999999999999999
Testing with (sphere-volume (* (+ 3 2) 2))
also gives same results. So we have implemented a S-Expression parser and executor now! Note that sphere-volume
is a unary operator and +/-/*
etc are binary operators but we did not handle them differently and did not care because in S-Expression world everything is an operator/function with any number of arguments as appropriate. This unification IMHO is the advantage we get by choosing S-Expressions.
S-Expressions are a powerful and simple alternative compared to JSON/YAML/XML like languages used for configuration. The possibilities of using this are endless. We can build DSLs that can support do, let operators that can use variable assignments like below:
(do
(let "radius" (+ 5 (* 3 2)))
(sphere-volume (ref "radius")))
Big data pipelines
Suppose if we are driving a big data pipeline running on hadoop or spark or something else which needs to read data from parquet files, select then transform few fields and write to postgres we can use something like below:
(do
(let "employee-transform"
(transform
(field "employeeId" (select "EmployeeID"))
(field "employeeName"
(concat (select "FirstName")
(select "LastName")))
(field "dob" (parse-date "DateOfBirth"))))
(run
(source (parquet-file "/input/{CLIENT_ID}/datafiles" "employees.parquet"))
(ref "employee-transform")
(target (postgres "{JDBC_URL}" "employees_table"))))
Workflow engines:
Suppose we are designing a workflow engine with a workflow having series of tasks
(do
(task "upload-proposal" (upload-document "proposal.pdf"))
(task "validate-proposal" (validate-document "{proposalId}"))
(task "post-validate-action"
(match "{validationResult}"
(when "FAILED" (send-email "Your proposal has been rejected" "{customerEmail}"))
(otherwise (run-task "upload-to-salesforce"))))
(task "upload-to-salesforce" (salesforce-upload "{proposalId}"))
(run
(workflow "wf001"
(steps "upload-proposal"
"validate-proposal"
"post-validate-action"))))
Each and every operation like validation-document
/salesforce-upload
can be complex implementations but hidden from business or intermediate users. Variables enclosed in {}
can be resolved from a global context and so on.
We can write DSLs that support 100s of operators and constructs using this example, refactoring and organising or improving this is left to the reader’s imagination.
P.S If you are scared of maintaining a parser or feel this way of approaching DSL problem is an abomination then chill and simply use Clojure. You get S-Expressions on steroids, an internal DSL and more
(keep-calm (and (use 'clojure)))
Code used in this post is available as project in Github
Thank you for making it till the end, for any corrections drop an email to r a m [d o t] 6 3 8 3 [a t] g m a i l [d o t] c o m