How to perform Metaprogramming in Scala

Thomas Dickson

7 minute read



This post describes my solution to a contrived problem: generating the case class needed to load a dataset in Spark, given a file describing the schema. This problem belongs to the domain of metaprogramming, so I’ll give a short overview of the problems metaprogramming is commonly used to solve and then show the code I’ve written to do it in Scala.

Code generation shows up in many places, from data engineering to building website pages, or maybe you just like being able to easily write your own tools to solve problems.

I wanted to learn more about metaprogramming in Scala as I’ve seen how useful it can be with other languages. My work intersects data engineering, machine learning and DevOps - one, more contrived, problem I’ve encountered is when I’m working with a new dataset in Spark but I don’t know what the schema is.

As mentioned, the problem domain is known as metaprogramming. Here is a handy definition from Wikipedia:

Metaprogramming is a programming technique in which computer programs have the ability to treat other programs as their data. It means that a program can be designed to read, generate, analyze or transform other programs, and even modify itself while running.

Scala has the Scalameta library for metaprogramming. Like most Scala tooling it encourages a very “hands on” approach to learning: the API docs are thorough but there aren’t many higher-level docs available. Scalameta is built around the principle that code can be translated into an Abstract Syntax Tree (AST), which can then be manipulated programmatically. AST Explorer lets you see in real time how Scala code is translated into an AST.
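To make the code-to-AST idea concrete, here is a minimal sketch that parses a one-line snippet and inspects the resulting tree. It assumes the Scalameta library (the `org.scalameta` `scalameta` artifact) is on the classpath; the snippet being parsed and the `AstDemo` object are purely illustrative.

```scala
import scala.meta._

object AstDemo {
  // Parse a snippet of Scala source into a syntax tree.
  // `parse[Stat]` returns a Parsed result; `.get` throws if the code is invalid.
  val tree: Stat = "val answer = 42".parse[Stat].get

  def main(args: Array[String]): Unit = {
    // `structure` shows the tree's node constructors,
    // `syntax` prints the tree back as Scala code.
    println(tree.structure)
    println(tree.syntax)
  }
}
```

The exact text printed by `structure` varies between Scalameta versions, so treat it as a debugging aid rather than a stable format.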

If you’ve worked with Scala before then you’ll be familiar with Scalafmt, which is also built on top of Scalameta. Scala can also be used to write Domain Specific Languages (DSLs) for solving specific classes of problems. I’m making that connection here because coupling metaprogramming with DSLs significantly increases the number of problems that can be solved.

Anyway, let’s start talking about the problem that I set myself. I wanted to generate a case class to mirror the schema for a given data source. Case Classes are important because they are immutable data structures written to be specific to a problem or domain. Often they need to be created to describe a specific data structure at the beginning of writing an ETL pipeline, so it’s important to automate the process a little bit as well as learn a bit more about how Scala parses the data structure itself.

Here is an overview of the solution process:

  1. Accept a CSV file that defines the data source schema as name,type rows, where type is usually String but needs to be one of the Scala types.
  2. Generate a valid Scala case class in a file in the same directory as the data file.
  3. Apply formatting to the Scala case class.
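The first step can be sketched in plain Scala as follows. This is a hedged illustration, not the code from the original script: it assumes the file has no header row, that each line is exactly `name,type`, and that the set of allowed types is the small whitelist shown. `SchemaParser`, `parseLine`, `parseSchema` and `allowedTypes` are hypothetical names; only `colName` and `colType` come from the post.

```scala
// One column of the schema: its name and its Scala type, both as strings.
final case class ColumnSpecification(colName: String, colType: String)

object SchemaParser {
  // Assumed whitelist of Scala types allowed in the schema file.
  private val allowedTypes =
    Set("String", "Int", "Long", "Double", "Float", "Boolean")

  // Turn one "name,type" line into a ColumnSpecification, or say why it failed.
  def parseLine(line: String): Either[String, ColumnSpecification] =
    line.split(",").map(_.trim) match {
      case Array(name, tpe) if name.nonEmpty && allowedTypes.contains(tpe) =>
        Right(ColumnSpecification(name, tpe))
      case Array(_, tpe) if !allowedTypes.contains(tpe) =>
        Left(s"Unsupported type '$tpe' in line: $line")
      case _ =>
        Left(s"Expected 'name,type' but got: $line")
    }

  // Parse every non-blank line, stopping at the first malformed one.
  def parseSchema(lines: Seq[String]): Either[String, Seq[ColumnSpecification]] = {
    val empty: Either[String, Seq[ColumnSpecification]] = Right(Vector.empty)
    lines.filter(_.trim.nonEmpty).foldLeft(empty) { (acc, line) =>
      for {
        cols <- acc
        col  <- parseLine(line)
      } yield cols :+ col
    }
  }
}
```

Returning an `Either` keeps malformed schema lines visible to the caller instead of silently dropping them, which matters when the output is going to be compiled as code.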

Here’s another way of describing this problem:

graph LR; A(Data file)-->B(Parse column/row headings and data); B-->D(Generate case class); D-->E(Apply scalafmt); E-->F(Use case class in code!)

A core principle when using Scala to solve problems is to lean on its expressive type system, specifically by creating our own types in the form of case classes. The diagram below shows how I’ve chosen to base my solution around three different data structures.

graph LR; A(Column Specification)-->B(Case Class); B-->C(Scala file)

The ColumnSpecification type holds the information that relates a given column name to its type. The somewhat confusingly named CaseClass type associates the name of the case class being created with a list of ColumnSpecification values. Finally, the ScalaFile type takes the Scala code generated with Scalameta from the CaseClass type as a Source and associates it with the name and path of the Scala file being generated.
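As a rough sketch, the three types described above might look like this. The field names `colName`, `colType`, `name` and `fields` appear in the code later in this post; the `ScalaFile` field names (`fileName`, `path`, `source`) are my assumption for illustration, and the example needs Scalameta on the classpath for `Source`.

```scala
import scala.meta.Source

// One column of the schema: its name and its Scala type, both as strings.
final case class ColumnSpecification(colName: String, colType: String)

// The case class to generate: its name plus the columns that become its fields.
final case class CaseClass(name: String, fields: Seq[ColumnSpecification])

// The generated code together with where it should be written.
// `fileName`, `path` and `source` are hypothetical field names.
final case class ScalaFile(fileName: String, path: String, source: Source)
```

Keeping the schema description (`CaseClass`) separate from the generated output (`ScalaFile`) means the parsing and code generation steps can be tested independently.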

The next stage is to turn the types holding the schema information into an AST that can then be rendered as actual Scala code. Each type can be represented as a Term type; here’s an example of how the data from a ColumnSpecification type is turned into a Term.Param parameter:

Term.Param(
      Nil,
      Term.Name(field.colName),
      Some(Type.Name(field.colType)),
      None
    )
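One way to check a hand-built node like this is to print it back as code. The sketch below builds a single parameter from hard-coded example values (`id` and `Int` are illustrative, as is the `ParamDemo` object) and assumes a Scalameta version whose `Term.Param` constructor takes the same four arguments used above.

```scala
import scala.meta._

object ParamDemo {
  // Build a parameter by hand, mirroring the constructor call above.
  val param: Term.Param = Term.Param(
    Nil,                    // no modifiers
    Term.Name("id"),        // parameter name
    Some(Type.Name("Int")), // declared type
    None                    // no default value
  )

  def main(args: Array[String]): Unit =
    // Render the tree node back into Scala source text.
    println(param.syntax)
}
```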

I used a for-comprehension to create a Seq of Term.Params. For-comprehensions are syntactic sugar for performing computations on sequences; read more in the docs here. As with many topics in functional programming, this touches on deeper ideas, such as why it’s handy to turn data structures into monads (this tutorial using Haskell might help) 1.

private def createFields(fields: Seq[ColumnSpecification]): Seq[Term.Param] =
    for {
        field <- fields
    } yield Term.Param(
        Nil,
        Term.Name(field.colName),
        Some(Type.Name(field.colType)),
        None
    )
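To show what the for-comprehension in `createFields` amounts to, here is a small library-free sketch. `Param` is a hypothetical plain-Scala stand-in for Scalameta's `Term.Param`, used only so the example runs without extra dependencies.

```scala
// Plain-Scala stand-ins so the example needs no extra libraries.
final case class ColumnSpecification(colName: String, colType: String)
final case class Param(name: String, tpe: String)

object ForComprehensionDemo {
  val fields: Seq[ColumnSpecification] = Seq(
    ColumnSpecification("id", "Int"),
    ColumnSpecification("name", "String")
  )

  // A single generator followed by `yield` ...
  val viaFor: Seq[Param] =
    for { field <- fields } yield Param(field.colName, field.colType)

  // ... is rewritten by the compiler into a call to `map`.
  val viaMap: Seq[Param] =
    fields.map(field => Param(field.colName, field.colType))

  // With two generators, the outer one becomes `flatMap` and the inner one `map`.
  val pairs: Seq[(String, String)] =
    for { a <- Seq("x", "y"); b <- Seq("1", "2") } yield (a, b)

  val pairsDesugared: Seq[(String, String)] =
    Seq("x", "y").flatMap(a => Seq("1", "2").map(b => (a, b)))
}
```

For a single generator like the one in `createFields`, calling `map` directly is equivalent; the for-comprehension form pays off once more generators or filters are added.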

The final method I implemented inserts the fields into the overall AST: it builds the case class from those fields and places it inside the package used to hold the generated code.

  private def generateCode(caseClass: CaseClass): Source = {
    val packageName =
      Term.Select(Term.Name("metaprogramming"), Term.Name("generatedCode"))
    val name = Type.Name(caseClass.name)
    val fields = List(createFields(caseClass.fields).toList)
    val parameters = Ctor.Primary(
      Nil,
      Name(""),
      fields
    )
    val template = Template(Nil, Nil, Self(Name(""), None), Nil, Nil)
    Source(
      List(
        Pkg(
          packageName,
          List(
            Defn.Class(
              List(Mod.Final(), Mod.Case()),
              name,
              Nil,
              parameters,
              template
            )
          )
        )
      )
    )
  }
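Once the Source has been generated, its text still has to land in a file next to the data file (step 2 of the overview). The sketch below shows one way to do that with `java.nio`. It is an assumption on my part rather than the code from the full script: `WriteGenerated` and `writeNextTo` are hypothetical names, and `code` is assumed to be the generated source rendered as text, for example by calling `.syntax` on the Source.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}

object WriteGenerated {
  // Write `code` to `<className>.scala` in the same directory as `dataFile`
  // and return the path that was written.
  def writeNextTo(dataFile: Path, className: String, code: String): Path = {
    // resolveSibling swaps the file name while keeping the parent directory.
    val target = dataFile.toAbsolutePath.resolveSibling(s"$className.scala")
    Files.write(target, code.getBytes(StandardCharsets.UTF_8))
  }
}
```

By default `Files.write` creates the file or truncates an existing one, so rerunning the generator overwrites the previous output.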

At this point I’m just going to point you to the entire script to see how these different methods tie together.

So, to summarise, I hope this post has been able to:

  1. Introduce the concept of metaprogramming and ASTs.
  2. Illustrate the logic for generating code in Scala.
  3. Provide a solution to an arbitrary problem: generating a case class based on the format provided in a file.

  1. For this specific example we can use a for-comprehension because Seq is monadic, which means it implements a pure and a flatMap method. I will have to point you back to Google to read more about the topic of monads.