Saturday, August 25, 2012

JSON serialization with reflection in Scala! Part 2

An Apology

First off, I'd like to apologize to all for taking so long to follow up on my last post. Believe it or not, I only posted that one once this one was written, but... I waited for a couple of bugs to be fixed, and then the API changed, and there were new bugs, and so on. Even as I write this, with 2.10.0-M7 out and a good likelihood of it being followed by RC1, I still uncovered a Blocker bug on reflection!

Anyway, I wanted to avoid teaching people things that would turn out not to be true, and, so, I delayed. I think it's time, now to go ahead with this post. As soon as I finish this preamble, I'll revise the post and the code, and I hope to catch all problems. If you find anything amiss, please notify me that I can fix it the sooner!

A Warning

I'm writing this with 2.10.0 ready but not yet announced, and I've just became aware that, at the moment, reflection is not thread-safe. See issue 6240 for up-to-date information.

Recap

We saw on part 1 that Scala 2.10 will have a reflection library, and baited you into reading all post with the promise of a serializer written with it. That library is divided into layers through the cake pattern, which is used to provide increasingly more detailed levels from very basic reflection all the way to what the compiler needs. Those layers are called Universes.

We saw the base and runtime Universe, and we looked at some code which made use of them. We did not, however, go into any detail about what's inside these layers, and how those things relate to each other, and how to manipulate them.

So get ready, because we are getting into that right now!

What's in a Universe?

Reflection is available at different layers, and layers are Universes, so what's in a Universe? Quite a lot, actually. I invite anyone to look at the inheritance list of a typical universe, such as scala.reflect.runtime.JavaUniverse. Now look at scala.reflect.api.JavaUniverse, which is the type you actually see when you use scala.reflect.runtime.universe, and see how much smaller it is. That will make you feel better about how much stuff is still in there. :-)

A Universe has pretty much every kind of member one can have in Scala: type, trait, class, object, val, lazy val, def... I haven't seen a var, fortunately. But I can't guarantee there isn't one either! There is so much stuff in them that browsing will look like a daunting activity. But there's some logic to what's in there, which I'd like to go into before we get into actual details.

The core of Universe are its type members. The type type members are usually defined as boundaries, which enable the different layers to make them different things while still exposing a rich API. There are also classes and traits declaring those APIs or implementing them.

One pattern you'll see often enough is a type X, with a trait XApi, and possibly a class XBase. The logic is the following: X is used in the declarations of methods, and it is sometimes declared as a type with upper and lower bounds. The traits by the name XApi, often used as upper bounds on X, refer to declarations on scala.reflect.api, the full fledged reflection provided by scala-reflect.jar. XBase refers to things declared at scala.reflect.base, the minimal implementation provided on the standard library.

You'll often see class XExtractor as well. That contains apply and unapply methods which enables its instances to be used to create instances of X, or pattern match them. For each XExtractor, you'll find a val X with type XExtractor. On scala.reflect.api, classes will usually be abstract class. Elsewhere, you may find concrete implementations.

And speaking of value members, beside those extractors I just mentioned, you'll often see YTag, which are constants for ClassTag, and are used to preserve identities of abstract types from erasure, so that pattern matching works correctly. Sometimes you'll also see YTpe, Type constants, which expose core types.

There are few methods. Their main use seems to be creating new stuff, though some of them find stuff through implicit values (most prominently, typeOf). There's also some helper methods, to let you interactively discover what you have.

Basic Concepts: What's X?

Ok, there's a lot of X in there, and, to be honest, I don't know much about them. However, most of them are subtypes of one of a few main concepts, or repetitions following the type/api/base pattern I mentioned above. I'll quote again from Scala Reflect Pre-SIP, linked to their definitions on scala.reflect.api.JavaUniverse:

NameDescription
SymbolTypes representing definitions
TypeTypes representing types
TreeTypes representing abstract syntax trees
NameTypes representing term- and type-names
AnnotationInfoTypes representing annotations
PositionTypes representing positions of tree nodes in source files
FlagSetType representing sets of flags that apply to symbols and definition trees
ConstantType representing compile-time constants

Let's start with the first one. Anything you define in Scala has a Symbol. If you give something a name, then it has a Symbol associated with it. If you didn't give it a name, but you could have, then it has a Symbol.

A Symbol has a Name, a Type, an owner (from which one can extract some other relationships, such as enclosing method, class and package), possibly a companion, and some flags with a bunch of information about it. Of particular relevance inside the compiler, a Symbol is mutable (see method setFlags). However, at run time it is immutable.

I'll skip briefly ahead and talk about Name. Scala has two namespaces: values and types. So a single String can refer to different things according to the namespace. All names, however, are either TermName (values) or TypeName (types), which makes it possible to use a Name anywhere required without having to specify to which namespace that name belongs.

That's why, for example, I did not have to say I was looking for a value member called head, instead of a type member of that name, in the example on part 1.

Names also abstract their representation on a classfile, where they have to be encoded to be considered valid identifiers.

Type is where most of the reflective action is, in my opinion. A type is not, say, Int -- that's just its Symbol (assuming we are talking about scala.Int, and not just the name). A Type is the information about all members that compose that thing: methods, fields, type parameters, nested classes and traits, etc. If a Symbol represents a definition, a Type represents the whole structure of that definition. It is the union of all definitions that compose a class, the description of what goes into a method and what comes out, etc.

Some types are simpler than others, of course. On the example at part 1, we looked at the three types that can represent a method. Other types, such as those for traits and classes, will inevitably be much more complex. We saw that too, as we searched inside the type intMethods received, looking for the methods we wanted.

And, because Type can be so many different things, you'll often have to pattern match against a specific subclass of Type, or its extractor, also as seen on that example. Types are immutable, though note that, inside the compiler, the symbols it contains are mutable.

A Tree is the compiler's internal representation for source code. It is what is usually called an Abstract Syntax Tree, or AST. A Tree, like Type, is composed of many different subclasses, which together represent everything that has semantic meaning on Scala source code.

Like names, a Tree can be of a Type (TypTree -- sic), or of non-types (TermTree). Either way, it is further broken down into subclasses.

Trees have children (other trees), a Position (pos), a Type (tpe) and a Symbol (symbol), though the latter might be null.

Tree is immutable at run time as well, but one can set symbols, types and positions are compile time, with macro's universe. We'll look more at trees when we discuss macros.

The remaining concepts are not important to the task at hand, so I'll not go into them. I'd like to draw attention, however, to StandardDefinitions, which is a superclass of JavaUniverse, and, therefore, directly available through it, DefinitionsApi, which is available through the method definitions on the previous trait. Together, they contain many symbols and types for things that are fundamental to the Scala language.

The Serializer

Let's start writing that serializer, then. First, I'd like to make clear the following:
  • This serialization is not comprehensive.
  • This serialization is not correct (no edge cases treated, like quotes inside strings).
  • I'm making no claims on its speed.
  • THIS IS NOT PRODUCTION CODE.
Now that I trust people will not shoot themselves in the foot and blame me (they can still shoot themselves without blaming me), let me state that this code is actually pretty general, and not particularly difficult to improve to the point of being useful. But there's nothing to be gained by doing so (in this blog), and plenty good JSON serializers out there.

So, what will this serializer handle? I'm planning on the following:
  • Strings (without handling required escaping).
  • AnyVal (without bothering with correcting Char or Unit).
  • Any class whose constructor parameters are preserved as fields.
The last one cover single parameter list case classes which, together with the first two, makes a reasonably comprehensive, if flawed, set of serializable things.

For the remainder of this post, I assume you have this import:

import scala.reflect.runtime.universe._

The signature for my serializer is this:

def serialize[T : TypeTag](obj: T): String

TypeTag deserves an explanation. A TypeTag is a type class which allows us to retrieve a Type. That is, it provides the Type equivalent of the getClass method, but through a type class pattern. Its implicit can be used by typeOf to retrieve the Type.

However, a TypeTag is not just a container for Type. It brings an Universe and a Mirror with it, and the latter is associated with class loaders (a TypeTag allows the creation of a Type on any mirror -- we'll look at mirrors later). That means, while getting a Type from a TypeTag is easy, getting a TypeTag needs more than just a Type.

With all that in mind, let's proceed.

Values and Strings

We start with values and strings, which are the easiest ones to handle if we don't treat special cases. If obj is either a value (a descendant of AnyVal), just output its string representation. If it's a String, output it between double quotes.

We just have to compare types. Given a TypeTag, I can use typeOf on abstract types to get their Type. We an also call typeOf on AnyVal, but there's a pre-defined constants for it which we'll make use of. So, for now, this is what we have:

val objType = typeOf[T]
if (objType <:< typeOf[String]) s""""$obj""""
else if (objType <:< definitions.AnyValTpe) obj.toString
else "Unknown"

I'm going to refactor this code further on, so I'm not writing the full body of the method for now. Next, let's do collections. To handle them, I need to recurse, and that causes a problem... we can get the Type for the type parameter of a collection, but, as we saw, getting a TypeTag from a Type is hard.

Since we don't really need a TypeTag anywhere, we'll create a nested method to recurse on, and pass Type explicitly everywhere.

To gather all collections in one step, we'll compare them against GenTraversable, but that causes another trouble for us. We cannot write typeOf[GenTraversable], without a type parameter, nor do we have a known type parameter to put there. Even if we were to retrieve the type parameter of objType -- which might not exist in first place -- we cannot insert it on our source code. We'll have to do something else instead, to get at that type parameter.

There's a method on Type which help us here. The method baseType on Type return a new Type that corresponds to the symbol passed to it if that Type is an ancestor of the first type. Ok, that's confusing as hell, so let's explain it with code:

scala> typeOf[List[Int]].baseType(typeOf[Seq[Int]].typeSymbol)
res26: reflect.runtime.universe.Type = Seq[Int]

scala> typeOf[List[Int]].baseType(typeOf[Seq[_]].typeSymbol)
res27: reflect.runtime.universe.Type = Seq[Int]

scala> typeOf[List[Int]].baseType(typeOf[Seq[String]].typeSymbol)
res28: reflect.runtime.universe.Type = Seq[Int]

scala> typeOf[Set[Int]].baseType(typeOf[Seq[Int]].typeSymbol) // Set is not <:< Seq
res29: reflect.runtime.universe.Type = <notype>

scala> typeOf[Seq[Int]].baseType(typeOf[List[Int]].typeSymbol) // Seq is not <:< List
res30: reflect.runtime.universe.Type = <notype>

scala> typeOf[Seq[Int]].baseType(typeOf[List[Int]].typeSymbol) == NoType
res31: Boolean = true

scala> typeOf[Seq[String]].typeSymbol // Symbols do not have type parameters
res32: reflect.runtime.universe.Symbol = trait Seq

I hope that's clearer now. You might have expected using baseType with Set and Seq to return Iterable, but that's the "least upper bound", available through the lub helper method on universes.

All we need now is to try to find a baseType, and, if we didn't get a NoType back, we have a collection and we can recurse on it, producing a JSON array. Here's all of it, with the new inner recursion method:

import scala.collection.GenTraversable
def serialize[T : TypeTag](obj: T): String = {
  def recurse(obj: Any, objType: Type): String = {
    if (objType <:< typeOf[String]) s""""$obj""""      
    else if (objType <:< definitions.AnyValTpe) obj.toString     
    else if (objType.baseType(typeOf[GenTraversable[_]].typeSymbol) != NoType) {
      val refinedType = objType.baseType(typeOf[GenTraversable[_]].typeSymbol)
      val TypeRef(_, _, argumentType :: Nil) = refinedType
      val objAsColl = obj.asInstanceOf[GenTraversable[_]]
      val recursedSeq = objAsColl map (recurse(_, argumentType))
      recursedSeq.mkString("[ ", ", ", " ]")
    } else "Unknown"
  }
  recurse(obj, typeOf[T])
}

That's pretty neat as far as it goes, -- and please notice the pattern matching used to extract the type parameter from GenTraversable -- but there are some issues with it. For instance, if the static type of the collection is something like List[Any], the elements we recurse into will be serialized as Unknown. If this seems like a small penalty, consider that any ADT will be represented by its top-most class. So, for instance, if I have a List[Tree] where Tree can be either the case class Node or the case class Leaf, we'll not be able to break it into nodes and leafs: we'll only see Tree.

One possibility to correct that is to go from the object's Class to its Type, taking advantage of the fact that we can always know the former at run time. This might lead to the opposite problem, where we serialize classes that are not meant to be visible. That would happen with List's subclass ::, for example, if the collection check did not catch it. But, since we are learning here, let's do it.

Before we do so, however, we need to talk about mirrors...

Mirror Images

What better way to get a reflection than use a mirror?

Sorry, I had to get that pun out. You probably know how Java reflection works: we have handles representing classes, methods, fields, etc. You then call methods on these handles to get other handles and, finally, you pass an instance to a method on one of these handles to act upon that instance. For example, from Class you get Method, and on Method you call invoke passing an instance to execute it.

Now, there are some problems with that, but I'm not the person to argue them. I refer you to Reflecting Scala by Yohann Coppel (PDF), and to Coppel's own main reference, Gilad Bracha's Mirrors: Design Principles for Meta-level Facilities of. Object-Oriented Programming Languages (PDF).

I'll concentrate on what they are and how you use them. From Reflection Pre-SIP, a mirror defines a hierarchy of symbols starting with the root package _root_ and provides methods to locate and define classes and singleton objects in that hierarchy.

Each universe may have one or more mirrors, and each mirror is tied to a Class Loader. This latter point is very important, because class loaders may load different classes with the same name. Of note, as well, class loaders may delegate to each other, so finding a class may involve doing the same search through a chain of class loaders.

Mirrors mostly abstract all this away, but, as a consequence, they are tied to class loaders, which means you must reflect through a mirror that knows the class of that instance.

There's a default mirror that will usually work, scala.reflect.runtime.currentMirror. That will use the class loader of the class that invokes it to create the mirror, which will usually be ok, but I'll avoid using it. Instead, I'll use a method provided by the universe to create a new mirror based on the class loader for the class of the instance. It goes like this:

runtimeMirror(obj.getClass.getClassLoader)

That's pretty straight-forward, and, I, at least, got used to it pretty quickly. I had a bit of resistance at first, but knowing it will always get the right classloader makes me feel more confident.

Now that we saw two distinct ways of getting a Mirror, let's talk about how we use it. Mirrors are supposed to reflect the language, and not its compile time representation. Each instance, class, field, method and module (packages or singleton objects) is reflected through a corresponding mirror.

Through these mirrors we can act on the things they mirror. Note, however, that you do not pass the instance to these mirrors: you generate a mirror for the instance (an InstanceMirror), and, from there, you can get mirror to its methods, fields and class. You can also get a ClassMirror without an instance, but you won't be able to do much more than get a mirror for the constructor or its companion object.

Let's put that in another way. A MethodMirror is associated not only with a method, but with a method on a specific instance. You don't pass an instance to it to invoke the method, you just invoke it.

Of Mirrors and Types

So, to put it all together, to invoke a method on an instance through reflection you need to:

val obj = "String"
val objClass = obj.getClass
val classClassLoader = objClass.getClassLoader
val classLoaderMirror = runtimeMirror(classClassLoader)
val classSymbol = classLoaderMirror.classSymbol(objClass)
val classType = classSymbol.typeSignature
val methodName = newTermName("length")
val methodSymbol = classType.member(methodName).asMethod
val instanceMirror = classLoaderMirror.reflect(obj)
val methodMirror = instanceMirror.reflectMethod(methodSymbol)
methodMirror.apply() // == (obj.length: Any)

Got all that? That, of course, assumes we know that there is a method named length on the thing we are reflecting about. If that we not the case, we would get a nasty surprise when we tried to retrieve the method's symbol.

More importantly right now, we also saw how we go from an instance to that instance's type, though we loose any type parameters along the way (or, more precisely, they are abstract).

Untagged Serialization

Let's then rewrite what we had before, but without TypeTag.

import scala.collection.GenTraversable
def serialize(obj: Any): String = {
  val mirror = runtimeMirror(obj.getClass.getClassLoader)
  val objType = mirror.classSymbol(obj.getClass).typeSignature
  val StringTpe = mirror.classSymbol(classOf[String]).typeSignature

  if (objType <:< StringTpe) s""""$obj""""
  else if (objType <:< definitions.AnyValTpe) obj.toString // DOES NOT WORK!!!
  else if (objType.baseType(typeOf[GenTraversable[_]].typeSymbol) != NoType) {
    val refinedType = objType.baseType(typeOf[GenTraversable[_]].typeSymbol)
    val TypeRef(_, _, argumentType :: Nil) = refinedType
    val objAsColl = obj.asInstanceOf[GenTraversable[_]]
    val recursedSeq = objAsColl map serialize
    recursedSeq.mkString("[ ", ", ", " ]")
  } else "Unknown"
}

Anyway, there's one big problem there, right? Because we are coming from Java reflection, comparing to AnyVal doesn't work at all. We can compare against the value types themselves, such as Int, but it won't work here because any Int we find with this method will be boxed, and Integer is not a subtype of Int.

So while we gained something, we lost the ability of handling AnyVal easily. What to do?

End of Part 2

What a perfect cliffhanger!

Hopefully, someone will tell me what to do before part 3 comes out, and it will look like I know what I am doing! :-)

More seriously, we have covered a lot of ground already. Whatever path we take from here will tread on familiar ground, just enriching what we have seen so far a bit more, making the code more elaborate.  Until we get to macros, anyway.

This is a good place to stop and consider what we have seen so far.

Scala 2.10 will come with a reflection library. That library is built with the cake pattern on layers, so that code common to standard library, reflection library and compiler gets shared. It also keeps the reflection library in sync with the compiler, and makes the reflection more faithful to the language.

The reflection layers are called Universes, and there are a few of them to choose from. A Universe contains APIs and implementations for a number of interrelated concepts. We saw the most important ones, and used Symbol, Type and Name quite a bit. We also saw a bit of TypeTag, which replaces Manifest on Scala 2.10 (yeah, I know -- I didn't tell you that before).

We were then surprised with the news that there was something else to learn about: mirrors! It turns out we can't actually do anything without a mirror to an instance, a class, a method or whatever. Mirrors are tied to class loaders, which introduces some complications and a little bit of ceremony, but they aren't really that hard to use.

Empowered with our newly gained knowledge, we proceeded to tack the serialization problem in two different ways, with mixed results.

At that point we were left in a major cliffhanger: What path to take? Is there a way around the problems? And what about macros?

Rest assured, because this is to be continued...

Epilogue

While I waited and waited and waited, Alois Cochard wrote Sherpa, a macro-based serialization toolkit. I'm guessing one can learn a lot by looking at its source code!