Postmortem
This post is meant for myself.
For the past six weeks, I have been working hard on a system targeted exclusively at election day this past Sunday, October 7th 2012, in my country. Granted, I took a few days off to go to Agile Brazil 2012, but I repaid every one of them with plenty of interest.
So I write this for myself, as a postmortem, to help me clear up some ideas about it all. You see, this was the first time I had the opportunity to use Scala for anything critical -- or, for that matter, larger than a script -- as part of my job.
So, seriously, you don't have to read this. I'm sure there's some other post or presentation about Scala that's bound to be way more interesting than this, and which you haven't seen yet. Look up Scala Days, Skills Matter or NE Scala, if you are out of ideas. :-)
The Setting
Since you are still with me, I suppose I should explain a bit of what the program was intended to do. It should also help me figure out how well everything turned out, scope-wise.
For a critical project, I'm a bit ashamed to admit it was nothing more than an ETL -- Extract, Transform and Load. It was supposed to download election data, extract relevant information, and load it into a database. A few qualities were deemed paramount:
- Resilience. The program should be able to withstand processing errors such as invalid data, connectivity problems and such. Depending on the kind of error, it would either retry or ignore -- but always log!
- Reliability. If the program deemed a particular process to be successfully concluded, it damn well better be both error-free and finished. That was harder than it sounds, and something I should look into further.
- Real Time. This program would feed systems used to provide news feeds on the election, so the sooner it got news out, the better. There was actually software provided to do the extraction in batch, which set the bar for us.
- Resource Aware. This was a vague requirement, and not everyone thought it was a requirement at all. The point was being able to use lots of resources to extract data, but without going overboard and somehow getting throttled.
Highlights
Before I go any further into the real meat -- the problems -- I want to highlight the things that really shone, in my opinion.
Dispatch
This was the first time I used this controversial Scala library -- scala.io.Source had sufficed for me before. I understand the current generation, 0.9.x, has some important differences from previous ones. It has a much reduced set of operators, and it has full support for operator-free use. Also, it's asynchronous, which is where the fun came in.
Let me be plain: Dispatch is awesome, and that has nothing to do with operators. Sure, the clean syntax is nice, though sometimes I'd prefer it had better support for tuple varargs. But that's just a side show. No, the main game is its asynchronous nature, and how it got implemented.
Dispatch is based on Async HTTP Client, which provides a nice layer over Java's asynchronous I/O libraries. And, of course, Dispatch provides a Scala layer over Async HTTP, making it even nicer. It does not, however, hide Async HTTP. On the contrary, you work with Async HTTP objects at all times, and have the Dispatch abstractions available as enhancements.
Now, doing I/O asynchronously means you make a request and tell the library what to do with the result once the request is done, instead of waiting for the result and then doing stuff with it. That's what you see with NodeJS callbacks, for instance, and anyone who has worked long enough with those should be at least wary of their traps.
The abstraction Dispatch provides to work around that is called Promise, though it's closer to what Scala 2.10 will introduce as Future. Nope, not the old actor-based Future -- it's actually something everyone has been implementing on their own (Dispatch included), which is what prompted its inclusion in the standard library. See Scala Improvement Process 14 for more details. No, really, see it.
The gist of it (Dispatch-style) is that you are waiting for something to be completed (like the HTTP request you made), and you are holding onto an object, the promise, that will get the result to you once it's finished. If an exception is thrown by the process, it gets encapsulated in the promise as well: extract the result, and the exception is thrown at the point of extraction. And, of course, Scala-style, you can map (or flatMap that shit) the results as you wish.
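Here's a minimal sketch of what that flow looks like in Dispatch 0.9.x -- the endpoint is made up, but the calls are the library's own:

```scala
import dispatch._

// A made-up endpoint standing in for the real election data source.
val results = url("http://example.com/election/results.json")

// Nothing blocks here: we get a Promise that completes when the response does.
val body: Promise[String] = Http(results OK as.String)

// Transformations are scheduled to run once the response arrives.
val size: Promise[Int] = body.map(_.length)
for (b <- body) println("got " + b.length + " bytes")
```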
One thing I like better about Dispatch's Promise than Scala's Future is that it provides methods to convert a Promise[T] into a Promise[Option[T]] or a Promise[Either[Throwable, T]], which makes error handling nice. I don't understand why Scala's Future doesn't have these simple interfaces. Option seems to require at least a one-liner through andThen and toOption on a Try, and getting an Either out seems even harder. On Dispatch, promise.either returns a promise for Either, promise.option returns a promise for Option, and that's that.
Dispatch also has something to convert an iterable of promises into a promise of an iterable. Happily, Scala 2.10 has that as well, and more.
So, anyway, all this meant it was very easy to chain transformations triggered by HTTP requests, logging errors where it made sense, flattening unsuccessful computations into None, and passing Either to code that could make sense out of it.
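As a sketch of the pattern (again with a made-up URL), the error handling ends up looking something like this:

```scala
import dispatch._

val req = url("http://example.com/election/zone/42.json") // hypothetical

// Failures stop being exceptions-in-waiting and become plain values:
val guarded: Promise[Either[Throwable, String]] = Http(req OK as.String).either

val result: Promise[Option[String]] = guarded.map {
  case Left(error) => println("download failed: " + error.getMessage); None // log and flatten
  case Right(body) => Some(body) // hand the payload downstream
}
```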
Performance and concurrency, easily understandable, easy to manipulate, FTW.
Specs2
I had some pretty tough challenges testing everything. Most of the code was written TDD-style, but there was plenty of XML, JSON, timing issues, concurrency issues, a dozen mostly similar classes that had to be tested for their small differences, etc. It's not that it was hard, though the wrong framework certainly could have made it so, but it made for some pretty gruesome tests, and one wants tests to be as easy to understand as possible.
Time and time again, Specs2 came through with the goods. I had used Specs2 a few times before but, interestingly enough, most of the features I was most familiar with (parser combinator testers, given-when-then combinators) were completely absent from my tests. To make up for it, I feel like I have now seen everything I had previously overlooked. I know that's not true, because I have a list, you see, but...
I'm not claiming Specs2 does anything other frameworks don't -- the same way I didn't know Specs2 features I came to use, I don't know how well other frameworks handle them. I'm just saying that there wasn't a single thing I needed that couldn't be done with it, and in a very satisfactory manner.
A special mention goes to Mockito, a great mocking framework made greater by how well Specs2 integrates with it.
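To give a flavor of it, here's roughly what those tests looked like -- the collaborator and values are hypothetical, but the mock[...], returns and "there was one(...)" syntax is Specs2's Mockito integration:

```scala
import org.specs2.mutable.Specification
import org.specs2.mock.Mockito

// Hypothetical collaborator, standing in for whatever does the real downloading.
trait Fetcher { def fetch(path: String): String }

class ExtractorSpec extends Specification with Mockito {
  "the extractor" should {
    "request the right file and use its contents" in {
      val fetcher = mock[Fetcher]
      fetcher.fetch("/totals.json") returns """{"votes": 0}"""

      fetcher.fetch("/totals.json") === """{"votes": 0}"""
      there was one(fetcher).fetch("/totals.json")
    }
  }
}
```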
Eric Torreborre
Eric is Specs2's author. The level of support he provided me was amazing, and anyone who spends some time reading the Specs2 mailing list will know that's how he treats everyone. Did you find a bug, or need something just a tiny little bit different from what Specs2 provides? Rare is the day Eric doesn't roll out a new snapshot with it, within 24 hours tops.
Do you have trouble understanding how to make your tests work a certain way, or how to do something in Specs2? Eric will discuss it until you come away enlightened.
I've always been a bit amazed at Eric's awesomeness, but having it while writing a critical piece of software made it really personal. So, thanks Eric.
Unfiltered
Unfiltered is a cousin (maybe a brother) of Dispatch, handling the other end of HTTP requests. What's nice about Unfiltered is that it works as a library, making it really easy to add small HTTP servers to your code. This wasn't part of the production code, but it made testing the parts of the code that made HTTP requests really easy.
I have no basis for recommending it as a general library (though it certainly bears investigation), but I heartily endorse it for those interface tests (as in, interfaces with other systems). It was doubly useful to me, since both my input and my output (the database I was feeding) were HTTP servers.
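A sketch of the idea, with Unfiltered 0.6-era calls and a made-up route -- a stub of the upstream system, spun up inside the test itself:

```scala
import unfiltered.request._
import unfiltered.response._

// A throwaway server on a random local port, answering just what the test needs.
val stub = unfiltered.jetty.Http.anylocal.filter(unfiltered.filter.Planify {
  case GET(Path("/totals.json")) => Ok ~> ResponseString("""{"votes": 0}""")
})

stub.start()
try {
  val base = "http://localhost:" + stub.port
  // ... point the code under test at `base` and assert on what it does ...
} finally stub.stop()
```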
AVSL and Grizzled SLF4J
These nice libraries by Brian Clapper made logging go smoothly. That was particularly striking given that some other modules of the project had a lot of trouble in this regard.
Anyway, I picked AVSL as a suggestion from Dispatch, which uses SLF4J-based logging. Knowing I could replace AVSL had a soothing effect, and it was pretty easy to get working quickly. I started out with AVSL-only logging to handle some immediate problems I was having, and then replaced it with Grizzled so I could swap AVSL out if needed.
AVSL was easy to configure and use, and it was particularly nice being able to get extra-verbose logging in some parts of my code while keeping the rest relatively quiet when I tried to figure out why something wasn't working as I expected.
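For the record, the Grizzled side of it is just a mixin -- something like this (class name hypothetical), with the nice property that the message argument is by-name and so isn't built unless that level is enabled:

```scala
import grizzled.slf4j.Logging

class Loader extends Logging { // hypothetical class
  def load(doc: String): Unit = {
    debug("about to load: " + doc) // string is only constructed if debug is on
    // ... the actual work ...
  }
}
```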
IntelliJ IDEA
In the first week, I worked solely with Sublime Text 2, since that was the preferred editor of the person I was pairing with. Given that he didn't have Scala experience, I thought it was fair to keep everything else as familiar to him as possible.
Then he moved on to some other part of the project that was demanding some urgent attention, and I was left finishing up the project (until the scope changed, that is).
Now, I'm a vim user, and happily so, but I came upon some issues that demanded heavy refactoring, and so I turned to IntelliJ. I wish IntelliJ did more -- Scala support pales in comparison to Java's -- but the final result would have looked a lot uglier than it does if not for IntelliJ.
I bought a personal license for it at the start of the third week, as I geared up to start working on the nodejs module. That never happened, but I don't regret spending that money.
TDD
So, I arrived at 9:20 am on Sunday to do the first load of static data for the election ("static" data could, and did, change at any time to account for judicial decisions). I ran the code and got some database insertion errors. Now, I had had some trouble with the database during stress tests, but this didn't look like the problems we'd had, and there certainly wasn't any stress going on.
So I investigated the problem, found the cause in some badly formatted input, changed the code to account for it, produced a new snapshot, tested it, and put it in production. I didn't ask for permission, I didn't talk to my colleagues to hear what they thought about it, I just did it -- not only without fear, but with absolute confidence that I hadn't broken anything with that change. On the morning of election day.
I've seen plenty of last-minute changes throughout my career, mind you, and I might even bet that more than half of them broke something else. Changing with confidence that nothing broke -- without breaking things! -- is something I have only known through TDD.
That was hardly the only time I got benefits out of TDD, but it serves to epitomize my whole experience. The scope changed on us many times, and I had to do fundamental changes to my code as result, but nothing that was working before ever stopped working.
Of course, I did introduce a few bugs despite TDD, though they were usually very short lived.
Elastic Search Head
This is an HTML5 application that serves as a general interface/console to the Elastic Search database. You can just clone the repository and open its index.html in your browser to use it -- there are no additional requirements.
I cannot imagine ever working with Elastic Search without it but, then again, there's absolutely no reason that should ever happen -- it's free, and the only requisite is an HTML5-enabled browser. If you have an Elastic Search you interact with, get this.
Furthermore, I'm now very unhappy with everything else for not having something like it. Well, maybe they do and I have lived in the dark ages so far. Now that I have seen the light, though, I have some pretty high standards concerning what kind of tooling should be freely available for a database of any kind.
Elastic Search itself deserves a good deal of the credit -- besides this one, there are many other tools available for it.
Bumps
Naturally, there were parts of the project that, while providing value overall, were besmirched by some problems. The following stuff provided overall net gains for the project, but I'm going to concentrate on what was difficult, as that's where improvements can be made.
SBT
You knew I was going to say that, didn't you?
SBT is a major source of love-hate relationships. I mostly like it, and get confused by why people seem to have so much trouble with it. It seems to come down to multi-module projects (that is, not using an internal repository to handle separate projects as separate projects) and plugins (and why do people need to do so much weird stuff anyway?). Well... it could stand to be more discoverable. Things like "show scala-version" vs "set scalaVersion" also get in the way and, to further that point, I still don't get exactly why it has to be this way.
But none of this was a problem for me. This was a single-module project, with a build.sbt and a project/plugins.sbt, and that was that. Setting things up was a simple matter of copy&paste from library and SBT plugin documentation. So, no, knowing what lines to put in these files never presented any problem.
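For the curious, the whole thing was in the spirit of the build.sbt below -- versions and the exact artifact list are from memory, so treat them as illustrative:

```scala
name := "election-etl" // hypothetical project name

scalaVersion := "2.9.2"

libraryDependencies ++= Seq(
  "net.databinder.dispatch" %% "dispatch-core" % "0.9.1",
  "org.specs2"              %% "specs2"        % "1.12.1" % "test",
  "org.mockito"             %  "mockito-all"   % "1.9.0"  % "test"
)
```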
My problem was getting SBT to run at all. Being a Scala geek, I've been writing my own sbt shell scripts for a long time now, downloading launcher jars and stuff like that, but what worked for me in my own free time was simply not good enough to get such a critical project going.
Simply put, people who were not me, and didn't really understand much of Scala, let alone SBT, had to be able to build and test the project with minimal instructions. Fortunately, the SBT site now has nice Debian packages for download. Only it doesn't, really -- it has a package that installs an APT source for the SBT package, and it turns out that I manage my sources automatically through Puppet, which made that very annoying. I had to install the package just to find out what the source was, so I could put it in the build server configuration.
Only that didn't really help. You see, I have been using SBT 0.12 since about the time it came out, and that's not the version available on the site above. Worse, some of the global plugins I have are incompatible with the version provided by the Debian package.
The problems didn't stop there, however. The developers' machines need to go out through an NTLM-authenticated proxy, with Basic authentication as a fallback. Unix tools in general, and Java in particular, seem to have a lot of trouble with this setup. Browsers have no problem whatsoever, but getting SBT to download itself (the parts that are not the launcher) takes some doing. Using the proxy, yes, that's easy. Getting it to authenticate, now, there's a challenge.
In the end, I ended up bundling SBT 0.12's jar, a hand-crafted sbt shell script, and some obscure configuration options to make it download everything from our Artifactory. That meant adding every single dependency to it, and some of those dependencies -- like Typesafe's -- were tough to get right. Lots of trouble with ivy2 repos vs maven repos, and lots of maven repos where I had to tell Artifactory not to validate the POMs.
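I won't reproduce the whole mess, but the build-side half of it boils down to replacing the default resolvers so nothing reaches for the outside world -- something along these lines, with a made-up URL (the launcher needs its own, separate configuration):

```scala
// In build.sbt: make the internal Artifactory the only place dependencies come from.
externalResolvers := Seq(
  "internal-artifactory" at "http://artifactory.example.com/repo"
)
```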
Once all of that was ready, the rest was a breeze. After all the trouble I'd had, I put off setting up the Jenkins build for a long time, only to get it working on the first try in less than ten minutes when I finally got to it. Code coverage reports, code duplication, style checks, source code x-ray, javadocs, single jar files, generation of an Ensime project for Sublime Text 2, generation of IntelliJ and Eclipse projects... almost all of it accomplished effortlessly.
Also, sbt ~test worked really well during the time I used Sublime Text 2, but I suppose that isn't a surprise to anyone. Or, at least, I sure hope so.
SCCT
This is the tool I used to generate code coverage. It mostly works, but I doubted it for a while. You see, you have to compile the whole project (or perhaps all of the tests) with its action for it to work. If it partially compiles anything, it only shows coverage for those files that were changed!
Once I figured that out, it was a simple thing to add a clean to the Jenkins job, or when running on my own machine. Since I only look at code coverage sporadically, this wasn't a problem. There was no need to run it on every Jenkins build, so Jenkins build times were not a problem either. If they were, it would be pretty easy to separate normal testing from full artifact generation.
So, what do I mean by "mostly"? It displays abstract classes and traits as untested, and it really bothers me to see all those 0%s where there isn't any code to be tested to begin with. It also marks literal constants (that is, untyped final vals on objects) as untested. That's fair enough, I suppose, and it probably can't be helped unless you get the compiler to do some instrumentation for you.
As for the rest, everything works, and it integrates nicely with Jenkins.
Scala X-Ray (Browse)
I don't know why I had so much trouble setting this up. I suspect I assumed it could only be an SBT plugin, so none of the instructions I found made any sense. Once I realized it is a compiler plugin and just added the two lines indicated by its site to my build.sbt, it worked. But I did spend a lot of time trying to figure it out, and didn't even get it working on the first attempt.
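For reference, the two lines were in this spirit -- coordinates and version are as I remember them, so double-check against the sxr site:

```scala
// sxr is a compiler plugin, not an SBT plugin, hence addCompilerPlugin:
addCompilerPlugin("org.scala-tools.sxr" %% "sxr" % "0.2.7")

// Tell sxr where the sources live so it can link them up.
scalacOptions <+= scalaSource in Compile map { src =>
  "-P:sxr:base-directory:" + src.getAbsolutePath
}
```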
Elastic Search
After a brief flirtation with some other solutions, we settled on Elastic Search as the data store. It was not exactly what we wanted, but there wasn't much time to explore alternatives. ES was something I was mildly familiar with, and given that most of the project was written in JavaScript, interacting with ES was really easy. It also benefited from Dispatch on the Scala side.
Unfortunately, the actual load we put on it was very different from what we expected. We expected lots of reads and few updates, but got roughly the other way around, and that caused performance issues (on stress tests only, though -- normal loads were fine).
Also, its logs didn't help much when we started having problems. In the end, it turned out that one of the tools we were using to monitor it, BigDesk, was actually causing it to crash. We stopped using it and, aside from not being able to handle the stress-test loads, Elastic Search worked fine.
Its HTTP/JSON interface really did help get everything working easily, and its data model, while targeted at uses rather different from ours, worked fine for us.
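Since everything goes over HTTP, indexing a document from the Scala side was just another Dispatch request. A sketch, with made-up host, index and document (the PUT/setBody calls come from the underlying Async HTTP request builder):

```scala
import dispatch._

// Hypothetical document being indexed into a hypothetical index.
val put = url("http://localhost:9200/election/results/zone-42")
  .PUT
  .setBody("""{"zone": 42, "votes": 1234}""")

// As before: an Either, so a failed insertion is a value we can log and retry.
val stored: Promise[Either[Throwable, String]] = Http(put OK as.String).either
```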
The Ugly Parts
I can't honestly point at anything and say it was a major problem. The closest thing to that would be the changes in scope with progressively tighter deadlines (it's not like election day could be postponed, after all), but that comes with the job, and it's our responsibility to deal with it the best we can.
Some problems resulting from ill-conceived solutions to the new requirements marred the end result, but the alternative to having something that didn't work well would have been having nothing at all. The original scope was fully delivered, and worked with a single hitch.
In Hindsight
So, what did I learn from all that?
I'd like to start with some lessons about Scala. I wonder what maturity level my code displays... There are no type classes, just two implicit declarations, and hardly any of the more complex code you may have seen me write here and elsewhere. There's not a fold to be seen anywhere.
And yet it's mostly functional. There are some things I wished to make immutable, but I couldn't figure out how to handle the concurrency requirements in a functional manner. There are a few places where I knew a way to make something immutable but chose not to, and ended up regretting it. And yet it looks -- to me -- pretty normal.
Maybe I'm wearing blinders. There are certainly plenty of maps and a few flatMaps, which couldn't really be otherwise given the role Dispatch plays in it. But, to me, the code looks downright tame. That surprised me a lot, given the code size.
Very few case classes, but just perfect when I needed them. Some sealed traits and objects that could have been replaced by a decent enumeration (but do have a small advantage). One implicit conversion to a tuple so that I could use the unzip method, and one other place where I needed to make an Ordering implicit to get around the fact that it isn't contravariant.
Named and default parameters were put to very good use -- I wish Scala would let me get away with using named parameters when interfacing with Java. It might not be able to check the parameter names, reorder parameters, and there are certainly no defaults, but method calls with booleans and primitives feel downright naked without the parameter name.
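A trivial illustration of the point, with a hypothetical signature:

```scala
// The defaults cover the common case...
def fetch(path: String, retries: Int = 3, followRedirects: Boolean = true): String = "..."

// ...and at the call site the boolean carries its meaning with it:
fetch("/totals.json", followRedirects = false)
```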
Test code was another matter -- Specs2 brings plenty of magic to the table, but it's standard magic, so to speak, which makes it possible for people who don't understand it to use it by rote.
So, what about actors? Given my experience with Scala and the concurrent nature of my problem, you might have wondered why I didn't use actors to solve it. And, if that didn't occur to you naturally, the title of this post should certainly have done so!
It's... complicated. The original scope didn't look like it would benefit from using actors, and I still think so. As I originally envisioned it, a main thread would start Dispatch downloads, and the mappings done to the Promise returned by them would handle everything else. Nothing would ever be returned to the main thread. A single semaphore would put a hard limit on how many simultaneous downloads would be allowed, and the main thread would sleep until the time for more downloads came.
In that scenario, using actors would just add overhead. However, things changed. The first big change was that interaction with the other modules would have to be batched. That's something I was never comfortable with, since it lost precious seconds gained by the ETL's architecture. But, most importantly, it meant information would have to flow from the threads triggered by the async downloads back to the main loop. At that point, we have a bi-directional conversation, which is where actors start getting interesting.
Still, the interaction was very simple, and none of the more sophisticated interactions enabled by actors would be of any use. As things progressed, however, that changed: the contents of some downloads started dictating the need for other downloads, which could trigger yet other downloads, and the failure of any one of those downloads should dictate the outcome of the ones that triggered it.
And now you have a complex net of interactions, some of which would benefit from let-it-fail and corresponding recovery mechanisms. It now looks like a perfect fit for actors.
So: the code I wrote does handle almost every error condition, and with some refactoring it could have handled all of them. However, what started out natural and intuitive became confusing. If I had used a simple actor setup right from the start, bare-bones as it would have been, I could have pursued better solutions as the requirements changed.
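For concreteness, the bare-bones starting point I have in mind would be something like the Akka sketch below -- every name here is hypothetical, since none of this was actually written:

```scala
import akka.actor._

// Messages for a minimal net: sources ask for downloads, outcomes come back as messages.
case class Download(path: String)
case class Downloaded(path: String, body: String)
case class Failed(path: String, error: Throwable)

// A source decides what to fetch and reacts to outcomes; downloads that dictate
// further downloads are just more Download messages sent to the downloader.
class Source(downloader: ActorRef) extends Actor {
  def receive = {
    case Downloaded(path, body) => () // extract, load, maybe send more Download messages
    case Failed(path, error)    => () // recover: retry, or fail whatever triggered this request
  }
}
```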
The other, more troubling, consideration is that, given the batching limitation imposed upstream from my system, maybe we should have gone with the standard downloader and a bare bones extraction&load script.
In Conclusion
Overall, I don't regret any of the technical choices I made. Surely a second version would benefit from hindsight (and be vulnerable to second-version syndrome), but the choices I made proved equal to the task, and quite adequate to withstand the changing requirements and short deadlines.
As a side note, this project could just as easily have gone with Ruby as with Scala; it was mere coincidence that pushed it to Scala. I wonder how that would have turned out -- an experience I'd have loved to have.
And if you have read it all and think it was a waste of time, don't say I didn't warn you! :-)
Comments

Thanks Daniel for your kind words and I'm glad that specs2 helped!
That's really well written and full of useful insights and ideas. I also like your level-headed look at Scala: "there are some things that would be nice, but you know what, in the grand scheme of things, it's all right". In light of the favourite Ruby / RoR bashing, I appreciate your point that the project may have gone just as well in Ruby. It's just a shame that no-one will ever know.
Great article Daniel, enjoyed the read. I had a similar situation where we went with actors, and I was wondering whether futures would result in code that is more straightforward and easier to test. I would be curious to hear your thoughts on that. For us batching was definitely a factor, and it turned out nice to implement via actors. I'm also curious whether a mixed approach would work as well (perhaps an actor at the lower level that could return a future via 'ask' to the higher-level code).
Hi, Alex. My current thinking on the best solution to the final requirements is a mixed approach. A net of actors comprised of "source" actors, which control which files to download and when, plus a single downloader actor which would throttle the number of concurrent downloads. The downloads would be Dispatch Promises, which would send a response message at the end of their processing but, otherwise, would be left to the java.nio onSuccess/onFailure thread pool (behind the scenes -- Dispatch abstracts that away).
There's another type of actor that would be responsible for notifying the other system modules. Its present implementation suffices, but it is unsophisticated. In an actor system, a main actor would spawn other actors to notify specific modules, and maybe also send them the notification batches. That would allow fine control over how to handle problems notifying other modules.
The key here is that I originally had a single source and, by the end, had over 30 distinct sources with varying degrees of interdependency, and could well have used an even finer-grained way of identifying files to download and scheduling them (the problem is really just a matter of prioritizing -- downloading everything that's been updated is straightforward).
On the one thing you like better about Dispatch's Promise than Scala's Future: will you get that feedback to the Scala team so that it has a chance to improve post-2.10? Thanks much, Daniel, for sharing your experience and thinking in detail.
Best,
Jon
Thanks for sharing your experience.

What are your thoughts on the applicability of reactive streams (enumerators/enumeratees), stream interleaving, etc., in Play! Framework 2 to these problems?
@Irfan: I've pondered many times during development whether I shouldn't go with an iteratee-based solution, but I couldn't visualize how to handle the concurrency requirements with iteratees.