Sunday, February 24, 2008

Guidewire Development Blog

Developers from Guidewire customers have been commenting on my blog, so I would like to mention that Guidewire now has its own development blog, which I think you will find interesting.


Proposal for Agile 2008 Submitted

Even though I still have one more post to go on the topic of Enterprise Agile Testing, the submission deadline for Agile 2008 is approaching. So I looked at the other submissions, looked at what I have written, and submitted my proposal.

Feedback is welcome, either through the submission page or my website.

Wednesday, February 20, 2008

Enterprise Agile Testing Part III: Managing Tests with ToolsHarness, Individually

This is the third part of the Enterprise Agile Testing series. Continuous integration has proven to be one of the most important practices in agile software development. Every time a developer checks in code, the resulting code base is rebuilt and the tests are run against it. The result of the integration tells everyone on the project whether the code base is good enough for release. Some prefer synchronous continuous integration through a push-button process over an asynchronous process through a tool like CruiseControl, but everyone agrees that it is very useful.

The Difficulty of Holding the Line

With or without a tool, the most difficult part of installing such a process is probably holding the line of "zero broken tests". In my past consulting and coaching experience, it sometimes took great effort and time to get the team into the habit of running all the unit tests before checking in code, of treating test writing and test fixing as the highest priority, and of tuning the tests so that the whole process does not exceed ten minutes. Even then, not all the teams kept up the practice after we left the project on a good note.

I recently got a chance to catch up with Greg, one of the ex-ThoughtWorkers that I used to work with and respect. He showed interest in what I am writing and shared his opinion, which I quote:

- Our test suite is too large and too slow to run with every build. We are lucky to get results once a day.
- Not everyone cares about the unit tests to the same degree. Some people are too busy to track down failures right away. Not everyone sees the value in the unit tests, mostly because our coverage figures aren't high enough.
- Not everyone has the skills to write decent tests or design their code in a modular, testable fashion.

When I read that email, I became more motivated to work on this third post, because it is exactly what I want to write about. In my previous job, I found that the only way to make agile development work was to follow the XP practices, especially when it comes to continuous integration. In an enterprise environment (as defined in the Introduction), there seemed to be no middle ground between "green all the time" and "red all the time". It seemed that the moment the team fell off the status of zero broken tests and couldn't recover quickly, they would be in a deep hole right away.

The first thing that I have learned at Guidewire is that the above problems can be solved better with the help of a comprehensive continuous integration tool, ToolsHarness in this case. I am not saying this is a silver bullet, because a lot of development practice and discipline is still required. But I can certainly say that I am now seeing the light.

Test Farm for Parallel Testing

No matter how hard you try, eventually you will not have enough time to run all the tests that you would like before checking in code. When this becomes the case, broken tests will become the norm rather than the exception, and the XP way of handling broken builds will not apply anymore. For a complicated system, the time to run the full set of tests will be huge. At the same time, agile software development dictates fast feedback on code changes. The longer the turnaround time is, the more friction there is in iterative development. How can a team build on top of something not yet proven to be working and still expect high throughput?

Guidewire has a big testing farm, composed of dozens of machines, mostly running Linux and configured with the H2 database. These machines are set up so that when a build is available, they check out the test suites to run.

When a developer checks in a change list, ToolsHarness first pulls the source and does a full build to make sure that the projects still compile. Once compilation finishes successfully, ToolsHarness posts the build for testing. The tests used to be divided into different suites based on which test class they extend, for example, Database test, Metadata test, or Server test. With the introduction of TestBase, they have all been converted to NewTestSuite and NewSmokeTestSuite. (The difference is that the tests in NewSmokeTestSuite are acceptance tests, which are full end-to-end tests and require additional sample data.)

Based on the previous run of the tests, the suites are divided evenly into several parts, so that each testing machine can check out one part and run it. Through this parallel testing, it takes no more than 20 minutes to run a suite that would normally take hours to finish. This system is highly scalable, because all you need to do is add more machines. With such a fast feedback loop, developers can work on a large code base and still make medium-sized changes. The worst thing that can happen is having to revert a change that you made less than half an hour ago.
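
To make the splitting idea concrete, here is a minimal sketch in Java. It is only an illustration, not ToolsHarness code: the names SuitePartitioner and Part are hypothetical, and the greedy "longest test first" heuristic is my assumption about one reasonable way to balance the parts using the durations recorded in the previous run.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical sketch, NOT ToolsHarness code: splits tests into balanced
// parts using the durations recorded in the previous run.
class SuitePartitioner {

    // One slice of the suite that a single test machine would check out.
    static final class Part {
        final List<String> tests = new ArrayList<>();
        long totalMillis = 0;
    }

    // Greedy heuristic (an assumption): hand the next-longest test to the
    // part that currently has the smallest total duration.
    static List<Part> partition(Map<String, Long> previousDurations, int machines) {
        PriorityQueue<Part> lightestFirst =
                new PriorityQueue<>(Comparator.comparingLong((Part p) -> p.totalMillis));
        List<Part> parts = new ArrayList<>();
        for (int i = 0; i < machines; i++) {
            Part part = new Part();
            parts.add(part);
            lightestFirst.add(part);
        }

        // Sort the tests from slowest to fastest.
        List<Map.Entry<String, Long>> byDuration = new ArrayList<>(previousDurations.entrySet());
        byDuration.sort(Map.Entry.<String, Long>comparingByValue().reversed());

        for (Map.Entry<String, Long> test : byDuration) {
            Part lightest = lightestFirst.poll();   // remove before changing its total
            lightest.tests.add(test.getKey());
            lightest.totalMillis += test.getValue();
            lightestFirst.add(lightest);            // re-insert with the new total
        }
        return parts;
    }
}
```

With a split like this, the wall-clock time of a run is roughly the total of the heaviest part, which is why adding machines shortens the feedback loop.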

It is easy to run the tests against different databases and J2EE servers too. Some testing machines are configured with an Oracle database or a SQL database, some run on the Windows platform, and some use Tomcat or WebLogic. As I am writing this post, the tools pod (the team in charge of developing this server) is working on ‘customer build testing’, so that the test environment will match the production environment exactly when running the acceptance tests.

Tracking Tests Individually

With a large team, you will have developers with different skill sets. While it is easy for a senior developer to be conscientious about making small changes at a time, and to identify the problem from a broken test, it normally takes a junior developer much longer to fix it. Unless all your senior developers happen to be good coaches, you are going to be stuck with broken tests popping up here and there for a while.

When a testing machine is done with its tests, it posts the test results back to ToolsHarness. ToolsHarness parses the XML file and stores the result of each test in the database. The benefit is that developers can track the tests individually. When a test is broken, ToolsHarness makes an educated guess, based on the change list and the changed package, to assign it to a developer. If the guess turns out to be wrong, the developer can easily reassign it to the right person.

When a developer logs into ToolsHarness, the first page, Desktop, contains a list of the tests that have been assigned to him or her. In this way, you won’t be distracted as long as there are no tests assigned to you. The summary of all the tests is also on this page, so that you know not to check in anything when there are hundreds of broken tests, and so that the senior developers get a cue to help others fix tests.
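
As an illustration of what an "educated guess" based on the changed package could look like, here is a small Java sketch. It is not how ToolsHarness actually does it; FailureAssigner, ChangeList, and the rule of picking the change whose package shares the longest prefix with the failing test's package are all assumptions made for this example.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical sketch, NOT ToolsHarness code.
class FailureAssigner {

    // A checked-in change: who made it and which packages it touched.
    static final class ChangeList {
        final String developer;
        final List<String> changedPackages;

        ChangeList(String developer, List<String> changedPackages) {
            this.developer = developer;
            this.changedPackages = changedPackages;
        }
    }

    // Assumed rule: blame the change whose changed package shares the most
    // leading package segments with the failing test's package.
    static Optional<String> guessOwner(String failingTestClass, List<ChangeList> recentChanges) {
        String testPackage = packageOf(failingTestClass);
        String bestDeveloper = null;
        int bestScore = 0;
        for (ChangeList change : recentChanges) {
            for (String changedPackage : change.changedPackages) {
                int score = sharedSegments(testPackage, changedPackage);
                if (score > bestScore) {
                    bestScore = score;
                    bestDeveloper = change.developer;
                }
            }
        }
        // Empty when nothing overlaps; a person would then assign it manually.
        return Optional.ofNullable(bestDeveloper);
    }

    static String packageOf(String className) {
        int lastDot = className.lastIndexOf('.');
        return lastDot < 0 ? "" : className.substring(0, lastDot);
    }

    static int sharedSegments(String left, String right) {
        String[] a = left.split("\\.");
        String[] b = right.split("\\.");
        int shared = 0;
        while (shared < a.length && shared < b.length && a[shared].equals(b[shared])) {
            shared++;
        }
        return shared;
    }
}
```

Any heuristic like this will sometimes guess wrong, which is why being able to reassign a test by hand matters.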

For each test, you can see the failure message in the form of a stack trace, the change lists associated with it, and the history of the test, all of which help you figure out why the test is broken. You can also look into the log directory to see any additional generated files, such as server logs and HTML page snapshots.

If you have written a broken test that you cannot yet fix, you can annotate it with KnownBreak, and it will show up as such in ToolsHarness. If you have determined that a test is failing non-deterministically but you cannot yet figure out why, you can mark it with NoneDeterministic, and it will show up as such in ToolsHarness. The key is to keep the noise of broken tests to a minimum, if not zero, so that developers are notified accurately and fix the tests effectively.
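
Here is a rough sketch of how marker annotations like these can be used to sort failures into buckets. The annotation declarations below are hypothetical stand-ins that only reuse the names KnownBreak and NoneDeterministic; the real definitions, and the way ToolsHarness reads them, may well be different. FailureClassifier and SampleTests are made up for the example.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;

// Hypothetical stand-ins for the annotations named in this post.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface KnownBreak {}

@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface NoneDeterministic {}

// Hypothetical classifier: decides how a failed test should be reported.
class FailureClassifier {
    enum Status { NEEDS_ATTENTION, KNOWN_BREAK, NON_DETERMINISTIC }

    static Status classify(Method failedTest) {
        if (failedTest.isAnnotationPresent(KnownBreak.class)) {
            return Status.KNOWN_BREAK;
        }
        if (failedTest.isAnnotationPresent(NoneDeterministic.class)) {
            return Status.NON_DETERMINISTIC;
        }
        return Status.NEEDS_ATTENTION;   // only these should generate noise
    }

    // Convenience overload that looks up a public no-argument test method.
    static Status classify(Class<?> testClass, String methodName) {
        try {
            return classify(testClass.getMethod(methodName));
        } catch (NoSuchMethodException e) {
            throw new IllegalArgumentException("No such test: " + methodName, e);
        }
    }
}

// Example test class used only to exercise the classifier.
class SampleTests {
    @KnownBreak
    public void testNotYetFixed() {}

    @NoneDeterministic
    public void testFlaky() {}

    public void testRegular() {}
}
```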

Localizing the Damage through Branches

With aggressive refactoring, you will not be able to leave your platform code alone. Sometimes you know that the only way to be sure is to check in the code and let the continuous integration server run the full tests on the changes. With this approach, you risk putting the build into an unstable state for a while before you can figure out the best solution. If the whole development team has to rely on a good build, they will either be out of commission for a while, or they are going to accumulate changes that will cause another wave of instability after you are done. And that is if you are lucky enough to quickly finish with the cycle of check in, revert, revert the revert to make more changes, check in, revert, ...

Sometimes, especially for the platform team and the application framework team, you need to make a big change in the code base. When it turns out to break lots of tests in ToolsHarness, the best thing to do is to move forward by checking in more fixes, instead of reverting the change and doing it again. The only problem is that the code base becomes unstable during the process. If you have a large team with others working on other areas at the same time, the number of broken tests can be disturbing. And as Greg pointed out at the beginning, not every team cares about the tests in the same way. Some would prefer finishing the job at hand before tracking down the broken tests. The line for non-deterministic tests is even blurrier.

At Guidewire, the way to make this work for all the teams working on the same code base is through branches. Each team works on an 'active' branch, where they are free to do whatever they find most productive. The only rule that they need to follow is to fix all the tests before pushing changes to the 'stable' branch, from which every team pulls changes on a daily basis. If it takes a team a couple of weeks to get into a stable state, then they will have to risk merge conflicts. For a team that is diligent about fixing tests, their branch will be stable most of the time, and pushing will come much easier. It is exactly what most agile books recommend for checking in code -- check in as often as possible as long as the tests are passing -- except at the level of branches.

Sometimes you can still make a mistake and push a broken test (or more) into stable. In this case, there is a 'merge' branch that you can use to fix the build. Most of the time, however, the fix is very easy, and all you need to do is find out which team is next in line to push to stable and manually integrate your fix into their branch. There are merge scripts written in Ruby to help with the pull and push process. They are very robust and well tested, so the majority of pushes and pulls are merely a push-button process, i.e., you type in the command "merge.rb --pull", and you are good to go. We have a merge machine set up specifically for this job, so that the merger doesn't have to give up his or her local resources.


As many articles, books, and blogs have pointed out, people are always at the center of the agile development process. Even with a powerful tool like ToolsHarness, it is still up to the team to apply discipline and agile practices. Because the team does not have to stop everything to fix broken tests, it is actually easy for people to ignore the tests. Given enough time, enough code changes will have been checked in to make fixing the tests much harder than it should be.

So the rule of thumb is still the same: fix any broken test as quickly as possible when it comes up. The old tricks still apply, including reverting the changes that broke the build, making small changes, checking in often, monitoring email notifications, running tests before checking in, etc.

Friday, February 08, 2008

Enterprise Agile Testing Part II : Test Environment Set Up with TestBase

This is the second part of the Enterprise Agile Testing series (not exactly following my original order here):

  • Introduction
  • Test fixtures like assertion, builder
  • Test Environment Set Up with TestBase
  • ToolsHarness, a continuous integration server farm that treats tests individually
  • Active and stable branch, localizing the damage

Testing through Inversion of Control (IoC) Container

(For concept of IoC container, see Martin Fowler's article: Inversion of Control Containers and the Dependency Injection Pattern)

Ever since testing through dependency injection was formally named, it has become the most popular pattern for unit testing. You control the environment in which the class is tested by carefully constructing the classes it depends on before injecting them into the class under test. In this style, a typical test is composed of three parts. Different people name them differently, but the naming that I like is the one I learned when I presented the "Given, When, Ensure" notation of jBehave at a BayXP meeting:

* Assemble: Construct the environment that the test is going to run in.
* Act: Invoke the method(s) that you want to test.
* Assert: Assert that the tested method has caused the predicted change in the environment.
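
The three parts can be seen in a tiny self-contained Java example. Everything in it (MessageSender, ClaimService, and the hand-written recording stub) is hypothetical and exists only to show the shape of such a test; it does not come from any Guidewire code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical dependency of the class under test.
interface MessageSender {
    void send(String message);
}

// Hypothetical class under test; its dependency is injected.
class ClaimService {
    private final MessageSender sender;

    ClaimService(MessageSender sender) {
        this.sender = sender;
    }

    void close(String claimId) {
        sender.send("closed:" + claimId);
    }
}

class ClaimServiceTest {
    // Returns the recorded messages so callers can inspect them as well.
    static List<String> runCloseScenario() {
        // Assemble: build the world as the class under test sees it.
        List<String> sent = new ArrayList<>();
        MessageSender recordingSender = message -> sent.add(message);
        ClaimService service = new ClaimService(recordingSender);

        // Act: invoke the method under test.
        service.close("C-42");

        // Assert: check the change the method caused in its environment.
        if (sent.size() != 1 || !sent.get(0).equals("closed:C-42")) {
            throw new AssertionError("Unexpected messages: " + sent);
        }
        return sent;
    }
}
```

Here the "Assemble" step is three lines. The trouble starts when the class under test needs dozens of collaborators, each with dependencies of its own.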

It is safe to say that anyone who has done enough testing won't have any problem with "Act" and "Assert". It is the "Assemble" that has been giving us trouble. The following is an illustration extending PicoContainer's diagram.

To test the class marked by the big arrow, you need to create the world as this class sees it, then invoke methods on the class and assert the changes caused by the invocation. Among the ways of constructing the world according to the class under test, stubs and mocks are probably the only patterns that have been well documented. As indicated by the article, each solution has its own limitations. For a small to medium-sized application, these kinds of tests are generally manageable. But if you have done enough enterprise application development (as defined in the Introduction), then you have probably seen your fair share of mocks and stubs getting out of hand, as was the case for Guidewire tests until 2007.

Testing with a Loaded Container

During the past year, Guidewire has been slowly converting its tests to a home-grown JUnit extension framework. The framework does the heavy lifting of constructing the dependencies, so that by the time the test code is called, all the dependencies have already been set up properly. If you really want to, you can even access a full web container through the embedded Jetty server. By putting your class inside a full container, you get a lot of benefits that you normally won't get with a bare-bones unit test, and you get them without breaking a sweat.

The immediate benefit is that you no longer deal with mocking, stubbing, and guessing. When your test calls a method, you can be sure that the class is in the same state as when it is called in the real world (it might still not be in the state that you want for your test, but that is a separate issue). Without mocking and stubbing, you don't need to walk on eggshells any more as you change class responsibilities and collaborations. You can call into a real messaging manager through the container, enable a message destination, commit an entity, and test that the message for the change appears. All the code paths exactly match the real world, so you won't have any integration surprises down the road.

Because all the required validations are turned on, you are forced to create realistic data. With realistic data, your test becomes more realistic. You can put your test under debug mode at any time and get a good sense of what the data will be like in a real server. If you make a mistake and forget to set a non-nullable field, your test will blow up right away.

With a loaded container, you feel more confident in the class that you are designing. Because you can easily see how this class fits into the whole world, you can make sure it becomes a good citizen by doing just its job, no more, no less.

This framework is extremely flexible, which makes it very powerful. You can modify the testing environment by annotating your test and registering your own annotation handlers. In this way, you can add additional setup code without even creating your own super test base, a typical case of favoring composition over inheritance. You will see many annotations that we have already built in the past half year.
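
To give a feel for annotation-driven setup, here is a minimal Java sketch of a handler registry. It is not the Guidewire framework: AnnotationDrivenSetup, its register and prepare methods, and the @WithSampleData annotation are all hypothetical names invented for this illustration.

```java
import java.lang.annotation.Annotation;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical annotation that a test class could carry.
@Retention(RetentionPolicy.RUNTIME)
@interface WithSampleData {
    String value();
}

// Hypothetical registry: maps an annotation type to the setup code it triggers.
class AnnotationDrivenSetup {
    private final Map<Class<? extends Annotation>, Consumer<Annotation>> handlers =
            new LinkedHashMap<>();

    <A extends Annotation> void register(Class<A> type, Consumer<A> handler) {
        // Wrap the typed handler so it can be stored under the general type.
        handlers.put(type, annotation -> handler.accept(type.cast(annotation)));
    }

    // Runs the handler of every registered annotation found on the test class.
    void prepare(Class<?> testClass) {
        for (Annotation annotation : testClass.getAnnotations()) {
            Consumer<Annotation> handler = handlers.get(annotation.annotationType());
            if (handler != null) {
                handler.accept(annotation);
            }
        }
    }
}

// Example test class: it gets extra setup purely by being annotated.
@WithSampleData("claims")
class ClaimSearchTest {}
```

The point of this style is that a test opts into extra setup by adding an annotation, rather than by inheriting from yet another base class.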

Performance Improvement Considerations

Of course, all of this is much easier said than done. And we are sort of going against the conventional wisdom of unit testing here. The first question most readers will raise is probably "Loading the whole container for a simple unit test??? How can your tests perform!?" Please trust me when I say that I had the same concerns. But after working with it for half a year, I think this is definitely a good solution.

First of all, performance is overrated. No, I am just kidding. The first thing that I would like to say is that if you are a TDD veteran, in that you know how to design your classes so that you can manage your own dependencies well most of the time, then kudos to you: you can use the @RunLevel annotation to tell the framework not to do any setup for you (see below).

I was actually not totally joking. I would argue that for an enterprise application (as described in the Introduction), it is not uncommon for some part of the system to be designed less cleanly than it could have been. As a result, you have to choose between making the test run fast through the kind of mock where no one knows what is going on, or making the test run a bit slower but reflect the real system. Since design validation is the whole purpose of tests, I vote for testing the right thing with a bit of sacrifice in speed.

In addition, the test framework has a set of performance considerations in place to make sure that the tests perform well overall.

Run Level

Guidewire applications have the notion of run level as a way to bring the system online in stages. You can annotate each test with the desired run level to have just the things you need set up before the test. The following is the list of run levels that I have used.

* NONE: This is just like a good old JUnit test.
* Shutdown: At this level, you have all the system configuration read in and the metadata loaded. You can run any test that does not touch the database.
* No Daemon: This is the default value. At this level, you have the database connection initialized and the schema updated. You can run any test that hits the database.
* Multiple User: At this level, you have a full-blown application server with background batch processes running. This is typically used by QA for acceptance testing.
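
The staged idea behind run levels can be sketched as follows. Only the name @RunLevel and the four level names come from this post; the declaration below, the Level enum, the RunLevelResolver class, and the assumption that each level includes everything below it are mine, for illustration only.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the run levels listed above, lowest first.
enum Level { NONE, SHUTDOWN, NO_DAEMON, MULTI_USER }

// Hypothetical declaration; the real @RunLevel may look different.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface RunLevel {
    Level value();
}

class RunLevelResolver {
    // Tests without the annotation get the default level, No Daemon.
    static Level requiredLevel(Class<?> testClass) {
        RunLevel annotation = testClass.getAnnotation(RunLevel.class);
        return annotation == null ? Level.NO_DAEMON : annotation.value();
    }

    // Assumes levels are cumulative: reaching a level means starting every
    // stage up to and including it. NONE starts nothing at all.
    static List<Level> stagesToStart(Class<?> testClass) {
        Level target = requiredLevel(testClass);
        List<Level> stages = new ArrayList<>();
        for (Level level : Level.values()) {
            if (level != Level.NONE && level.compareTo(target) <= 0) {
                stages.add(level);
            }
        }
        return stages;
    }
}

// Example test classes used only to exercise the resolver.
@RunLevel(Level.NONE)
class PureLogicTest {}

class DefaultLevelTest {}
```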

Database Tests

By default, all tests use H2 as the embedded database, which greatly improves test performance. I have been a big fan of in-memory databases since HSQLDB. DBFixture is the proof.

During development, the database schema changes all the time. Guidewire products have a built-in upgrader that compares the database schema and automatically issues SQL statements to upgrade the database to the right schema. However, the upgrade process can take time. To save time, a backup copy is created after the upgrade finishes so that the database can be restored as necessary (see @ChangesSchema). There is one implementation for each database that we officially support, so all the tests can run on all databases if we choose to.

For each table there is also a shadow table that stores the default data set up by the test environment. Before each test run, the data in each table is restored from the shadow. In this way, different tests won't step on each other's toes and cause other tests to fail. For performance reasons, the data is only restored once for each test class, because it is easier to make sure that the test methods in the same test class don't affect each other's data.
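
The restore-once-per-class behaviour can be modelled in a few lines of Java. This is a deliberately simplified in-memory sketch, with lists standing in for database tables; ShadowRestorer and its method names are hypothetical, and the real framework works against real tables.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory model: each "table" is just a list of rows.
class ShadowRestorer {
    private final Map<String, List<String>> liveTables = new HashMap<>();
    private final Map<String, List<String>> shadowTables = new HashMap<>();
    private Class<?> lastRestoredFor;

    // Records the default data in both the live table and its shadow.
    void defineDefaultData(String table, List<String> rows) {
        shadowTables.put(table, new ArrayList<>(rows));
        liveTables.put(table, new ArrayList<>(rows));
    }

    List<String> rows(String table) {
        return liveTables.get(table);
    }

    // Called before every test method, but only restores when the test
    // class changes, so methods of one class share the same data.
    void beforeTestMethod(Class<?> testClass) {
        if (testClass == lastRestoredFor) {
            return;
        }
        for (Map.Entry<String, List<String>> shadow : shadowTables.entrySet()) {
            liveTables.put(shadow.getKey(), new ArrayList<>(shadow.getValue()));
        }
        lastRestoredFor = testClass;
    }
}
```

The trade-off is visible in the early return: skipping the restore saves time, at the price of trusting the methods within one test class not to disturb each other's data.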

Server Mode for Web Testing

The QA acceptance tests are written in GScript. When running in browser mode, they use Selenium to drive the browser to connect to the server and run the tests. However, when you have enough tests, the slowness of the browser really shows. Guidewire applications are built on top of the JSF framework, where the generated HTML source is driven by the page model on the server. With exactly the same scripts, we can run the tests in server mode, where the scripts are run against the page models in the server session. Without the browser layer, HTTP connections, and HTML generation and parsing, the test run time is cut down dramatically again.

Functional Considerations

The metadata layer of Guidewire applications is extremely extensible and configurable, and the SQL executed in the database layer is generated dynamically based on the metadata configuration and the database setup. It would not be practical to mock out the whole thing. The test framework provides a fixed out-of-the-box container for each test and locks it down so that the test or the code under test can't accidentally change those dependencies. But developers can modify the test environment through annotations. The following are the typical annotations:

@IncludeModules for Configuration Testing: With this annotation, you can specify a list of directories from which the test should load additional configuration. In this way, you can configure the test environment (registering additional plugins, registering additional SOAP interfaces, extending the basic data model, adding additional web pages, etc.). This is great when you want to test different configuration cases and still keep the base configuration simple and fast.

@ChangesTime for Time-based Testing: Sometimes your test is date sensitive. With this annotation, you get a hook to change the system date on the fly before you create the data you want, so that the timestamp meets your condition.

@ChangesSchema for Upgrade Testing: With this annotation, your test can run wild and wreak havoc on the database schema. At the end of your test, the schema is restored from the backup automatically. This is very useful for upgrader-related tests.
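
To show why a hook like @ChangesTime is useful, here is a sketch of date-sensitive code tested against an adjustable clock. It uses the standard java.time.Clock rather than anything from the Guidewire framework; AdjustableClock and InsurancePolicy are hypothetical names, and swapping in an injectable clock is my substitute for the annotation's hook, not a description of how it is implemented.

```java
import java.time.Clock;
import java.time.Duration;
import java.time.Instant;
import java.time.ZoneId;

// Hypothetical test clock whose current time can be moved on the fly.
class AdjustableClock extends Clock {
    private Instant now;
    private final ZoneId zone;

    AdjustableClock(Instant start, ZoneId zone) {
        this.now = start;
        this.zone = zone;
    }

    void setTo(Instant instant) {
        this.now = instant;
    }

    @Override
    public ZoneId getZone() {
        return zone;
    }

    @Override
    public Clock withZone(ZoneId newZone) {
        return new AdjustableClock(now, newZone);
    }

    @Override
    public Instant instant() {
        return now;
    }
}

// Hypothetical date-sensitive domain object.
class InsurancePolicy {
    final Instant createdAt;

    InsurancePolicy(Clock clock) {
        this.createdAt = clock.instant();   // timestamp taken from the clock
    }

    boolean isOlderThanDays(long days, Clock clock) {
        return createdAt.plus(Duration.ofDays(days)).isBefore(clock.instant());
    }
}
```

A test can create the policy at one date, move the clock forward, and then check the age-dependent behaviour without waiting or touching the real system time.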

Testing Annotations

These are the additional annotations telling the test framework how you want your test to run:

@ProductUnderTest: You can write a test, put it in a common module, and tell the test framework which product you want the test to run against. For example, we need to make sure that the base data model passes validation for all applications. We can write a test that starts the validation without depending on which product it is. With this annotation, the same test can be run with the data model from each product. Think dependency injection in production is a good way to go? Why not apply it to tests?

@TestInDatabase: From time to time, you have to implement something that is a little different for different databases, or a feature that is only applicable to one database (the Oracle AWR report, for example). With this annotation, you can tell the test framework which database this test should be run against. By default, all tests run in the H2 database only, for performance reasons.

@DoNotRunInHarness: This is for push-button tests that cannot be run automatically. For example, we have a test that pings the MapPoint web services and makes sure that we can parse the result properly. MapPoint ended up telling us not to ping their staging server continuously, so this test is disabled on the testing server.

Testing Semantics

There are also other productivity improvements. Your test case can now implement beforeClass(), afterClass(), beforeMethod() and afterMethod(), which are run, well, as the names indicate. After answering enough questions about when setUp() and tearDown() are run, I think it is a nice change.

Because JUnit holds on to ALL the test instances, each field in a test class is actually a memory leak as far as the test run is concerned. The test framework automatically nulls out all the fields (with some configurable exceptions) at the end of the test case, when all the test methods are done.
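
Clearing the fields of a finished test instance can be done with plain reflection. The sketch below is my own minimal version, not the framework's code: FieldCleaner and ExampleTestCase are hypothetical, and the choice to skip static, final, and primitive fields is an assumption (the configurable exceptions mentioned above are not modelled).

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;

// Hypothetical helper that releases everything a test instance references.
class FieldCleaner {
    static void nullOutFields(Object testInstance) {
        // Walk up the class hierarchy so inherited fields are cleared too.
        for (Class<?> type = testInstance.getClass();
             type != null && type != Object.class;
             type = type.getSuperclass()) {
            for (Field field : type.getDeclaredFields()) {
                int modifiers = field.getModifiers();
                // Primitives cannot be null; static and final fields are left alone.
                if (field.getType().isPrimitive()
                        || Modifier.isStatic(modifiers)
                        || Modifier.isFinal(modifiers)) {
                    continue;
                }
                try {
                    field.setAccessible(true);
                    field.set(testInstance, null);
                } catch (IllegalAccessException e) {
                    throw new IllegalStateException("Cannot clear " + field, e);
                }
            }
        }
    }
}

// Example test case holding on to fixtures after its methods are done.
class ExampleTestCase {
    Object bigFixture = new byte[1024];
    private String name = "fixture";
    int counter = 7;

    String name() {
        return name;
    }
}
```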

Other Considerations

This kind of test writing is also supported by our other development practices, namely ToolsHarness and Branching strategy, which I will cover in detail in later posts.

With your tests covering more code, the tests could very well break for the wrong reason. With ToolsHarness, we are able to examine each test failure easily and to locate and isolate the problems, so development won't grind to a halt every time there is a broken test. With the test farm provided by ToolsHarness, our tests can run concurrently, so we can better tolerate the speed of individual tests.

With the branching strategy, we are making sure that the platform code is in a good enough state before it is released to the application team.

Appendix: Things to watch out for

At the same time that we are creating a path to make tests easier to write, we are also putting ourselves on a slippery slope that could lead us further and further away from effective unit testing. Sometimes it is much easier to write a test that covers a lot of code than to set up the environment so that only the code you want to test is tested. Why is that bad? Here is an example:

As I am writing this post, I am wrapping up a feature called "Field Level Encryption" by adding upgrade support from an earlier version of the application. It was extremely tempting to do the following:

// the column is length 6 and nullable; alter it to length 3 and not nullable
String[] sqls = getDbCatalogSupport().alterColumn(table, column).withLength(3).withNullability(false).getSql();

// insert data that needs to be updated
DatabaseTestUtil.updateInTx("insert into px_test_encryption (id) values (1)");

// run the upgrader to make sure it does not fail
new Upgrader(database).upgrade();

// run the schema check to make sure everything is up-to-date
List error = new DatabaseSchemaVerifier(getDbCatalogSupport().buildSchema()).verifyAll();

// the null column should be updated with the encrypted default value
Object[] row = assertThat().sql("select encrypted_field from px_test_encryption where id = 1", new Class[] {String.class}).hasOneRow();
assertThat().array(row).is("tluafeddefault");

I am very sure that we can all agree that this is very concise and expressive. Change the database schema, insert data, run the upgrade, make sure that the schema is now up-to-date and that the row is updated correctly, just like it should be, right?

Not quite...

The problem with this test lies in the "upgrade()" and "verifyAll()" method calls. They are both very comprehensive and cover a lot of ground. As a result, this test runs for a long time (over a minute). At the same time, someone could check in code with a bug in either the upgrade code or the schema verification that has nothing to do with encryption, and this test would break. In an enterprise environment, you only need a small portion of tests like this to generate enough noise. Eventually developers will be so tired of spending time on a broken test only to find out that three other people are also looking at it and that it will be fixed by one of them. You will start delaying looking at broken tests and they will stay broken for a long time, other changes will be applied on top of the changes that broke the test, you will have a hard time fixing them, you will start to hate tests, you will write fewer of them, the quality of the product will go down...

So, for the sake of everybody, let's spend more time making the test as fine-grained as possible:

// the column is length 6 and nullable; alter it to length 3 and not nullable
String[] sqls = getDbCatalogSupport().alterColumn(table, column).withLength(3).withNullability(false).getSql();

// insert data that needs to be updated
DatabaseTestUtil.updateInTx("insert into px_test_encryption (id) values (1)");

// run only the encryption part of the upgrader to make sure it does not fail
new Upgrader(database).encryptDecrptUpgrade();

... (Some other code to verify just this schema)...

// the null column should be updated with the encrypted default value
Object[] row = assertThat().sql("select encrypted_field from px_test_encryption where id = 1", new Class[] {String.class}).hasOneRow();
assertThat().array(row).is("tluafeddefault");

However, this is not to say that a test like the original one does not provide some value by being comprehensive. We do have upgrade tests like this for specific kinds of upgrades, the ones that our customers are going to go through. Those tests load the database from a backup so that the schema matches the ones that we released to our customers, then run the upgrader on it and verify that the schema is up-to-date. Each test also has an opportunity to insert additional data before the upgrade and do additional verification after the upgrade. In this way, when our customers get our newer build, they can rest assured that the upgrade will not blow up horribly.