Friday, April 24, 2009

If Hibernate is so great for developers, why does it make my unit tests run 400x slower?

I am insanely frustrated by how sluggish Hibernate startup time is during unit tests. I have explored ways to optimize this, but I believe Hibernate has made a fundamental mistake with boot-time loading. It would be far, far better to do these sorts of checks at compile time instead of at load time. It would be so much better to do it at compile time that I consider it a moronic blunder to do it at load time.

Here are some of the things I've tried to do to speed up Hibernate load times, both for unit tests and for deploying to our dev/test tomcats.

  • 1. Try -Dhibernate.use_lazy_proxy_factory=true

    Nope, that doesn't do a thing. ~19s startup for my piddly unit test that needs to grab one row from one table and parse it. Without Hibernate, it would take about 3.6 microseconds.

  • 2. Make your own custom persistence.xml

    Nope, this doesn't work because we have a highly interconnected schema with over 700 tables. Some folks will point out that my organization must be brain-dead for having that many tables in the same schema, but our application is all about interrogating data. Besides, the schema evolved over ten years, starting back when Java was in its infancy. So don't tell me that the solution to long startup overhead in a unit test is to refactor a multi-terabyte schema.

    This interconnectedness, even on a smaller schema, causes problems. Hibernate isn't a compiler; it isn't doing static analysis to determine which Entity classes to load. It does a brute-force, load-everything-at-boot-time pass, regardless of whether the JVM will actually ever need the classes in question. Our home-grown RO tools did this once...at compile time. Ironically, developers complained about this. I thought it was great; I love static analysis. Hibernate doesn't do this once at compile time; it does it hundreds of times per day, each time a developer runs a unit test. By my calculation, if you run about 50 unit tests a day during your personal development work, the per-developer Hibernate tax eats at least 25 minutes (50 starts at 19-plus seconds apiece is already roughly 16 minutes, and plenty of runs are slower than that). It's not just 25 minutes, though, since each occurrence is a distraction, and distractions are a waste-multiplier (like the military's "force multiplier", only for wasting time). For what the trimmed-down configuration in option 2 looks like when it does work, see the sketch at the end of this list.

  • 3. Just write your own weird class loader.

    Seriously? If Hibernate is a tool whose goal is dumbing-down SQL and relational databases, do you really think that your average Hibernate user is going to be able to navigate a custom class loader?

  • 4. Re-architect your code so that all database access goes through a separately deployed, long-lived service layer like a web service or RMI.

    Everything that makes "option" 2 above impossible also makes this advice worthless. When you have a large (table-wise), highly interconnected schema, you can't just start shuffling logical subgroups of tables into separate schemas without spending a few years refactoring everything to go through those service layers.


    TL;DR: Please, Hibernate, I beg you: move load-time checks into compile-time checks, or just optimize the bejeezus out of whatever is going on during load time so that it takes less than 2 seconds.
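
    Referring back to option 2: for schemas that aren't as tangled as ours, the trimmed-down approach is worth knowing about. Here's a hedged sketch using Hibernate 3.x's programmatic configuration; the entity, dialect, and connection settings are hypothetical placeholders, not our real setup.

    import javax.persistence.Entity;
    import javax.persistence.Id;

    import org.hibernate.SessionFactory;
    import org.hibernate.cfg.AnnotationConfiguration;

    public class TinyTestSessionFactory {

        // Hypothetical entity standing in for the one table the unit test actually reads.
        @Entity
        public static class SampleTube {
            @Id
            private Long id;
            private String barcode;
        }

        // Register only the entities the test needs instead of letting Hibernate
        // map all 700 tables at boot time.
        public static SessionFactory build() {
            return new AnnotationConfiguration()
                    .addAnnotatedClass(SampleTube.class)
                    .setProperty("hibernate.dialect", "org.hibernate.dialect.HSQLDialect")
                    .setProperty("hibernate.connection.driver_class", "org.hsqldb.jdbcDriver")
                    .setProperty("hibernate.connection.url", "jdbc:hsqldb:mem:tinytest")
                    .setProperty("hibernate.connection.username", "sa")
                    .setProperty("hibernate.connection.password", "")
                    .buildSessionFactory();
        }
    }

    Of course, in a schema as interconnected as ours, the entity's associations drag in most of the rest of the mappings anyway, which is exactly why option 2 falls apart for us.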

    Thursday, April 23, 2009

    Exception-Driven Development and Breadth-First Coding

    We've all heard about test-driven development: write your tests first, then write the code. This forces you to think about what you're going to do before you do it, which is always a good thing to do. I don't take it as dogma, but it's a helpful way to think about programming.

    Here's another one: exception-driven development. When you sit down to write a method, think about how it could explode and write some scaffold-like exception handling. If you do this right, you should have lots of red junk show up in your IDE. The idea isn't to write all the error-detection code first; just write a scaffold.

    For example, in my current project I'm talking TCP to a flatbed rack scanner that scans racks full of 2D-barcoded tubes--thanks, Ziath! It should be impossible for the same 2D-barcoded tube to appear in more than one slot (well) in the rack, so I want to be sure to detect this because it'll have horrible downstream consequences. So I wrote some stub code like so:


    if (haveSeenTubeBarcodeAlready(tubeBarcode)) {
        throw new ScanningException("Barcode " + tubeBarcode + " appears in multiple wells in rack " + rackBarcode);
    }


    Notice that it won't compile. That's sort of the point. It's half baked. I've found that always having code that compiles isn't necessarily the best way to work. I tend to write code by stubbing out uncompilable chunks of 10-20 lines like this, then work on implementing a few lines here and a few lines there, bouncing back and forth between a portion of this method, a little bit of that exception-handling code, a smidgen of some loop. It's a breadth-first approach to coding.
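
    For illustration, here's a guess at what the filled-in version might eventually look like--the names mirror the stub above, but the Set-based duplicate check is just a sketch, not the real scanner code:

    import java.util.HashSet;
    import java.util.Set;

    public class RackScanValidator {

        // Mirrors the ScanningException referenced in the stub above.
        public static class ScanningException extends Exception {
            public ScanningException(String message) {
                super(message);
            }
        }

        private final Set<String> seenTubeBarcodes = new HashSet<String>();

        // Records a tube barcode for the given rack, blowing up if the same
        // tube shows up in more than one well.
        public void recordTube(String tubeBarcode, String rackBarcode) throws ScanningException {
            if (haveSeenTubeBarcodeAlready(tubeBarcode)) {
                throw new ScanningException("Barcode " + tubeBarcode
                        + " appears in multiple wells in rack " + rackBarcode);
            }
            seenTubeBarcodes.add(tubeBarcode);
        }

        private boolean haveSeenTubeBarcodeAlready(String tubeBarcode) {
            return seenTubeBarcodes.contains(tubeBarcode);
        }
    }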

    The depth-first approach to coding, where you focus intently on implementing a larger chunk of code (like an entire 50-line method), often results in a lot of wasted time because you quickly get mired in the details. For example, you might spend three hours writing a nice, tight method with lovely unit tests, only to realize that you actually needed to pass in two other parameters to consult in some conditional logic.

    This approach doesn't work well for everything, which I think is why it's so counterintuitive. If you painted your living room by rolling a roller on one portion of a wall, then stroked a few brush strokes on another wall, then did some sanding on a different wall, only to then put some paint tape and plastic down on the floor and paint three inches of trim, it would be a disaster.

    If this doesn't make sense, you can just write //todo's all over the method about how you should deal with error handling. The basic idea is to think about error handling first, and then don't forget about it when you actually write the implementation.

    Tuesday, April 21, 2009

    Conditional breakpoints are my new best friends.

    IntelliJ IDEA has a wonderful feature I discovered the other day: conditional breakpoints. Because the applications I write tend to operate on large amounts of data, the bugs I tend to see are the sort that show up once every thousand or so iterations through some loop. Often the defect is the result of not totally bomb-proofing the application to deal with dirty data. We tend to stream data around, so we don't get nice Lists--often we just get Iterators or Iterables, and for performance reasons we often don't impose any reproducible order on these collections (so we tend not to use Comparators, or my favorite interview question whose answer is LinkedHashMap).

    The result is that you can't just pause at position 3721 in the List. You have to wait for the chunk of data that has exactly the structure you're looking for. Enter conditional breakpoints: right-click the breakpoint, go to properties, click on "Condition". What's really great is that the expression box is itself a fully featured mini-IDE, so you get autocomplete goodness. Sweet!
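
    A hypothetical example of the kind of loop I mean--the condition box accepts any boolean Java expression over the variables in scope at the breakpoint:

    import java.util.Arrays;
    import java.util.Iterator;

    public class ConditionalBreakpointDemo {

        public static void main(String[] args) {
            Iterator<String> barcodes = Arrays.asList("AB0001", "AB0002", "XY9999").iterator();
            int index = 0;
            while (barcodes.hasNext()) {
                String barcode = barcodes.next();   // <-- set the breakpoint on this line
                // In the breakpoint's "Condition" box, type something like:
                //     barcode.startsWith("XY") || index > 1000
                // and the debugger only suspends when that expression is true.
                process(barcode, index++);
            }
        }

        private static void process(String barcode, int index) {
            System.out.println(index + ": " + barcode);
        }
    }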

    Sunday, April 19, 2009

    The No-Argh! Constructor and Dependency Injection

    I've done a few code reviews lately and been really puzzled by the proliferation of no-arg constructors, or as I call them, no-argh! constructors. The developers who make these objects tend to justify them in one of two ways:

    1. Total ignorance
    2. "The framework made me do it."

    Ignorance, as my septuagenarian neighbor told me in the yard the other day, can be fixed with education. So here's the education I give people who just seem ignorant of the value of constructors:

    1. Constructors clearly communicate what your object needs to operate. Even without documentation, auto-complete in your IDE will tell you what the object needs in order to function.
    2. In the current environment of fear-of-multicore-and-latent-concurrency-bugs, making a nice constructor lets you mark fields as final and encourages object immutability, thus rendering your code threadsafe without your having to think at all about complicated synchronization policies.

    I usually then go through the exercise of newing-up one of their no-argh! constructors and asking "Now how do I know which fields to set?" Often, their classes aren't missing just constructors--they're missing documentation of all sorts, so it's totally unclear which fields I should set (which relates to my other post about thinking twice about autogenerating setters and getters for everything).
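
    To make the exercise concrete, here's a hypothetical before-and-after (the class and field names are made up). Only the second version tells the caller, right in the IDE's auto-complete, exactly what the object needs, and only the second version gets to mark its fields final:

    import javax.sql.DataSource;

    // No-argh! style: nothing says which setters must be called before use,
    // and nothing stops another thread from seeing a half-initialized object.
    class ReportGeneratorNoArgh {
        private DataSource dataSource;
        private String reportTitle;

        public ReportGeneratorNoArgh() {
        }

        public void setDataSource(DataSource dataSource) {
            this.dataSource = dataSource;
        }

        public void setReportTitle(String reportTitle) {
            this.reportTitle = reportTitle;
        }
    }

    // Constructor style: the dependencies are spelled out, the fields are final,
    // and the object is immutable once it's built.
    class ReportGenerator {
        private final DataSource dataSource;
        private final String reportTitle;

        public ReportGenerator(DataSource dataSource, String reportTitle) {
            this.dataSource = dataSource;
            this.reportTitle = reportTitle;
        }
    }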

    Very often at this point the light bulb pops up and the developer gets it, at which point I congratulate myself for improving all future code the developer makes and justifying my outrageous salary.

    But what about the folks who use the framework as the scapegoat for their no-argh! constructors? Thou shalt use constructor injection. Any framework without constructor injection is at this point worthless. Even Spring now has constructor injection. I absolutely refused to use Spring until it had constructor injection. I considered it a fatal flaw with Spring, and I'm still highly suspicious of Spring in general because of this blunder, but I'll admit that Spring does have some very nice features.
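
    For the framework crowd, a rough sketch of what constructor injection looks like with Spring's annotation support (the class here is hypothetical); the same wiring can be done in XML with <constructor-arg> elements if annotations aren't your thing:

    import javax.sql.DataSource;

    import org.springframework.beans.factory.annotation.Autowired;

    // Spring 2.5+ can inject straight through the constructor, so the field
    // stays final and there's no no-argh! constructor in sight.
    public class TubeLookupDao {

        private final DataSource dataSource;

        @Autowired
        public TubeLookupDao(DataSource dataSource) {
            this.dataSource = dataSource;
        }
    }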

    Friday, April 17, 2009

    Deploying JPA dependent code into Spring CMT and non-CMT applications makes a big mess.

    I use the Spring framework because my organization insists that I do. I try to not waste too much time complaining about it; the decision was made and I want to be a good corporate citizen. Yesterday I met this Exception:

    Not allowed to create transaction on shared EntityManager - use Spring transactions or EJB CMT instead.

    What got me here? I wrote a method that calls a "native" SQL query (BTW, I am offended that JPA decided to stick "native" in front of anything involving programmer-defined SQL. "Native" is C/C++, or any language that's close to the metal. SQL defines data relationships, and is probably farther from the metal in the database than Java is from a register). Prior to calling the query, I'd like to flush() the EntityManager so that a client doesn't accidentally leave some data in the EntityManager and then get confused by the fact that the method I'm writing doesn't pick up the data they thought they'd written to the database. Basically I want to save them from themselves with a simple call to flush(). I could just mention this in the javadocs, but calling flush() is so simple I figured I'd do it in my method.

    Works great in my sandbox; explodes for another developer with the exception above. What gives?

    In my sandbox, I'm newing things up on my own, explicitly creating a new EntityManager and an instance of my class. In the other guy's environment, he's injecting the stuff through Spring. So I have to make my class Spring-compliant and not Spring-compliant at the same time, because one client uses Spring and the other doesn't. So now I need to know the intricacies of the framework into which my code will be deployed, while making sure that it works in the absence of the framework. But the raison d'être of the framework in question (Spring) is to save me from knowing all the details of transactions. Instead, I have to know all the details of Spring. It's not really six of one, half a dozen of the other--Spring is supposed to make my life easier, but it doesn't. It just forces me to let go of some details with which I am very familiar, comfortable, and adept, and pick up a bunch of new abstractions that so far I find pointlessly cumbersome. If I were a junior programmer with no working knowledge of JDBC, I'd be utterly clueless about transactions, since they're all hidden by CMT. I find this a disturbing trend, because programmers who work with databases and don't know anything about transactions are dangerous.
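
    A hedged sketch of one way to end up with that exact exception (the class, unit, and table names are hypothetical): a method that manages its own transaction around the flush() works fine against an EntityManager it was handed directly, but the first getTransaction() call dies against Spring's shared EntityManager proxy.

    import javax.persistence.EntityManager;

    public class NativeQueryHelper {

        private final EntityManager entityManager;

        public NativeQueryHelper(EntityManager entityManager) {
            this.entityManager = entityManager;
        }

        public Number countTubes() {
            // Fine when the caller built the EntityManager itself via
            // Persistence.createEntityManagerFactory(...). When Spring injects a
            // shared EntityManager, this line throws: "Not allowed to create
            // transaction on shared EntityManager - use Spring transactions or
            // EJB CMT instead."
            entityManager.getTransaction().begin();
            entityManager.flush();   // make pending writes visible to the native query
            Number count = (Number) entityManager
                    .createNativeQuery("select count(*) from sample_tube")
                    .getSingleResult();
            entityManager.getTransaction().commit();
            return count;
        }
    }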

    At least the JPA docs offer a clue here, stating that getTransaction() can throw IllegalStateException if called on a JTA EntityManager. So it's at least clear that you have to worry about the framework in which your code is going to run.

    Clearly you have to test your code in each framework into which it will be deployed. If you were just using one framework (or no framework), you could rest assured via static analysis and documentation that things were going to work one way, all the time. Using a mixture of frameworks (or a mixture of frameworks and no frameworks, which we do) takes that away from you and requires that you write explicit tests per-framework, which often requires obtuse XML configuration (hello applicationContext.xml and copy-pasted applicationContext-test.xml).

    In short, you give up static analysis (something that serious CS people are really, really good at) and instead spend your time:
    1. Writing an increased volume of tedious unit tests and configuration files, often feeling like your unit tests are really testing your XML configuration. The XML configuration is so complex that it really does need its own tests, so you can't skimp on this and it becomes part of your code "overhead" (tax).
    2. Wrapping your head around declarative transactions

    All for the alleged benefit of CMT (which seems mostly just to avoid the minor nuisances of passing around Connections and being careful about calling close() in finally{} blocks).

    Tuesday, April 14, 2009

    Declarative transactions replace old complexity with newer, more awful complexity.

    JDBC is a bit hard to use in the raw. Dealing with cursor leaks and making sure that you've closed things in finally{} clauses is tedious. But it's well understood. I don't mind dealing with it. In fact, I think application programmers really need to understand transactions in order to code the application properly. That's not to say that I want to force JDBC down everyone's throat. I just want application programmers to appreciate that talking to a relational database does require some understanding of what a relational database does.
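
    For reference, this is the well-understood-but-tedious shape I'm talking about (the table name is hypothetical); verbose, but every transaction boundary and every close() is right there in front of you:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    import javax.sql.DataSource;

    public class PlainJdbcCounter {

        public static int countTubes(DataSource dataSource) throws SQLException {
            Connection connection = dataSource.getConnection();
            try {
                connection.setAutoCommit(false);
                PreparedStatement statement =
                        connection.prepareStatement("select count(*) from sample_tube");
                try {
                    ResultSet resultSet = statement.executeQuery();
                    try {
                        resultSet.next();
                        int count = resultSet.getInt(1);
                        connection.commit();
                        return count;
                    } finally {
                        resultSet.close();
                    }
                } finally {
                    statement.close();
                }
            } finally {
                connection.close();
            }
        }
    }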

    Declarative transactions were supposed to make transaction management simple. They don't. They just make it confusing in new ways that involve annotations, XML, third-party frameworks, and good god, AOP. All of this just to avoid a few finally{} clauses and a few calls to close()? Take a look at this 15-page article from developerWorks. Fifteen pages, just for some common transaction pitfalls. Something smells funny.
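
    Here's the declarative flavor of the same kind of data access (names hypothetical, and just a sketch): shorter at the call site, but the transaction boundary now lives in an annotation, an XML switch like <tx:annotation-driven/> plus a transaction manager bean in applicationContext.xml, and an AOP proxy you never see. Forget the XML half and the annotation quietly does nothing.

    import javax.persistence.EntityManager;
    import javax.persistence.PersistenceContext;

    import org.springframework.transaction.annotation.Transactional;

    public class DeclarativeTubeDao {

        @PersistenceContext
        private EntityManager entityManager;

        // The transaction around this method is created by a Spring AOP proxy,
        // configured elsewhere in XML--nothing in this file shows you where it
        // begins or commits.
        @Transactional
        public Number countTubes() {
            return (Number) entityManager
                    .createNativeQuery("select count(*) from sample_tube")
                    .getSingleResult();
        }
    }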

    TL;DR: Transaction management is still complicated. I liked it better when I could deal with it in plain Java instead of having to learn Spring, AOP, and annotations.

    Thursday, April 9, 2009

    The Human Genome Project is history. Now what?

    I've had the luxury of working at my current job for nearly 10 years. When I graduated from school with a CS degree in 1999, people were getting paid $60k to write HTML, the only qualification being that you had a pulse. At one interview at the time, I mentioned this "new" thing called XML (which actually was relatively new in 1999), and the CIO of one company said "Oh yeah, I need to look into that." I didn't end up working for that company.

    Anyhoo, I decided to work for this great place that was ramping up its work on the Human Genome Project. For the next few years, at cocktail parties I could actually attract a crowd by telling people I worked on the Human Genome Project. Ah, those were the days. When I catch up with friends nowadays, they invariably ask me what has become of the HGP. What the heck am I getting paid to work on, now that the HGP has been done for the better part of a decade?

    The technology for doing genome sequencing has evolved by leaps and bounds. We're not quite at GATTACA yet, but I no longer consider the rapid DNA sequencing in the movie pure science fiction. So where are we now with DNA sequencing? I'll answer by way of analogy.

    Imagine the year is 1979, 10 years after the Apollo landing. The space shuttle hasn't even lifted off for the first time. NASA's technology has found its way into all kinds of interesting practical uses, but people aren't zooming around the globe much faster. Spaceplanes still haven't shown up. Rapid sub-orbital intercontinental travel still isn't available today, 40 years later.

    If space travel had progressed with the speed of DNA sequencing and genome biology in general, by 1979 the US alone would have been launching spaceplanes about twice a day. Other countries would have built similar facilities. Suborbital flights would have been fairly routine, but not quite yet widespread. Still expensive, but in the $10k range. This would have profoundly changed the world in numerous ways. Just think of how it would change the shipping industry, and how this would ripple through other sectors of the economy to affect daily life. At this pace, by about 1990 (which in my comparison puts us at about 2020), people would be zooming around in spaceplane buses (I doubt folks will ever have their own space plane, for a variety of technical and safety reasons). What's the genome biology equivalent here? Right now there are ways to sequence a single human genome in about a week for under $20k. The price continues to plummet, by the way. Illumina, Roche, and SOLiD are some of the more popular technologies. Pacific Biosciences and Complete Genomics are two other companies currently developing the next-next generation of sequencing systems.

    But so what? Why should we want to sequence a genome for $20k? Many of us don't. We want to sequence fractions of genomes from thousands of patients. The price per genome is an interesting point of comparison, but it misleads people into thinking that what we want to do is sequence all parts of a genome. Sometimes that's useful, and sometimes it's an incredible waste. Imagine if UPS only sold shipping services in 747-sized batches. You couldn't send a small box without purchasing an entire 747 flight. Focusing on the cost-per-genome number misses the fact that there are thousands of researchers and clinicians out there who could greatly improve your health by sequencing just a tiny fraction of your genome, if only they could order a few kilobases--or even hundreds of bases--cheaply, quickly, and accurately. Many of us are focused on inventing these massively parallel, multiplexed sequencing systems.

    So why is this a good thing to throw money at?

    Because doing so will bring about incredible new ways to accurately diagnose (and eventually treat) the myriad forms of cancer. Because doing so will bring about tremendous advances in the way we treat and diagnose terrible infectious diseases and drug-resistant strains of all sorts of nasty bugs. What's the timeline? It's hard to say, but I'm banking on people two or three generations from now looking back at how we treat cancer and infectious disease today and being as repulsed as we are by Civil War-era "surgery".

    To use another analogy, if car mechanics worked the way our health providers worked, they would replace the engine every time you needed your oil changed because they couldn't tell that the problem was your oil. In the rare circumstance that the mechanic could deduce that the oil was the problem, he would replace it with olive oil because he couldn't tell what kind of oil your particular kind of car needs. This isn't a diss on health care providers. They're doing the best they can with what limited knowledge they can glean from current technology. Doing all of this sequencing--along with all the other painstaking population genetics and clinical work--will bring about incredible health benefits.

    But what in particular are they paying me for? To help design and build the software systems that keep track of the many terabytes of sequence data we produce each day.