I'm in the home stretch moving Weather Now to Azure. I've finished the data model, data retrieval code, integration with the existing UI, and the code that parses incoming weather data from NOAA, so now I'm working on inserting that data into the database.
To speed up development, improve the design, and generally make my life easier, I'm using Entity Framework 5.0 with database-first modeling. The problem that consumed me yesterday afternoon and on into this morning has been how to ramp up to realistic volumes of data.
The Worker Role that will go out to NOAA and put weather data where Weather Now can use it will receive somewhere around 60,000 weather reports every hour. Often, NOAA repeats reports; sometimes, NOAA sends truncated copies of reports; sometimes, NOAA sends garbled reports. The GetWeather application (soon to be an Azure worker task) has to handle all of that and still cope with bursts of up to 10,000 weather reports at once.
The WeatherStore class takes parsed METARs and stores them in the CurrentObservations, PastObservations, and ClimateObservations tables, as appropriate. As I've developed the class, I've written unit tests for each kind of thing it has to do: "Store single report," "Store many reports" (which tests batching them up and inserting them in smaller chunks), "Store duplicate reports," etc. Then yesterday afternoon I wrote an integration test called "Store real-life NOAA file" that took the 600 KB, 25,000-line, 6,077-METAR update NOAA published at 2013-01-01 00:00 UTC, and stuffed it in the database.
Sucker took 900 seconds, or 15 minutes. In real life, that would mean a complete collapse of the application, because new files come in about every 4 minutes and each one contains thousands of lines to parse.
This morning, I attached JetBrains dotTrace to the test (easy to do since JetBrains ReSharper was running the test), and discovered that 90% of the method's time was spent in (wait for it) DbContext.SaveChanges(). As I dug through the line-by-line tracing, it was obvious Entity Framework was the problem.
I'll save you the steps to figure it out, except to say Stack Overflow is the best thing to happen to software development since the keyboard.
Here's the solution:
using (var db = new AppDataContext())
{
    // Turn off automatic change detection for this context
    db.Configuration.AutoDetectChangesEnabled = false;
    // do interesting work
}
The result: The test duration went from 900 seconds to...15, a 60-fold speedup. And that is completely acceptable. Total time spent on this performance improvement: 1.25 hours.
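For anyone wanting to see how the setting might sit alongside the chunked inserts mentioned earlier, here is a minimal sketch. It is not the actual WeatherStore code: the Observation entity, the SketchDataContext class, and the InBatches helper are all hypothetical names made up for illustration, and the batch size is whatever you choose to pass in. As I understand it, the setting helps because, with automatic detection on, each Add() triggers a scan of everything the context is already tracking, so the cost grows quickly as the batch grows.

```csharp
using System;
using System.Collections.Generic;
using System.Data.Entity;

// Hypothetical entity and context, for illustration only. The real
// AppDataContext is generated from the database and is not shown here.
public class Observation
{
    public int Id { get; set; }
    public string StationId { get; set; }
    public DateTime ObservedAtUtc { get; set; }
    public string RawText { get; set; }
}

public class SketchDataContext : DbContext
{
    public DbSet<Observation> Observations { get; set; }
}

public static class BulkInsertSketch
{
    // Splits a sequence into consecutive chunks of at most batchSize items.
    public static IEnumerable<List<T>> InBatches<T>(IEnumerable<T> source, int batchSize)
    {
        if (batchSize <= 0) throw new ArgumentOutOfRangeException("batchSize");

        var batch = new List<T>(batchSize);
        foreach (var item in source)
        {
            batch.Add(item);
            if (batch.Count == batchSize)
            {
                yield return batch;
                batch = new List<T>(batchSize);
            }
        }
        if (batch.Count > 0) yield return batch;
    }

    // Inserts the observations one chunk at a time.
    public static void Store(IEnumerable<Observation> observations, int batchSize)
    {
        foreach (var batch in InBatches(observations, batchSize))
        {
            // A fresh context per chunk keeps the set of tracked entities small.
            using (var db = new SketchDataContext())
            {
                // Skip the automatic change scan on every Add().
                db.Configuration.AutoDetectChangesEnabled = false;

                foreach (var observation in batch)
                {
                    db.Observations.Add(observation);
                }

                // Entities passed to Add() are already marked as added,
                // so SaveChanges() still inserts them.
                db.SaveChanges();
            }
        }
    }
}
```

One caveat with this approach: with automatic detection off, edits to entities the context is already tracking are not picked up on their own, so code that modifies existing rows would need to call db.ChangeTracker.DetectChanges() before saving.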