How I may spend my entire weekend

The Census Bureau and the National Archives have released the entire 1940 enumeration quasi-digitally. I think the data drop is great. I'm going to download a few specific documents based on what I know about my own family, and about some of the places I've lived that existed in April 1940.

But as a software developer who works mainly on cloud-based, data-heavy apps, I'm puzzled by some of the National Archives' choices.

I say "quasi-digitally" because the National Archives didn't enter all the tabulated data per se; instead they scanned all the documents and put them out as massive JPEG images. I'm now downloading the data for one census tract, and the 29 MB ZIP file is taking forever to finish. The actual data I'm looking would take maybe 1-2 kB. That said, I understand it's a massive undertaking. There are hundreds of thousands of pages; obviously entering all the data would cost too much.

But this goes to the deeper problem: the Archives knew, or should have known, that they'd get millions of page views and thousands of download requests. So I have to ask: why did they make the following boneheaded technical decisions?

  • They used classic ASP, an obsolete technology I haven't even used since 2001. The current Microsoft offering, ASP.NET MVC 3, is to classic ASP what a Boeing 787 is to a DC-3. Running classic ASP today is like producing an illuminated manuscript in the era of steam-driven presses.
  • They organized the data by state and city, which makes sense until you get to something the size of Chicago. Northfield Township, where I grew up, takes up one map and about 125 individual documents. Chicago has over 100 maps, which you have to page through in order from map #1, plus a ridiculous number of individual documents. You can search for the census tract you want by cross streets, but there's no visible way to search for the part of the city map you want.
  • I'm still waiting for my 32-page document after 22 minutes. Clearly the Archives don't have the bandwidth to handle the demand. Is this a budget issue? Perhaps Microsoft or Google could help by donating some capacity until the rush is over?

In any event, once I get my documents, I'm going to spend some time going over them. I really want to find out what kind of people lived in my current apartment 70 years ago.

Update: The first download failed at 1.9 MB. The second attempt is at 6.6 MB...and slowing down...

Update: The second and third attempts failed as well. I have, however, discovered that they've at least put the data out on Amazon Web Services. So...why are the downloads pooping out?
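
If the connection keeps dying partway through, one workaround is to resume a failed download instead of starting over; Amazon S3 honors HTTP Range requests. Here's a minimal sketch in Python, with a placeholder URL since I'm not pointing at the Archives' actual endpoint:

    # Minimal resume-on-failure downloader. Assumes the server honors HTTP
    # Range requests, which Amazon S3 does. The URL below is a placeholder,
    # not the National Archives' real endpoint.
    import os
    import requests

    def resume_download(url, dest, chunk_size=1 << 16):
        # Pick up from however many bytes already made it to disk.
        start = os.path.getsize(dest) if os.path.exists(dest) else 0
        headers = {"Range": f"bytes={start}-"} if start else {}
        with requests.get(url, headers=headers, stream=True, timeout=60) as resp:
            resp.raise_for_status()
            mode = "ab" if resp.status_code == 206 else "wb"  # 206 = partial content
            with open(dest, mode) as f:
                for chunk in resp.iter_content(chunk_size):
                    f.write(chunk)

    resume_download("https://example.com/1940-census/tract.zip", "tract.zip")

No guarantee this helps if the bottleneck is on their end, but at least a failure at 6.6 MB wouldn't mean starting over from zero.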
