The Daily Parker

Politics, Weather, Photography, and the Dog

Archive import finished; final archive import starting today

On Friday, I used Arithmetic™ to predict that the 162-million-row weather data transfer from Weather Now v3 to v5 would end around 7pm last night. Let's check the logs:

2022-04-04 18:48:30.7196|INFO|Clearing v3 archival records for ZYTX
2022-04-04 18:49:27.7471|INFO|Moved 157,408,921 weather archives from v3 to v5
2022-04-04 18:49:27.7471|INFO|Finished importing; duration 4.04:14:55.0952715 

Nice prediction. (It logged 157 million rows because I made a performance tweak and re-started the app after 5 million.)

As I've mentioned, those 162 million rows only go back to September 2009. But v3 launched in January 2007. And it turns out I have a second archive, also containing about 25 GB of data, going back to August 2006. I had to think back to decisions I never documented to piece together why.

In August 2006, I had a single big machine that served as my domain controller, Web server, and Exchange endpoint, and a second big database server. In this photo from 31 July 2006, the Exchange/Web/DC server is the white one on the right ("DOPPELKUH") and the SQL server ("BULLE") is the big black one on the left:

Now, BULLE was a huge machine for the time, with about 200 GB of disk space and (I think) 16 GB of memory. That 200 GB expanse tempted me to turn off a data-purge feature from more parsimonious days, and apparently I deployed that change around 9am on 11 August 2006. (Sadly, the purge feature worked as designed, and I have no archival data before then.)

In October 2006, I finally bought a server rack and moved the database to an even bigger set of disks:

Everything ticked along until around the time I deployed the Weather Now v3.5 refresh and discovered that the un-purged data file had grown to 25 GB. So on 3 September 2009, I simply created a new database and changed the data connection strings to point to it. That new database kept growing until I switched the archival data store to the Cloud in 2013.

And now, I get to take advantage of the triviality of that change by making an equally trivial change to the import controller's data connection string. The 2006 data file has 118,774,028 rows covering 3,988 stations from 11 August 2006 to 3 September 2009. At 435 rows per second, the final archival import should finish around...let's see...1 pm on Friday. At that point, every single byte of data Weather Now has collected in the past 15 years will be available through the App for you to see.

More from the archives:

Mondays are still long

I realize posting has slipped a little in the past couple of weeks. It should resume its normal frequency tomorrow, as I actually have five consecutive weeks of a routine schedule coming up.

That routine includes rehearsals on Mondays, though, so nothing new today.

Quick update

The Apollo Chorus performed last night at the Big Foot Arts Festival in Walworth, Wis., so I haven't done a lot of useful things today. I did take a peek at the other weather archive I have lying around, and discovered (a) it has the same schema as the one I'm currently importing into Weather Now 5, and (b) it only goes back to August 2006.

Somewhere I have older archives that I need to find... But if not, NOAA might have some.

The update rolls on...

As of 17:16 CDT, the massive Weather Now v3 to v5 import had 115,441,906 records left to transfer. At 14:28 CDT yesterday, it was at 157,409,431, giving us a rate of ( 41,967,525 / 96,480 seconds = ) 435 records per second. A little more math gives us another 265,392 seconds to go, or 3 days, 1 hour, 43 minutes left.

So, OK then, what's the over-under on this thing finishing before 7pm Monday?

It's just finished station KCKV (Outlaw Field, Clarksville, Tenn.), with another 2,770 stations left to go. Because it's going in alphabetical order, this means it's finished all of the Pacific Islands (A stations), Northern Europe (B and E), Canada (C), Africa (D, F, G, H). There are no I or J stations (at least not on Earth). K is by far the largest swath as it encompasses all of the continental US, which has more airports than any other land mass.

Once it finishes the continental US, it'll have only 38 million left to do! Whee!

Contradictory transit incentives

Two stories this morning seemed oddly juxtaposed. In good news, the City of Chicago announced plans to spend $15 million on 77 km of new bike and pedestrian trails over the next couple of years:

Several of the projects, including plans to convert an old railroad into a trail in Englewood, are still in the planning and design phases. Others, like Sterling Bay’s planned extension of the 606 Bloomingdale Trail into Lincoln Yards, are set to come to fruition through private partnerships. 

The news release lists 12 projects, including several that had been previously announced, that are set to be funded with a $15 million “commitment to jumpstart” the “key projects citywide.” The $15 million comes from a combination of “federal, state and local sources, including general obligation bonds, Tax Increment Financing, and Open Space Impact Fees,” according to a spokesperson for the city’s planning department.

The City also plans to give away 5,000 bicycles to encourage people to get out of their cars. But at the same time, the City announced it would give away 50,000 gas cards to encourage people not to get out of their cars:

The plan, which comes weeks after possible 2023 mayoral candidate Willie Wilson spearheaded several rounds of free gas giveaways, includes gas cards worth $150 each for as many as 50,000 drivers, and transit cards worth $50 each for as many as 100,000 riders. Wilson on Thursday blasted the mayor’s plan and called it a “political stunt.”

Three-quarters of the transit cards would be prioritized for residents in low-income neighborhoods who use the CTA often. The remainder would be distributed throughout the city.

“It will benefit CTA riders across the city, but especially on the South and West sides,” CTA President Dorval Carter said. “Areas that saw the lowest ridership declines during the pandemic, areas where public transit is the best and sometimes the only option.”

Let's pause for just a moment to give political-stunt-incarnate Willie Wilson a golf clap for calling anything a "political stunt."

I get fretting about gasoline prices if you do what you can to save gas and need your car to survive. But on my trip last week, I got passed by idiots in two-ton SUVs who no doubt complain it costs them $100 to fill their tank.

My little Prius got to and from Kentucky on less than $80 of gas, and even with that trip I've still gotten an average of 2.2 liters per 100 km (156 MPG) so far this year. In fact, the second-worst economy I've ever gotten for a tank of gas in this car was on the return trip from Berea, when I got 5.5 L/100 km (43 MPG) over 610 km (400 miles). Of course, since I got back I've averaged 2 L/100 km (140 MPG).

So maybe if people didn't burn as much gasoline, the city wouldn't feel like giving away gasoline was an option? Just a thought.

How long will this take?

It takes a while to transfer 162.4 million rows of data from a local SQL database to a remote Azure Tables collection. So far, after 4 hours and 20 minutes, I've transferred just over 4 million rows. That works out to about 260 rows per second, or 932,000 per hour. So, yes, the entire transfer will take 174 hours.

Good thing it can run in the background. Also, because it cycles through three distinct phases (disk-intensive data read, processor-intensive data transformation, and network-intensive data write), it doesn't really take a lot of effort for my computer to handle it. In fact, network writes take 75% of the cycle time for each batch of reports, because the Azure Tables API takes batches of 100 rows at a time.

Now, you might wonder why I don't just push the import code up to Azure and speed up the network writes a bit. Well, yes, I thought of that, but decided against the effort and cost. To do that, I would have to upload a SQL backup file to a SQL VM big enough to take a SQL Server instance. Any VM big enough to do that would cost about 67¢ per hour. So even if I cut the total time in half, it would still cost me $60 or so to do the transfer. That's an entire bottle of Bourbon's worth just to speed up something for a hobby project by a couple of days.

Speaking of cost, how much will all this data add to my Azure bill? Well, I estimate the entire archive of 2009-2022 data will come to about 50 gigabytes. The 2003-2009 data will probably add another 30. Azure Tables cost 6¢ per gigabyte per month for the geographically-redundant storage I use. I will leave the rest of the calculation as an exercise for the reader.

Update: I just made a minor change to the import code, and got a bit of a performance bump. We're now up to 381 rows per second, 46% faster than before, which means the upload should finish in only 114 hours or 4.7 days. All right, let's see if we're done early Monday morning after all! (It's already almost done with Canada, too!)

Strange data patterns

The data transfer from Weather Now v3 to v5 continues in the background. Before running it, I did a simple SQL query to find out how many readings each station reported between September 2009 and March 2013. The results surprised me a bit:

The v3 database recorded 162.4 million readings from 4,071 stations. Fully 75 of them only have one report, and digging in I can see that a lot of those don't have any data. Another 185 have fewer than 100, and a total of 573 have fewer than 10,000.

At the other end, Anderson AFB on Guam has 123,394 reports, Spangdahlem AB in Germany has 123,297, and Leeuwarden, Netherlands, has 119,533. In fact, seven Dutch weather stations account for 761,000 reports of the 162 million in the archive. I don't know why, but I'll find out at some point. (It looks like some of them have multiple weather recording devices with color designations. I'll do some more digging.)

How many should they have? Well, the archive contains 1,285 days of records. That's about 31,000 hourly reports or 93,000 20-minute updates—exactly where the chart plateaus. Chicago O'Hare, which reports hourly plus when the weather shifts significantly had 37,069 reports. Half Moon Bay, Calif., which just ticks along on autopilot without a human weather observer to trigger special reports, had 90,958. So the numbers check out pretty well. (The most prolific US-based station, whose 91,083 reports made the 10th most prolific in the world, was Union County Airport in Marysville, Ohio.)

Finally, I know that what the App has a lot of data sloppiness right now. After I transfer over these archives, I'll work on importing the FAA Airports database, which will fix the names and locations of most of the US stations.

Chugging along with data consolidation

Sunday night I finished moving all the Weather Now v4 data to v5. The v4 archives went back to March 2013, but the UI made that difficult to discover. I've also started moving v3 data, which would bring the archives back to September 2009. I think once I get that done then moving the v2 data (back to early 2003) will be as simple as connecting the 2009 import to the 2003 database. Then, someday, I'll import data from other sources, like NCEI (formerly NCDC) and the Met*, to really flesh out the archives.

One of the coolest parts of this is that you can get to every single archival report through a simple URL. For example, to see the weather in Chicago five years ago, simply go to https://wx-now.com/History/KORD/2017/03/30. From there, you can drill into each individual report (like the one from 6pm) or use the navigation buttons at the bottom to browse the data.

Meanwhile, work continues apace on importing geographic data. And I have discovered a couple of UI bugs, including a memory leak that caused the app to crash twice since launch. Oops.

* The Met has really cool archives, some of which go back to the 1850s.

Two more from Kentucky, and one from Chicago

I took Cassie for a 40-minute walk around Lexington's historic district on the way back from Berea:

The light really wasn't great, so I didn't take a lot of photos. Plus Cassie has a way of adding motion blur to every photo I shoot.

Two weeks ago I attended a conference by the Chicago River, which had dye left over from St Patrick's Day. Add a passing fire boat and it's Christmas in March: