The Daily Parker

Politics, Weather, Photography, and the Dog

Anthony's Song

I'm movin' out. A lovely young couple have offered to buy Inner Drive Technology World Headquarters v5.0, and the rest of the place along with it. I've already gotten through the attorney-review period for IDTWHQ v6.0, so this means I'm now more likely than not to move house next month.

Which means I have even less time to read stuff like this:

Finally, American Airlines plans to get rid of its First Class offerings, replacing them with high-tech Business Class and more premium coach seats. I'd better use my miles soon.

Notable Friday afternoon stories

Just a few before I take a brick to my laptop for taking a damned half-hour to reformat a JSON file:

Oh, good. My laptop has finished parsing the file. (In fairness it's 400,000 lines of JSON, but still, that's only 22 megabytes uncompressed.) I will now continue with my coding.

God save our gracious King

With the death of Queen Elizabeth II, the British National Anthem has changed for the third time in 185 years, reverting to "God Save the King". In other news:

By the way, the UK has a vacancy for the post of Prince of Wales, in case anyone would care to apply. I think we can bet on nepotism, though.

Lunchtime links

Happy Monday:

I would now like to take a nap, but alas...

Future heat

James Fallows highlights a new US government website that maps how bad the climate will get in your town:

Let me give just a few illustrations from the first such climate-based public map the White House has released, HEAT.gov. The main points about all this and related “digital dashboards” (like the one for Covid) and maps:

  • They are customizable. You can see your immediate neighborhood, or the entire world.
  • They are configurable. You can see the “real” weather as of 2020, and the projected weather as of many decades from now.
  • They can be combined. You can overlay a map of likely future flood zones, with areas of greatest economic and social vulnerabilities.

First, a map showing the priority list of communities most at risk from heat stress some decades from now. This is based on an overlay of likely future temperatures, with current resources and vulnerabilities, and other factors and trends.

Number one on this future vulnerability list is in the Rio Grande Valley of Texas. Number ten is in Arkansas. In between, at number seven, is my own home county in California. You can tune the map to your own interests here. It is meant to serve as a guide for preparation, avoidance, and resilience.

Pretty cool stuff. At the moment, Chicago's weather seems pretty reasonable for July, but the forecast calls for hot and awful weather later this week. And that will keep happening as climate change keeps pushing more energy into the atmosphere.

Tuesday morning...uh, afternoon reading

It's a lovely day in Chicago, which I'm not enjoying as much as I could because I'm (a) in my Loop office and (b) busy as hell. So I'll have to read these later:

Finally, Mick Jagger turns 79 today, which surprised me because I thought he was closer to 130.

Missed anniversary, weather app edition

I've been a little busy this weekend, so even though I remembered that yesterday was the 25th anniversary of Harry Potter's publication, I forgot that Friday was the 25th anniversary of Weather Now v0. Yes, I've had a weather application on the Internet for 25 years.

The actual, standalone Weather Now application launched on 11 November 1999, which I plan to make a bigger deal of. And that comes after this blog's 25th anniversary (13 May 1998). But it's kind of cool that I have an app running continuously since the early days of Bill Clinton's second term.

Reading Excel files in code is harmful

I've finally gotten back to working on the final series of place-data imports for Weather Now. One of the data sources comes as a 20,000-line Excel spreadsheet. Both because I wanted to learn how to read Excel files, and to make updating the Gazetteer transparent, I wrote the first draft of the import module using the DocumentFormat.OpenXml package from Microsoft.

The recommended way of reading a cell using that package looks like this:

// Looks up a cell by its reference (e.g., "B7") and resolves shared strings
private static string? CellText(
	WorkbookPart workbook,
	OpenXmlElement sheet,
	string cellId)
{
	// Linear scan of every cell in the sheet on every lookup
	var cell = sheet.Descendants<Cell>()
		.FirstOrDefault(c => c.CellReference == cellId);
	if (cell is null) return null;

	// Inline values come back directly
	if (cell.DataType is null || cell.DataType != CellValues.SharedString)
	{
		return cell.InnerText;
	}

	// Shared-string cells hold an index into the workbook's shared string table
	if (!int.TryParse(cell.InnerText, out var id))
	{
		return cell.InnerText;
	}
	var sharedString = workbook.SharedStringTablePart?
		.SharedStringTable
		.Elements<SharedStringItem>()
		.ElementAt(id);
	if (sharedString?.Text is not null)
	{
		return sharedString.Text.Text;
	}
	return sharedString?.InnerText is null
		? sharedString?.InnerXml
		: sharedString.InnerText;
}

When I ran a dry import (meaning it only read the file and parsed it without writing the new data to Weather Now), it...dragged. A lot. It went so slowly, in fact, that I started logging the rate that it read blocks of rows:

2022-05-29 18:43:14.2294|DEBUG|Still loading at 100 (rate: 1.4/s)
2022-05-29 18:44:26.9709|DEBUG|Still loading at 200 (rate: 1.4/s)
2022-05-29 18:45:31.3087|DEBUG|Still loading at 300 (rate: 1.4/s)
...

2022-05-29 22:26:27.7797|DEBUG|Still loading at 8300 (rate: 0.6/s)
2022-05-29 22:31:01.5823|DEBUG|Still loading at 8400 (rate: 0.6/s)
2022-05-29 22:35:40.3196|DEBUG|Still loading at 8500 (rate: 0.6/s)

Yes. First, it looked like it would take 4 hours to read 20,000 rows of data, but as you can see, it got even slower as it went on.

I finally broke out the profiler, and ran a short test that parsed 14 lines of data. The profiler showed a few hot spots:

  • 355,000 calls to OpenXmlElement<T>.MoveNext
  • 740,000 calls to OpenXmlCompositeElement.get_FirstChild
  • 906,000 calls to OpenXmlChildElements<GetEnumerator>.MoveNext

That's for 14 lines of data.
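Those numbers make sense once you notice that every call to CellText re-runs sheet.Descendants<Cell>() from the top of the sheet, so the work grows quadratically with the number of cells. Here's a rough, self-contained model of that access pattern, using plain counters instead of OpenXml (the 50-column width is an assumption for illustration, not the real sheet's layout):

```csharp
using System;

// Model: looking up each cell with a fresh linear scan (what
// Descendants<Cell>().FirstOrDefault does) visits O(n^2) elements,
// while a single pass over the rows visits each element once.
var rows = 14;           // rows in the short profiler test
var cols = 50;           // hypothetical column count
var cells = rows * cols;

long scanVisits = 0;
for (var i = 0; i < cells; i++)
{
	// FirstOrDefault walks from the start until it reaches cell i
	scanVisits += i + 1;
}

long singlePassVisits = cells;  // one enumeration of the whole sheet

Console.WriteLine($"per-cell scans: {scanVisits:N0} visits");
Console.WriteLine($"single pass:    {singlePassVisits:N0} visits");
```

Even this toy model lands in the same order of magnitude as the profiler's 355,000 MoveNext calls for 14 rows, and the quadratic term explains why the import got *slower* as it went: each lookup's scan distance kept growing.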

So I gave up and decided to export the data file to a tab-delimited text file. This code block, which opens up the Excel workbook:

using var document = SpreadsheetDocument.Open(fileName, false);
var workbook = document.WorkbookPart;
if (workbook is null)
	throw new InvalidOperationException($"The file \"{fileName}\" was not a valid data file");

// Find the worksheet by name, then get its part and row collection
var sheet = workbook.Workbook.Descendants<Sheet>().FirstOrDefault(s => s.Name == "Airports");
if (sheet is null) throw new InvalidOperationException("Could not find the data sheet");

var sheetPart = (WorksheetPart)workbook.GetPartById(sheet.Id!);
var sheetData = sheetPart.Worksheet.Elements<SheetData>().First();
var rows = sheetData.Elements<Row>().Count();

now looks like this:

var lines = File.ReadAllLines(fileName);

And the code to read the data from an individual cell becomes:

return index < columns.Count ? columns[index] : null;
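Putting those two pieces together, the whole reader collapses to a split-and-index. A minimal sketch (the row content and column layout here are made up for illustration, not the actual Weather Now code):

```csharp
using System;

// Split each tab-delimited line once; a "cell" is then just a
// bounds-checked index into the resulting array.
static string? CellText(string[] columns, int index) =>
	index < columns.Length ? columns[index] : null;

var line = "KORD\tChicago O'Hare\t41.98\t-87.90";  // hypothetical row
var columns = line.Split('\t');

Console.WriteLine(CellText(columns, 1));                 // Chicago O'Hare
Console.WriteLine(CellText(columns, 9) ?? "(missing)");  // out of range -> null
```

No XML tree, no shared-string indirection: each cell read is a constant-time index instead of a scan of the whole sheet.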

Boom. Done. The refactoring took 30 minutes. My profiler now says the most frequent call in the 14-row test occurs just 192 times, and the whole thing finishes in 307 ms.

So let's run it against the full file, now converted to tab-delimited text:

2022-05-30 09:19:33.6255|DEBUG|Still loading at 100 (rate: 211.3/s)
2022-05-30 09:19:33.8813|DEBUG|Still loading at 200 (rate: 274.2/s)
2022-05-30 09:19:34.1342|DEBUG|Still loading at 300 (rate: 305.4/s)
...
2022-05-30 09:20:14.9819|DEBUG|Still loading at 19600 (rate: 468.6/s)
2022-05-30 09:20:15.2609|DEBUG|Still loading at 19700 (rate: 467.8/s)
2022-05-30 09:20:15.5030|DEBUG|Still loading at 19800 (rate: 467.5/s)

Well, then. The first few hundred rows show roughly a 150x improvement (211/s versus 1.4/s), and it actually gets faster as it goes, so the whole import takes 45 seconds instead of 6 hours.

So much time wasted yesterday. Just not worth it.

The new guys

I spent today bringing two new developers up to speed on our software and architecture, and in a little bit we're going to a project kickoff dinner. So while I recognize that News have happened, I have no time to report even a single New this evening.