The Daily Parker

Politics, Weather, Photography, and the Dog

Reading Excel files in code is harmful

I've finally gotten back to working on the final series of place-data imports for Weather Now. One of the data sources comes as a 20,000-line Excel spreadsheet. Both because I wanted to learn how to read Excel files, and to make updating the Gazetteer transparent, I wrote the first draft of the import module using the DocumentFormat.OpenXml package from Microsoft.

The recommended way of reading a cell using that package looks like this:

private static string? CellText(
	WorkbookPart workbook, 
	OpenXmlElement sheet, 
	string cellId)
{
	var cell = sheet.Descendants<Cell>()
		.FirstOrDefault(c => c.CellReference == cellId);
	if (cell is null) return null;

	if (cell.DataType is null || cell.DataType != CellValues.SharedString)
	{
		return cell.InnerText;
	}

	if (!int.TryParse(cell.InnerText, out var id))
	{
		return cell.InnerText;
	}
	var sharedString = workbook.SharedStringTablePart?
		.SharedStringTable
		.Elements<SharedStringItem>()
		.ElementAt(id);
	if (sharedString?.Text is not null)
	{
		return sharedString.Text.Text;
	}
	return sharedString?.InnerText is null 
		? sharedString?.InnerXml : 
		sharedString.InnerText;
}

When I ran a dry import (meaning it only read the file and parsed it without writing the new data to Weather Now), it...dragged. A lot. It went so slowly, in fact, that I started logging the rate that it read blocks of rows:

2022-05-29 18:43:14.2294|DEBUG|Still loading at 100 (rate: 1.4/s)
2022-05-29 18:44:26.9709|DEBUG|Still loading at 200 (rate: 1.4/s)
2022-05-29 18:45:31.3087|DEBUG|Still loading at 300 (rate: 1.4/s)
...

2022-05-29 22:26:27.7797|DEBUG|Still loading at 8300 (rate: 0.6/s)
2022-05-29 22:31:01.5823|DEBUG|Still loading at 8400 (rate: 0.6/s)
2022-05-29 22:35:40.3196|DEBUG|Still loading at 8500 (rate: 0.6/s)

Yes. First, it looked like it would take 4 hours to read 20,000 rows of data, but as you can see, it got even slower as it went on.

I finally broke out the profiler, and ran a short test that parsed 14 lines of data. The profiler showed a few hot spots:

  • 355,000 calls to OpenXmlElement<T>.MoveNext
  • 740,000 calls to OpenXmlCompositeElement.get_FirstChild
  • 906,000 calls to OpenXmlChildElements<GetEnumerator>.MoveNext

That's for 14 lines of data.

So I gave up and decided to export the data file to a tab-delimited text file. This code block, which opens up the Excel workbook:

using var document = SpreadsheetDocument.Open(fileName, false);
var workbook = document.WorkbookPart;
if (workbook is null)
	throw new InvalidOperationException($"The file \"{fileName}\" was not a valid data file");

var sheet = workbook.Workbook.Descendants<Sheet>().FirstOrDefault(s => s.Name == "Airports");
if (sheet is null) throw new InvalidOperationException("Could not the data sheet");

var sheetPart = (WorksheetPart)workbook.GetPartById(sheet.Id!);
var sheetData = sheetPart.Worksheet.Elements<SheetData>().First();
var rows = sheetData.Elements<Row>().Count();

Now looks like this:

var lines = File.ReadAllLines(fileName);

And the code to read the data from an individual cell becomes:

return columns.Count >= index ? columns[index] : null;

Boom. Done. Took 30 minutes to refactor. My profiler now says the most frequent call for the 14-row test occurs just 192 times, and teh whole thing finishes in 307 ms.

So let's run it against the full file, now converted to tab-delimited text:

2022-05-30 09:19:33.6255|DEBUG|Still loading at 100 (rate: 211.3/s)
2022-05-30 09:19:33.8813|DEBUG|Still loading at 200 (rate: 274.2/s)
2022-05-30 09:19:34.1342|DEBUG|Still loading at 300 (rate: 305.4/s)
...
2022-05-30 09:20:14.9819|DEBUG|Still loading at 19600 (rate: 468.6/s)
2022-05-30 09:20:15.2609|DEBUG|Still loading at 19700 (rate: 467.8/s)
2022-05-30 09:20:15.5030|DEBUG|Still loading at 19800 (rate: 467.5/s)

Well, then. The first few hundred see a 200x improvement, and it actually gets faster, so the whole import takes 45 seconds instead of 6 hours.

So much time wasted yesterday. Just not worth it.

The new guys

I spent today bringing two new developers up to speed on our software and architecture, and in a little bit we're going to a project kickoff dinner. So while I recognize that News have happened, I have no time to report even a single New this evening.

Could have saved some money

I just discovered that Azure App Services allows you to create one free managed certificate per App Service. For Weather Now, I spent $140 creating two certificates, when really I only cared about the one (for https://www.wx-now.com).

Microsoft explains:

The free App Service managed certificate is a turn-key solution for securing your custom DNS name in App Service. It's a TLS/SSL server certificate that's fully managed by App Service and renewed continuously and automatically in six-month increments, 45 days before expiration, as long as the prerequisites set-up remain the same without any action required from you. All the associated bindings will be updated with the renewed certificate. You create the certificate and bind it to a custom domain, and let App Service do the rest.

The free certificate comes with the following limitations:

  • Does not support wildcard certificates.
  • Does not support usage as a client certificate by using certificate thumbprint (removal of certificate thumbprint is planned).
  • Does not support private DNS.
  • Is not exportable.
  • Is not supported on App Service Environment (ASE).
  • Only supports alphanumeric characters, dashes (-), and periods (.).

That will make a big difference going forward, and saved me $70 for the emergency Inner-Drive.com port going on this week...

I broke something

When I deployed Weather Now v5 recently, I set about deleting a lot of stuff that cost money to run in Azure but no longer served any purpose. The upgrade was always going to be a one-way trip, which is why I waited so long. (That, and the upgrade necessitated a few one-time charges that I wanted to put off.)

It appears, however, that I deleted something the Inner Drive Technology brochure site needed. Oops. And the old IDT site runs on .NET Framework 4.7, so I can't even modernize it in place.

So...let's see how fast I can port it to .NET 6, shall we?

Head (and kittens) exploding!

Leading off today's afternoon roundup, The Oatmeal (Matthew Inman) announced today that Netflix has a series in production based on his game Exploding Kittens. The premise: God and Satan come to Earth—in the bodies of cats. And freakin' Tom Ellis is one of the voices, because he's already played one of those parts.

Meanwhile, in reality:

  • A consumers group filed suit against Green Thumb Industries and three other Illinois-based cannabis companies under the Clayton Act, alleging collusion that has driven retail pot prices above $8,800 per kilo. For comparison, the group alleges that retail prices in California are just $660 per kilo. (Disclosure: The Daily Parker is a GTI shareholder.)
  • Illinois Governor JB Pritzker (D), one of the indirect defendants in the pot suit, signed a $46 billion budget for the state that includes $1.8 billion in temporary tax relief. Apparently, I'll get a $50 check from the State that I can apply to the $600 increase in property taxes Cook County imposed this year, which is nice, but I think the state could have aimed a bit lower on the income cap for that rebate and given more help to other people.
  • Shortly after US District Court Judge Kathryn Kimball Mizelle (a 35-year-old who never tried a case and who graduated summa cum mediocrae laude from the legal powerhouse University of Florida just 8 years ago and earned a rare "not qualified" rating from the ABA upon her appointment in 2020 by the STBXPOTUS) ruled against the CDC in a case brought by an anti-masker, the DOT dropped mask mandates for public transport and air travel in the US. In related news, the Judge also said it's OK to piss in other people's swimming pools and up to the other swimmers not to drink the water.
  • While the Chicago Piping Plovers organization waits for Monty and Rose to return to Montrose Beach, another one of the endangered birds has landed at Rainbow Beach on the South Side. He appears more inclined to rent than buy, but local ornithologists report the bird has a new profile on the Plōvr dating site.
  • NBC breaks down the three biggest factors driving inflation right now, and yes, one of them is president of Russia. None, however, is president of the US.
  • Along those lines, (sane) Republican writer Sarah Longwell, who publishes The Bulwark, found that 68% of Republicans believe the Big Lie that the XPOTUS won the 2020 election, but "the belief that the election was stolen is not a fully formed thought. It’s more of an attitude, or a tribal pose." Makes me proud to be an American!

And finally, via Bruce Schneier, two interesting bits. First, a new paper explains how a bad actor can introduce a backdoor into a machine learning training session to force specific outcomes (explained in plain English by Cory Doctorow). Second, an attacker used a "flash loan" to take over the Beanstalk crypto currency voting system and stole $182 million from it. Because Crypto Is The Future™.

It's 5pm somewhere

Actually, it's 5pm here. And I have a few stories queued up:

Finally, author John Scalzi puts Rogue One in third place on his ranked list of Star Wars films, with some good reasons.

Web3 is coming for your kitchen

Via Molly White, a new company called Gripnr wants to monetize your D&D campaign, and it's as horrible as it sounds:

Gripnr plans to generate 10,000 random D&D player characters (PCs), assign a “rarity” to certain aspects of each (such as ancestry and class), and mint them as non-fungible tokens, or NFTs. Each NFT will include character stats and a randomly-generated portrait of the PC designed in a process overseen by Gripnr’s lead artist Justin Kamerer. Additional NFTs will be minted to represent weapons and equipment.

Next, Gripnr will build a system for recording game progress on the Polygon blockchain. Players will log into the system and will play an adventure under the supervision of a Gripnr-certified Game Master. After each game session is over, the outcome will be logged on-chain, putting data back onto each NFT via a new contract protocol that allows a single NFT to become a long record of the character’s progression. Gripnr will distribute the cryptocurrency OPAL to GMs and players as in-game capital. Any loot, weapons, or items garnered in-game will be minted as new sellable NFTs on OpenSea, a popular NFT-marketplace.

As a D&D veteran who once played a character (for 5 minutes) with Gary Gygax* as DM, I can't see how any gamer would want to do this. Molly White has spent the last two years documenting the ways scammers and grifters have used "the blockchain" and "NFTs" and other Web3 buzzwords to steal (or, as I believe, launder) billions of dollars. Gripnr seems like just one more scam, but I could be wrong: Gripnr could just be a lazy get-rich-quick scheme for its creators.

All the rows in the world

When I launched the final weather archive import on Tuesday, I predicted it would finish around 1pm today. See my accuracy for yourself:

2022-04-08 12:54:05.0975|INFO|Moved 118,773,651 weather archives from v3 to v5
2022-04-08 12:54:05.0975|INFO|Finished importing; duration 3.03:41:19.2445019
2022-04-08 12:54:05.0975|INFO|Import finished

Not a bad prediction.

So Weather Now 5 now has about 260 million historical records going back to 2006, including Chicago's weather from 15 years ago this hour. And where the weather station reported climate records, we've got those too.

Microsoft Azure recalculates storage use daily around 11 am Central time, so I don't have the complete picture yet, but it looks like I transferred about 245 GB of data. I'll find out for sure tomorrow, and in 3-4 days I'll get an accurate view of the storage cost.

Whew. I'm glad that's over.

Somebody call lunch!

I've gotten two solid nights of sleep in a row, and I've got a clean desk for the first time in weeks. I hope that this becomes the norm, at least until November, when I'll have a packed musical schedule for six weeks as the Apollo Chorus rehearses or performs about 30 times. But that's seven months off.

That gives me plenty of time to listen to or read these:

And finally, in compiling geographic source data for Weather Now, I discovered that the International Civil Aviation Organisation (ICAO) assigned an official designator the location where the Ingenuity helicopter landed on Mars: JZRO, for Jezero Crater.