The Daily Parker

Politics, Weather, Photography, and the Dog

Reading Excel files in code is harmful

I've finally gotten back to working on the final series of place-data imports for Weather Now. One of the data sources comes as a 20,000-line Excel spreadsheet. Both because I wanted to learn how to read Excel files, and to make updating the Gazetteer transparent, I wrote the first draft of the import module using the DocumentFormat.OpenXml package from Microsoft.

The recommended way of reading a cell using that package looks like this:

private static string? CellText(
	WorkbookPart workbook, 
	OpenXmlElement sheet, 
	string cellId)
{
	var cell = sheet.Descendants<Cell>()
		.FirstOrDefault(c => c.CellReference == cellId);
	if (cell is null) return null;

	if (cell.DataType is null || cell.DataType != CellValues.SharedString)
	{
		return cell.InnerText;
	}

	if (!int.TryParse(cell.InnerText, out var id))
	{
		return cell.InnerText;
	}
	var sharedString = workbook.SharedStringTablePart?
		.SharedStringTable
		.Elements<SharedStringItem>()
		.ElementAt(id);
	if (sharedString?.Text is not null)
	{
		return sharedString.Text.Text;
	}
	return sharedString?.InnerText is null 
		? sharedString?.InnerXml : 
		sharedString.InnerText;
}

When I ran a dry import (meaning it only read the file and parsed it without writing the new data to Weather Now), it...dragged. A lot. It went so slowly, in fact, that I started logging the rate that it read blocks of rows:

2022-05-29 18:43:14.2294|DEBUG|Still loading at 100 (rate: 1.4/s)
2022-05-29 18:44:26.9709|DEBUG|Still loading at 200 (rate: 1.4/s)
2022-05-29 18:45:31.3087|DEBUG|Still loading at 300 (rate: 1.4/s)
...

2022-05-29 22:26:27.7797|DEBUG|Still loading at 8300 (rate: 0.6/s)
2022-05-29 22:31:01.5823|DEBUG|Still loading at 8400 (rate: 0.6/s)
2022-05-29 22:35:40.3196|DEBUG|Still loading at 8500 (rate: 0.6/s)

Yes. First, it looked like it would take 4 hours to read 20,000 rows of data, but as you can see, it got even slower as it went on.

I finally broke out the profiler, and ran a short test that parsed 14 lines of data. The profiler showed a few hot spots:

  • 355,000 calls to OpenXmlElement<T>.MoveNext
  • 740,000 calls to OpenXmlCompositeElement.get_FirstChild
  • 906,000 calls to OpenXmlChildElements<GetEnumerator>.MoveNext

That's for 14 lines of data.

So I gave up and decided to export the data file to a tab-delimited text file. This code block, which opens up the Excel workbook:

using var document = SpreadsheetDocument.Open(fileName, false);
var workbook = document.WorkbookPart;
if (workbook is null)
	throw new InvalidOperationException($"The file \"{fileName}\" was not a valid data file");

var sheet = workbook.Workbook.Descendants<Sheet>().FirstOrDefault(s => s.Name == "Airports");
if (sheet is null) throw new InvalidOperationException("Could not the data sheet");

var sheetPart = (WorksheetPart)workbook.GetPartById(sheet.Id!);
var sheetData = sheetPart.Worksheet.Elements<SheetData>().First();
var rows = sheetData.Elements<Row>().Count();

Now looks like this:

var lines = File.ReadAllLines(fileName);

And the code to read the data from an individual cell becomes:

return columns.Count >= index ? columns[index] : null;

Boom. Done. Took 30 minutes to refactor. My profiler now says the most frequent call for the 14-row test occurs just 192 times, and teh whole thing finishes in 307 ms.

So let's run it against the full file, now converted to tab-delimited text:

2022-05-30 09:19:33.6255|DEBUG|Still loading at 100 (rate: 211.3/s)
2022-05-30 09:19:33.8813|DEBUG|Still loading at 200 (rate: 274.2/s)
2022-05-30 09:19:34.1342|DEBUG|Still loading at 300 (rate: 305.4/s)
...
2022-05-30 09:20:14.9819|DEBUG|Still loading at 19600 (rate: 468.6/s)
2022-05-30 09:20:15.2609|DEBUG|Still loading at 19700 (rate: 467.8/s)
2022-05-30 09:20:15.5030|DEBUG|Still loading at 19800 (rate: 467.5/s)

Well, then. The first few hundred see a 200x improvement, and it actually gets faster, so the whole import takes 45 seconds instead of 6 hours.

So much time wasted yesterday. Just not worth it.

Waiting for the cold front

It's mid-July today, at least until around 8pm, when late April should return. The Tribune reported this morning that our spring has had nearly three times the rain as last spring, but actually hasn't gotten much wetter than normal.

Meanwhile:

Finally, via The Onion, Google Maps now shows you shortcuts through people's houses when they're not home.

Spring, Summer, Spring, Summer, who knows

This week's temperatures tell a story of incoherence and frustration: Monday, 26°C; Tuesday, 16°C; yesterday, 14°C; today (so far), 27°C. And this is after a record high of 33°C just a week ago—and a low just above 10°C Tuesday morning.

So while I'm wearing out the tracks on my window sashes, I'll have these items to read while my house either cools down or warms up:

And finally, Ian Bogost feels elated that cryptocurrencies have crashed, particularly because he doesn't own any.

In Chicago, spring lasts 6 hours

At 7am Monday, it was 12°C at Inner Drive Technology World Headquarters. By 6pm the temperature had gone up to 26.5°C, then 29.8°C at 2pm Tuesday, then 29.1°C at 3:15pm yesterday, before a cold front finally ploughed through and got us down to lovely sleeping weather right before I turned in:

The slow rise in my indoor temperature from 7am to 5pm was just my normal A/C program, as was the decline when the A/C turned on at 5. Then at 6, I discovered that the cold front had gone through, so I opened the windows.

Overnight, though, this happened:

This did not lead to a restful sleep, but did apparently lead to a backache.

I'm going to leave my windows open out of optimism that the forecast is accurate and today's high will only hit 27°C. But if it's above 25°C at 6pm, I'm giving up and turning on the A/C. I need sleep.

Such is the end of spring in Chicago.

Update, 3:15pm: I tried, man. But after sweating through two meetings and watching Cassie move from the couch to the hardwood flor, I gave up and turned on the AC. Now it's 31.5°C outside and a dry 24.4°C inside:

Just one or two stories today

Sheesh:

And finally, when I left for San Francisco on Saturday morning, it was 10°C and sunny. Here we are about 76 hours later and it's 30°C. We really don't have spring or fall here some years.

The sun did come out

Chicago actually had clear skies and lovely spring weather today. That said, I'm in San Francisco this weekend, where the weather is almost exactly the same (12°C and clear).

Posting will be sporadic until Tuesday.

On the gloamin moors

Gray skies, day 45: they say the sun will come out tomorrow. I would not bet my bottom dollar on that.

In any event, I'll be in San Francisco for a couple of days, where they've had sun on and off for a while, with sun predicted tomorrow and Sunday. Then, if the predictions hold true, I'll come back here Monday in time to throw open all my windows.

We'll see. But I am really sick of the rain and clouds already.

Gloomier spring than we thought

I mentioned a couple days ago that we haven't seen the sun much this spring. Today the sun came out for only the second time in the last 43 days:

The National Weather Service categorized just one day in April as “clear and sunny,” said Kevin Donofrio, science and operations officer. NBC 5 meteorologist Paul Deanno said Tuesday just one of the past 42 days saw significant sunshine. That report was followed by another dark and soggy day.

Donofrio said this April saw 39.6 mm more rain than usual. Paul Walker, senior metrologist with AccuWeather, said there were only six days without rain last month. 

Chicago was also 1.5°C cooler than an average April, Walker said. Last year, it was 1.2°C warmer.

The gloom covered all of Illinois, in fact:

It’s not a stretch to say that most in Illinois are ready for warm weather by the time we hit mid-April. However, this year did not deliver as April temperatures were persistently well below normal. ... [M]ost days this past month were 3°C to 6°C colder than normal. Only 8 out of 30 April days in Galesburg were warmer than normal, and most by only a few degrees.

As a result of the colder weather, most of the state experienced below freezing temperatures as late as mid-April, and frost was reported as far south as the St. Louis Metro East on April 26.

Given how persistently rainy last month was, we best have a plethora of May flowers. For most places, the total amount of April precipitation was not excessive, and some areas were drier than normal. However, precipitation frequency last month was unusually high as most places recorded 12+ days with some measurable precipitation. Macomb had 22 days with measurable precipitation last month, Aurora had 20, Champaign had 19, and Cairo had 18. The 22 wet days in Macomb last month set a new record for April, which was the 2nd highest frequency of any month on record, only less than the 23 wet days in Macomb in October 2009. Frequent, small precipitation meant most places only had 2 to 2.5 days between precipitation events last month, which stymied fieldwork statewide.

Fortunately, the forecast calls for warmer and sunnier weather next week, with sun and 29°C forecast for Tuesday.