So Long, XML. Hello, RegEx.

So a few years ago, as I was clicking the links in my blogroll for the 10th time that day, I wondered if there was a better way of finding out when my favorite sites were updated. This was before feed readers were really popular, and anyway I wanted more of a Google News layout. Then I had an epiphany: site feeds were just XML files, and I had already created a bunch of XML-processing scripts in ASP for my Bleeker Books site (and God help me, it’s still using them).

So over a long weekend I cobbled together a program that would read these feeds into a database and spit them back out in a nice format, and CrimeSpot was born.

Over the next year or so I recoded the site in .NET, and I have to say those tools made it a lot easier. But I still ran into a problem from time to time, one that I couldn’t do anything about: every so often, I couldn’t import a feed. It would have some sort of formatting problem that made it an invalid XML document, and my program would throw up its hands and give up.

Usually this was because of Microsoft Word – if you copy the contents of a Word document and paste it as HTML, a lot of the formatting information gets converted in a weird way. In particular, you end up with a lot of tags that look like <o:p>. To an XML parser, that "o:" looks like a namespace prefix that was never declared, and the document can’t be read. More broadly, a single well-formedness error anywhere in an XML document makes the entire document unreadable.
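To make the failure concrete, here is a minimal sketch in Python (not the .NET code the site actually runs). The feed fragment is invented for illustration, and it assumes the Word markup ends up unescaped inside the description element. The helper name try_parse is hypothetical too.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed fragment. The <o:p> tag uses the prefix "o",
# which is never declared anywhere in the document.
BROKEN_FEED = """<rss version="2.0">
  <channel>
    <item>
      <title>First post</title>
      <description><p>Pasted from Word<o:p></o:p></p></description>
    </item>
  </channel>
</rss>"""


def try_parse(xml_text):
    """Return (True, None) if the text parses as XML, else (False, message)."""
    try:
        ET.fromstring(xml_text)
        return True, None
    except ET.ParseError as exc:
        # One bad tag is enough: the parser rejects the whole document,
        # including the perfectly good <title> element.
        return False, str(exc)


ok, error = try_parse(BROKEN_FEED)
print(ok, error)
```

Nothing is salvaged here: the title is fine, but a strict parser never hands it over.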

As I said, this has been going on for a while, but as I add feeds I can see that it’s going to be a more and more common problem. So I finally made a command decision. Processing these documents as XML is out. From now on I’m going to use regular expressions to extract the data I want.

For those of you not in the know, regular expressions are pattern matching tools that can find and extract information from a longer document. In practical terms, this means that as long as the tags surrounding the content are correct, I can retrieve the information I want. Any errors in the content itself I can clean up once I’ve got it.
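Here is a rough sketch of the idea, again in Python rather than .NET. It assumes an RSS-style feed where each entry sits in an item element with title, link, and description children. The function names (tag_text, extract_items, strip_word_tags) and the sample feed are made up for illustration, and a real feed would need more cases handled (Atom uses entry instead of item, for one).

```python
import re

# Matches each <item>...</item> block; DOTALL lets "." span line breaks.
ITEM_RE = re.compile(r"<item\b[^>]*>(.*?)</item>", re.DOTALL | re.IGNORECASE)


def tag_text(fragment, tag):
    """Return the raw text between <tag> and </tag>, or None if absent."""
    name = re.escape(tag)
    pattern = r"<%s\b[^>]*>(.*?)</%s>" % (name, name)
    match = re.search(pattern, fragment, re.DOTALL | re.IGNORECASE)
    if match is None:
        return None
    text = match.group(1).strip()
    # Unwrap a CDATA section if the feed used one.
    if text.startswith("<![CDATA[") and text.endswith("]]>"):
        text = text[len("<![CDATA["):-len("]]>")].strip()
    return text


def extract_items(feed_text):
    """Pull title, link, and description out of every item in the feed."""
    return [
        {
            "title": tag_text(body, "title"),
            "link": tag_text(body, "link"),
            "description": tag_text(body, "description"),
        }
        for body in ITEM_RE.findall(feed_text)
    ]


def strip_word_tags(text):
    """Clean-up step: drop Word's <o:p> and </o:p> tags from extracted content."""
    return re.sub(r"</?o:p\b[^>]*>", "", text)


SAMPLE_FEED = """<rss version="2.0"><channel><title>Some site</title>
<item>
  <title>First post</title>
  <link>http://example.com/1</link>
  <description><p>Pasted from Word<o:p></o:p></p></description>
</item>
</channel></rss>"""

for item in extract_items(SAMPLE_FEED):
    print(item["title"], item["link"], strip_word_tags(item["description"]))
```

The stray <o:p> tags never get in the way of finding the item, and they are removed afterwards in a separate pass. The trade-off is that the patterns only know about the tags they are told to look for, so anything unexpected in the feed's outer structure is silently skipped rather than reported.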

This goes back to Postel’s Law, “Be conservative in what you send, be liberal in what you receive.” In my case, this means I have to make my best effort to accept the data that I’m given, ignoring errors whenever possible. And using regular expressions makes that possible.

Now, I love XML (and XSLT, too), and I use it a lot. In fact, XML is my Golden Hammer – I can find a way to work it into just about every project. But in this case I’m working with data that’s not entirely under my control, and I need to be as flexible as I can. And hammers aren’t noted for flexibility!

So Long, XML. Hello, RegEx.
