Back before I migrated from XML to regular expressions, I used XSL transforms to change various flavors of RSS and Atom feeds into a common format for importing. XSLT has a very nice function called normalize-space(). It takes a string and returns that same string with leading and trailing whitespace stripped and every run of multiple whitespace characters reduced to a single space. This was pretty handy: I needed to count words so I could create a short extract, and it meant I'd only ever have to deal with a single space at a time.
When I moved the GetExtract functionality into Visual Basic, I figured I didn’t need to worry about this, since I would be using the String.Split function to create an array of words, and that function would be smart enough to deal with consecutive spaces, right? Turns out I wasn’t giving Bill Gates and his minions enough credit. When the Split function is confronted with two or more consecutive spaces, it does indeed count some of them as words*. A web search didn’t turn up a native .NET equivalent of normalize-space(), so I had to implement it myself.
And as it turns out, it’s pretty simple. I just used the regular expression \s\s+ to match any sequence of more than a single whitespace character: \s matches whitespace, and + means one or more occurrences, so the pattern as a whole needs at least two whitespace characters in a row.
Here’s all the code required:
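' At the top of the file:
Imports System.Text.RegularExpressions

' Inside a class: trim, then collapse each run of 2+ whitespace characters into one space.
Public Shared Function NormalizeWhitespace(ByVal InputStr As String) As String
    Dim NormRx As Regex = New Regex("\s\s+")
    Return NormRx.Replace(InputStr.Trim(), " ")
End Function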
That’s it, and it works like a champ.
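To show how this fits into the word-counting use case, here is a minimal sketch of the function feeding a short extract. The GetExtract body, the MaxWords parameter, and the ExtractDemo module are hypothetical stand-ins; the post never shows the real GetExtract, so treat this as one plausible shape rather than the actual implementation.

```vb
Imports System
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module ExtractDemo
    ' Same approach as the function above: trim, then collapse runs of 2+ whitespace.
    Function NormalizeWhitespace(ByVal InputStr As String) As String
        Return Regex.Replace(InputStr.Trim(), "\s\s+", " ")
    End Function

    ' Hypothetical helper (not from the post): returns at most MaxWords words,
    ' joined by single spaces.
    Function GetExtract(ByVal InputStr As String, ByVal MaxWords As Integer) As String
        Dim Words As String() = NormalizeWhitespace(InputStr).Split(" "c)
        Dim Count As Integer = Math.Min(MaxWords, Words.Length)
        Return String.Join(" ", Words, 0, Count)
    End Function

    Sub Main()
        ' Leading spaces, a triple space, a doubled line break, and a trailing space.
        Dim Messy As String = "  The quick   brown" & vbCrLf & vbCrLf & "fox  jumps "
        Console.WriteLine(NormalizeWhitespace(Messy))   ' The quick brown fox jumps
        Console.WriteLine(GetExtract(Messy, 3))         ' The quick brown
    End Sub
End Module
```

One thing to keep in mind: because \s\s+ needs at least two whitespace characters, a lone tab or line break between two words is left as-is, and Split(" "c) will then treat those two words as one. If that matters for your input, the pattern \s+ replaces every whitespace run, including single characters, with a space.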
* As it happens I didn’t check to see if it was counting the spaces themselves as words, or if it was creating words that were empty strings (i.e. the text “between” consecutive spaces). Either way, I was getting extracts that had 10 or 12 words instead of the desired 25.
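For anyone curious about the question in the footnote, a small sketch like the one below can settle it. The sample string and the SplitDemo module are made up for illustration. It also tries the Split overload that takes StringSplitOptions, which, as far as I know, arrived in .NET 2.0 and so may not have been available when the original code was written.

```vb
Imports System

Module SplitDemo
    Sub Main()
        ' Two spaces after "one", three spaces after "two".
        Dim Sample As String = "one  two   three"

        ' Plain Split: every extra space produces an empty-string entry.
        Dim Raw As String() = Sample.Split(" "c)
        Console.WriteLine(Raw.Length)     ' 6: "one", "", "two", "", "", "three"

        ' Overload with StringSplitOptions drops the empty entries.
        Dim Clean As String() = Sample.Split(New Char() {" "c}, StringSplitOptions.RemoveEmptyEntries)
        Console.WriteLine(Clean.Length)   ' 3
    End Sub
End Module
```

So the "words" are empty strings, not the spaces themselves, which is why the extracts came up short.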