Channel: My Boog Pages » Programming

Normalizing Spaces with Regular Expressions

Back before I migrated from XML to regular expressions, I used XSL transforms to change various flavors of RSS and Atom feeds into a common format for importing. XSLT has a very nice function called normalize-space(). This function takes a string and returns that same string with leading and trailing whitespace stripped and every run of multiple whitespace characters reduced to a single space. This was pretty handy, as I needed to count words so I could create a short extract, and knowing that I'd only need to worry about a single space at a time made that much simpler.

When I moved the GetExtract functionality into Visual Basic, I figured I didn’t need to worry about this, since I would be using the String.Split function to create an array of words, and that function would be smart enough to deal with consecutive spaces, right? Turns out I was giving Bill Gates and his minions too much credit. When the Split function is confronted with two or more consecutive spaces, it does indeed count some of them as words*. A web search didn’t turn up a native .NET equivalent of normalize-space(), so I had to implement it myself.
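
A quick way to see what Split does with consecutive spaces is a sketch like the following. It assumes a plain console application; the module name SplitDemo and the sample string are made up for illustration and are not from the original project.

```vb
Imports System

Module SplitDemo
    Sub Main()
        ' Two spaces between "one" and "two", three between "two" and "three".
        Dim Sample As String = "one  two   three"

        ' Splitting on a single space yields an empty string for each extra space.
        Dim Words() As String = Sample.Split(" "c)
        Console.WriteLine(Words.Length)   ' 6: "one", "", "two", "", "", "three"

        ' Print each entry in brackets so the empty ones are visible.
        For Each Word As String In Words
            Console.WriteLine("[" & Word & "]")
        Next

        ' On .NET 2.0 and later, this overload drops the empty entries.
        Dim Cleaned() As String = Sample.Split(New Char() {" "c}, StringSplitOptions.RemoveEmptyEntries)
        Console.WriteLine(Cleaned.Length) ' 3
    End Sub
End Module
```

If the StringSplitOptions overload is available in your framework version, it may be enough for word counting on its own. It only splits on the separators you pass in, though, so it does not give you a cleaned-up string the way normalize-space() does.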

And as it turns out, it’s pretty simple. I just used the regular expression \s\s+ to match any run of two or more whitespace characters: \s matches a single whitespace character, and + means one or more occurrences of whatever precedes it.

Here’s all the code required:

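' At the top of the file:
Imports System.Text.RegularExpressions

Public Shared Function NormalizeWhitespace(ByVal InputStr As String) As String

    ' Match any run of two or more whitespace characters.
    Dim NormRx As Regex = New Regex("\s\s+")
    ' Trim the ends, then collapse each run to a single space.
    Return NormRx.Replace(InputStr.Trim, " ")

End Function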

That’s it, and it works like a champ.
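
To show how the function might plug into word counting, here is a minimal sketch. The GetExtract body below is a hypothetical reconstruction (the real one isn't shown in this post), and the module name, the MaxWords parameter, and the sample text are all assumptions for illustration.

```vb
Imports System
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module NormalizeDemo
    ' Same approach as the function above, repeated so the example runs on its own.
    Function NormalizeWhitespace(ByVal InputStr As String) As String
        Dim NormRx As New Regex("\s\s+")
        Return NormRx.Replace(InputStr.Trim(), " ")
    End Function

    ' Hypothetical helper: return at most MaxWords words from the normalized text.
    Function GetExtract(ByVal InputStr As String, ByVal MaxWords As Integer) As String
        Dim Words() As String = NormalizeWhitespace(InputStr).Split(" "c)
        Return String.Join(" ", Words, 0, Math.Min(MaxWords, Words.Length))
    End Function

    Sub Main()
        ' Leading/trailing spaces, repeated spaces, double tabs and a blank line.
        Dim Sample As String = "  The   quick" & vbTab & vbTab & "brown  fox" &
                               vbCrLf & vbCrLf & "jumps  "

        Console.WriteLine(NormalizeWhitespace(Sample)) ' The quick brown fox jumps
        Console.WriteLine(GetExtract(Sample, 3))       ' The quick brown
    End Sub
End Module
```

One thing to keep in mind: because the pattern needs at least two whitespace characters, a lone tab or a lone line feed between words is left as it is, whereas normalize-space() would turn it into a space. If that matters for your input, the pattern \s+ should cover that case as well.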

* As it happens I didn’t check to see if it was counting the spaces themselves as words, or if it was creating words that were empty strings (i.e. the text “between” consecutive spaces). Either way, I was getting extracts that had 10 or 12 words instead of the desired 25.

