Regular Expression in the Real World

Posted by Miguel Ángel Domínguez Coloma on Jun 9, 2015

Regular expressions are patterns used to search, or to search and replace, within text.

The purpose of this post is to encourage the use of RegEx in everyday work by showing examples of common uses.

Jump to the cheat sheet section before continuing if you are not familiar with the RegEx syntax.

Searching for a class in C# with prefix and suffix

Problem

In this example, we have several classes whose names contain a prefix and a suffix. For example, developers commonly use the name of the project as the prefix and the name of the base class as the suffix:

public class MyProjectCardAdapter

public class MyProjectWorkshopAdapter

public class MyProjectWheelsAdapter
Solution
(class\s+MyProject)([\d\w]+)(Adapter)
Tips
  • Use \s+ to match one or more whitespace characters between "class" and the name of the class, in case there is more than one space.
  • [\d\w]+ matches a sequence of one or more (marked by +) digits or word characters, which form the name of the adapter. Since \w already includes digits, \w+ alone would match the same text.
  • Use parentheses to organize the search and make it more readable. They are optional here.
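
As a minimal sketch of how the pattern above could be run from C#, the following program applies it to a few class declarations and returns the part of each name between the prefix and the suffix. The ClassSearchDemo class, the FindAdapterNames helper and the sample source text are assumptions made for illustration only.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ClassSearchDemo
{
    // Same pattern as in the Solution: prefix, variable middle part, suffix.
    private static readonly Regex AdapterClass =
        new Regex(@"(class\s+MyProject)([\d\w]+)(Adapter)");

    // Hypothetical helper: returns the middle part of every matching class name.
    public static List<string> FindAdapterNames(string source)
    {
        var names = new List<string>();
        foreach (Match match in AdapterClass.Matches(source))
        {
            // Group 2 is the text between "MyProject" and "Adapter".
            names.Add(match.Groups[2].Value);
        }
        return names;
    }

    public static void Main()
    {
        string source =
            "public class MyProjectCardAdapter\n" +
            "public class   MyProjectWorkshopAdapter\n" +
            "public class OtherProjectWheelsAdapter\n";

        foreach (string name in FindAdapterNames(source))
        {
            Console.WriteLine(name); // prints "Card", then "Workshop"
        }
    }
}
```

The third declaration is skipped because it does not carry the MyProject prefix, and the second one still matches despite the extra spaces thanks to \s+.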

Refactoring some HTML code

Problem

We would like to refactor some HTML code. For this example, we have some old code that uses paragraphs and classes to define titles and subtitles in certain text.

<p class="title title-section">Title here</p>

blah blah blah
<p class="bold-text">blah blah blah</p>

blah blah blah
<p class="title title-subsection">Subtitle here</p>
<p class="bold-text">blah blah blah</p>

blah blah blah
Solution
Search:
<p(.*?)class="(.*?)title-section(.*?)"[^>]*>(.*?)</p>

Replace:
<h1>$4</h1>
Search:
<p(.*?)class="(.*?)title-subsection(.*?)"[^>]*>(.*?)</p>

Replace:
<h2>$4</h2>
Result
<h1>Title here</h1>

blah blah blah
<p class="bold-text">blah blah blah</p>

blah blah blah
<h2>Subtitle here</h2>
<p class="bold-text">blah blah blah</p>

blah blah blah
Tips
  • Use (.*?) if you don't know whether something comes before the class you are looking for; the same applies to what comes after it. (.*?)WHATEVER(.*?) behaves like *WHATEVER* in a normal wildcard search.
  • Instead of using (.*?) at the close of the HTML tag, use [^>]* to match everything up to the next >.
  • Group what you want to reuse in the replacement. In our case, the content of the paragraph is captured by the fourth group ($4); the first three groups only capture attribute text.
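
The same search and replace can be sketched in C# with Regex.Replace. The HtmlRefactorDemo class and the ReplaceParagraphs helper are hypothetical names, and the sketch assumes that each paragraph sits on a single line and is closed with </p>. Counting the parentheses in this sketch's pattern, the paragraph text is captured by the fourth group, so the replacement uses $4.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HtmlRefactorDemo
{
    // Hypothetical helper: turns <p class="... cssClass ...">text</p> into <newTag>text</newTag>.
    public static string ReplaceParagraphs(string html, string cssClass, string newTag)
    {
        string pattern =
            "<p(.*?)class=\"(.*?)" + Regex.Escape(cssClass) + "(.*?)\"[^>]*>(.*?)</p>";

        // Groups 1 to 3 capture attribute text; group 4 captures the paragraph content.
        return Regex.Replace(html, pattern, "<" + newTag + ">$4</" + newTag + ">");
    }

    public static void Main()
    {
        string html =
            "<p class=\"title title-section\">Title here</p>\n" +
            "<p class=\"bold-text\">blah blah blah</p>\n" +
            "<p class=\"title title-subsection\">Subtitle here</p>\n";

        html = ReplaceParagraphs(html, "title-section", "h1");
        html = ReplaceParagraphs(html, "title-subsection", "h2");

        // The two title paragraphs become <h1> and <h2>; the bold-text paragraph is untouched.
        Console.WriteLine(html);
    }
}
```

This kind of replacement is convenient for regular, machine-generated markup; for nested or multi-line HTML an HTML parser is usually the safer tool.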

Web API validation in .NET using models

Problem

We would like to validate the properties of our models when a client makes a call to our API. By putting validation attributes on the individual properties of our models, we avoid adding more logic to our controllers and let them focus on the business logic.

public class RegisterBindingModel
{
    //...

    [Required]
    [DataType(DataType.Password)]
    [Display(Name = "Password")]
    public string Password { get; set; }

    //...
}
Solution

We would like the password to contain at least one digit, one uppercase letter and one lowercase letter. The minimum length is 6 characters, but this can also be validated with another attribute in .NET.

RegEx: ^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)[a-zA-Z\d]{6,}$

public class RegisterBindingModel
{
    //...

    [Required]
    [StringLength(100, ErrorMessage = "The {0} must be at least {2} characters long.", MinimumLength = 6)]
    [RegularExpression(@"^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)[a-zA-Z\d]*$",
     ErrorMessage = "The password must contain an uppercase letter, a lowercase letter and a digit.")]
    [DataType(DataType.Password)]
    [Display(Name = "Password")]
    public string Password { get; set; }

    //...
}
Tips
  • Using lookahead, we can verify some conditions before matching the text.
  • With (?=.*[a-z]) we look ahead to check that the text contains at least one lowercase letter.
  • With (?=.*[A-Z]) we look ahead to check that the text contains at least one uppercase letter.
  • With (?=.*\d) we look ahead to check that the text contains at least one digit.
  • [a-zA-Z\d]* is the "real" match, but on its own it cannot check that a lowercase letter, an uppercase letter and a digit each appear at least once.
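
To see the lookaheads at work outside of the model attributes, here is a small sketch that checks a few candidate passwords with Regex.IsMatch. The PasswordPatternDemo class and the IsValidPassword helper are assumptions made for illustration; the explicit length check stands in for the StringLength attribute.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PasswordPatternDemo
{
    // Same pattern as in the RegularExpression attribute of the model.
    private static readonly Regex PasswordPattern =
        new Regex(@"^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)[a-zA-Z\d]*$");

    // Hypothetical helper: the length check mirrors MinimumLength = 6.
    public static bool IsValidPassword(string candidate)
    {
        return candidate.Length >= 6 && PasswordPattern.IsMatch(candidate);
    }

    public static void Main()
    {
        Console.WriteLine(IsValidPassword("Passw0rd"));  // True
        Console.WriteLine(IsValidPassword("password1")); // False: no uppercase letter
        Console.WriteLine(IsValidPassword("PASSWORD1")); // False: no lowercase letter
        Console.WriteLine(IsValidPassword("Password"));  // False: no digit
        Console.WriteLine(IsValidPassword("Pass w0rd")); // False: the space is not in [a-zA-Z\d]
        Console.WriteLine(IsValidPassword("Pw0"));       // False: shorter than 6 characters
    }
}
```

Each lookahead starts at the beginning of the string and consumes nothing, which is why the three conditions can be stated in any order.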

Web scraping

Problem

In this example, we would like to parse information from a website (HTML) and collect it into objects in our C# application.

Without going into the details of the page being scraped, let's say we created a regular expression that recognizes items in a list, where:

  • the first group corresponds to the name of a product;
  • the second one corresponds to its composition;
  • and the third one to its price.
<li>(.+?)\s*?\((.*?)\)\s*?(\d+|\d+\.\d+)</li>

Solution

Using .NET C#, we created a class to hold the information that we want to obtain:

internal class Dish
{
    public string Name { get; set; }
    public string Composition { get; set; }
    public float Price { get; set; }
}

Using the Regex class and iterating over the Match objects, we can read the captured groups of each match found in the input text (HTML):

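// Requires: using System.Collections.Generic; using System.Globalization;
//           using System.Text.RegularExpressions;
private static ICollection<Dish> ParseHtmlInput(string htmlInputText)
{
    List<Dish> result = new List<Dish>();
    // Groups: 1 = name, 2 = composition (between parentheses), 3 = price.
    Regex regex = new Regex(@"<li>(.+?)\s*?\((.*?)\)\s*?(\d+|\d+\.\d+)</li>");
    Match match = regex.Match(htmlInputText);

    while (match.Success)
    {
        var name = match.Groups[1].Value;
        var composition = match.Groups[2].Value;
        var price = match.Groups[3].Value;

        // Parse with the invariant culture so that "12.50" is read the same way
        // regardless of the machine's regional settings; the price stays 0 on failure.
        var priceAsFloat = 0f;
        float.TryParse(price, NumberStyles.Float, CultureInfo.InvariantCulture, out priceAsFloat);

        result.Add(new Dish { Name = name, Composition = composition, Price = priceAsFloat });

        match = match.NextMatch();
    }

    return result;
}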

Common Search Regex cheat sheet

Regex          Description                                                             Example
\d             Matches a digit from 0 to 9 (\D matches the opposite)                   \d\d\d
\w             Matches a word character: letter, digit or _ (\W matches the opposite)  \w\w\w
\s             Matches any whitespace character, including tabs and line breaks        \s\s\s
\n             Matches a line break                                                    \n\n\n
.              Matches any character except a line break                               ...
*              Repeats the preceding character or group 0 or more times                \d*
+              Repeats the preceding character or group 1 or more times                \d+
?              Repeats the preceding character or group 0 or 1 times                   \d?
{N1}, {N1,N2}  Repeats exactly N1 times, or from N1 to N2 times                        \d{1,5}
[...]          Set of characters that can match                                        [abc] means (a|b|c)
[^...]         Set of characters that cannot match                                     [^abc] means !(a|b|c)

Common Replace Regex cheat sheet

Regex  Description                                                        Example
(...)  Groups a sequence and captures the text it matches
$N     In the replacement, inserts the text captured by group number N    Search: hello (\w+) Replace: goodbye $1
\N     In the search, matches the same text that group number N captured  Search: <(\w+)>.*</\1>
Lookahead and Lookbehind cheat sheet

Regex   Description                                                                                              Example
(?=X)   Lookahead: Asserts that what immediately follows the current position in the string is X                 (?=foo)
(?<=X)  Lookbehind: Asserts that what immediately precedes the current position in the string is X               (?<=foo)
(?!X)   Negative Lookahead: Asserts that what immediately follows the current position in the string is not X    (?!foo)
(?<!X)  Negative Lookbehind: Asserts that what immediately precedes the current position in the string is not X  (?<!foo)
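
As a rough illustration of the four assertions, the sketch below extracts numbers from a sample string depending on what surrounds them. The LookaroundDemo class, the FirstMatch helper and the sample text are assumptions chosen for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookaroundDemo
{
    // Returns the first match of the pattern, or an empty string when nothing matches.
    public static string FirstMatch(string input, string pattern)
    {
        Match match = Regex.Match(input, pattern);
        return match.Success ? match.Value : string.Empty;
    }

    public static void Main()
    {
        string text = "price: 100 EUR, 250 USD";

        // Lookahead: digits that are followed by " USD".
        Console.WriteLine(FirstMatch(text, @"\d+(?= USD)"));       // 250

        // Lookbehind: digits that are preceded by "price: ".
        Console.WriteLine(FirstMatch(text, @"(?<=price: )\d+"));   // 100

        // Negative lookahead: a whole number that is not followed by " EUR".
        Console.WriteLine(FirstMatch(text, @"\b\d+\b(?! EUR)"));   // 250

        // Negative lookbehind: a number that is not preceded by "price: ".
        Console.WriteLine(FirstMatch(text, @"(?<!price: )\b\d+")); // 250
    }
}
```

In all four cases the surrounding text is only checked, not consumed, so the match value contains the digits alone.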
