
Advanced Regular Expressions and Terminology

Today I'm planning to give you a brief on some of the terminology we use in regular expressions. But hey, stop right here if you are not familiar with basic regular expressions, and go through my older posts, What is Regular Expression? and lets-practice-regular-expression, first. Without wasting any more of your time on formalities, let's talk about something that interests both of us:

Be Careful Using Regular Expressions:

Infinite Error:

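The term defines itself: something keeps going forever and causes an error. In regular expression terms, it is an expression that can match 0 characters. When you search for every match, such an expression can succeed again and again at the same position without moving forward, which is why some regex tools flag it as an "infinite" error.

Example:

.* is able to match an empty string of 0 characters, so a search for all matches can keep succeeding at the same spot unless the engine forces it to move on.

Isn't it ridiculous to use such an expression on its own? I mean, you can end up matching anything, including nothing at all. Just to remind you, the dot metacharacter matches any character except a newline (\x0a) by default.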
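Here is a small Python sketch of the idea, using the standard re module. Python's own findall protects itself by stepping past zero-length matches, so to show the problem I've added a hypothetical helper, scan_without_advancing, that restarts each search exactly where the last match ended. It is only an illustration of why a tool would complain, not how any particular tool is implemented, and it is capped at a few steps so it cannot really loop forever.

```python
import re

pattern = re.compile(r".*")

# The pattern succeeds even on an empty string, consuming 0 characters.
empty_match = pattern.match("")
print(empty_match.span())  # (0, 0)

# Python finds the whole text, then one zero-length match at the end.
print(pattern.findall("abc"))  # ['abc', '']


def scan_without_advancing(regex, text, max_steps=5):
    """Naive 'find all' loop (hypothetical helper).

    It restarts each search at the end of the previous match, so a
    zero-length match never moves the position forward. max_steps is a
    safety cap so the demonstration terminates.
    """
    pos = 0
    spans = []
    for _ in range(max_steps):
        m = regex.search(text, pos)
        if m is None:
            break
        spans.append(m.span())
        pos = m.end()  # stays put after a zero-length match
    return spans


print(scan_without_advancing(pattern, "abc"))
# [(0, 3), (3, 3), (3, 3), (3, 3), (3, 3)]
```

After the first real match, the loop is stuck at position 3 and keeps reporting the same empty match until the cap stops it.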

 

The Timeout Error:

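It happens when an expression takes longer than 250ms to execute (that limit comes from the regex tool, not from the regex language itself). For some expressions, the time to execute grows exponentially with the length of the input, often due to nested quantifiers. It's like one loop running inside another, with the engine trying every combination of the two before it gives up.

Example:

When (a+)+Z is executed on aaaaaaaaaaaaaaaaaa, there is no Z to finish the match, so the engine tries every possible way to split the ‘a’ characters between the inner a+ and the outer ( )+. Each extra ‘a’ roughly doubles the number of attempts, which results in exponential growth.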
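If you want to feel this yourself, here is a rough Python sketch that times the nested pattern against a flat a+Z, which accepts the same strings with a single quantifier. The helper timed_search and the input lengths are my own choices for illustration. The actual timings depend on your machine and regex engine, and Python's re module has no built-in 250ms limit, so keep the inputs short.

```python
import re
import time

nested = re.compile(r"(a+)+Z")  # nested quantifiers from the example above
flat = re.compile(r"a+Z")       # accepts the same strings with one quantifier


def timed_search(regex, text):
    """Run regex.search(text) and return (match, elapsed_seconds)."""
    start = time.perf_counter()
    match = regex.search(text)
    return match, time.perf_counter() - start


for n in (8, 12, 16):
    text = "a" * n  # no 'Z' anywhere, so every attempt ends in failure
    _, slow = timed_search(nested, text)
    _, fast = timed_search(flat, text)
    print(f"n={n:2d}  nested={slow:.4f}s  flat={fast:.6f}s")
```

In a backtracking engine you would expect the nested column to grow much faster than the flat one as n increases. Rewriting the pattern so that only one quantifier can consume the ‘a’ characters is usually the simplest fix.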

 

Capturing and Non-Capturing Groups

There are two kinds of character grouping you can do in regex. As the heading says, one is capturing and the other is non-capturing. The names pretty much define them: a capturing group stores whatever it matches, while a non-capturing group matches in the same way but stores nothing. Wait, there is more. You forgot to ask me how this works and where to use it.

Before I tell you how they work and where they apply, let me tell you why you would want to capture, or not capture, something.

  • There is a text you want your regex to match, and the same string (or group of strings) repeats in different places in it. You don't want to spell that string out again each time, so the capturing group is your friend here. The capturing group matches the string, stores it, and lets you refer back to it later in the same regex. Keep in mind that a backreference matches the exact text that was captured, not just anything the group's pattern could match. So it saves you time and effort.
    • Example:

Text:  Hello world Hello

Regex: /(\w+) \w+ \1/

where ( ) is the capturing group and \1 is a backreference to whatever text the ( ) matched.

  • There is a text you want your regex to match, and a string repeats in it several times back to back. Here you can use a non-capturing group with a quantifier to match the text without storing anything for future reference.
    • Example:

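Text: Hello Hello Hello Hello

Regex: /(?:\w+\s){3}\w+/

The group repeats three times and is followed by one last \w+, because the final Hello has no whitespace after it. /(?:\w+\s){4}/ would match only if the text ended with a space or a newline.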
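Both kinds of group can be tried out with a short Python sketch. It uses the sample texts from the two examples above; the variable names are just mine. For the non-capturing case I use three repetitions of the group plus a final \w+, since the sample text has only three spaces, and I also show that a pattern demanding whitespace after all four words needs trailing whitespace in the text.

```python
import re

# Capturing group: \1 must repeat exactly what group 1 matched.
capturing = re.compile(r"(\w+) \w+ \1")
m = capturing.search("Hello world Hello")
print(m.group(0))  # Hello world Hello
print(m.group(1))  # Hello
print(capturing.search("Hello world Goodbye"))  # None

# Non-capturing group: used only to repeat a sub-pattern; nothing is stored.
non_capturing = re.compile(r"(?:\w+\s){3}\w+")
m2 = non_capturing.search("Hello Hello Hello Hello")
print(m2.group(0))  # Hello Hello Hello Hello
print(m2.groups())  # ()

# Requiring whitespace after each of the four words needs trailing whitespace.
four_with_space = re.compile(r"(?:\w+\s){4}")
print(four_with_space.search("Hello Hello Hello Hello"))          # None
print(four_with_space.search("Hello Hello Hello Hello ") is not None)  # True
```

The empty tuple from groups() is the whole point of (?: ): you get the grouping for the quantifier without paying for storage or shifting the numbering of other groups.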

 

Word Boundary Check

It’s good practice to define boundaries for your word matches so that they won’t match parts of other words containing exactly those characters. For example, /rock/ will also match text containing the word ‘rocket’, which you wouldn’t like, would you?

So, a better way to write it is between \b metacharacters, which define word boundaries: /\brock\b/. Focus here: it’s a boundary match, so it matches a position, not text. In other words, \b doesn’t consume any characters in the text.

For example:

Text: bound but not boundary

Regex: /\bbound\b/
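A quick Python sketch of the example above, assuming the standard re module. It compares the pattern with and without \b and shows that \b itself matches with zero width.

```python
import re

text = "bound but not boundary"

# Without boundaries, the 'bound' inside 'boundary' also matches.
print(re.findall(r"bound", text))      # ['bound', 'bound']

# With boundaries, only the standalone word matches.
print(re.findall(r"\bbound\b", text))  # ['bound']

# 'rock' is not a whole word inside 'rocket'.
print(re.search(r"\brock\b", "rocket"))  # None

# \b matches a position: the match has zero width.
print(re.search(r"\b", "rock").span())  # (0, 0)
```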

Be careful with \b, though. It can give you false positives if you expect more from it than it does. For example, say you want to match lines containing the legit talentcookie domain, i.e., talentcookie.com. Using /\btalentcookie\.com\b/ will also trigger on things other than talentcookie.com, like:

talentcookie.com.phish.com/

phish-talentcookie.com/

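That’s because neither the hyphen ‘-’ nor the dot ‘.’ is a word character; the word characters are upper- and lower-case letters, digits and the underscore, i.e., [A-Za-z\d_]. So the position between ‘-’ and ‘t’, or between ‘m’ and ‘.’, counts as a word boundary and \b happily matches there.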
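Here is a Python sketch of that false positive. The strict pattern is only one possible tightening that I'm assuming for illustration: it uses lookarounds to refuse a word character, dot or hyphen on either side of the domain. It is not a complete domain validator.

```python
import re

loose = re.compile(r"\btalentcookie\.com\b")

# Hypothetical tightening: no word character, '.' or '-' may touch the domain.
strict = re.compile(r"(?<![\w.-])talentcookie\.com(?![\w.-])")

samples = [
    "talentcookie.com/",
    "talentcookie.com.phish.com/",
    "phish-talentcookie.com/",
]

for s in samples:
    print(f"{s:30} loose={bool(loose.search(s))} strict={bool(strict.search(s))}")
```

The loose pattern matches all three samples, while the strict one matches only the first. The trade-off is that the strict version also rejects harmless cases such as a subdomain like www.talentcookie.com or the domain followed by a full stop at the end of a sentence, so adjust the lookarounds to what your data actually looks like.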

 

 

Escaping in Character Sets

New regex players get into the habit of escaping every special character they use, especially in character sets. Over time, you should learn where escaping is actually needed, because escaping characters that mean the same thing without it makes your regex look dirty and sometimes confusing. Many characters have a special meaning outside a character set and need escaping there, but inside a character set only a few characters are special (some of them only in certain positions). Somebody might even doubt your hands-on regex experience if you keep escaping the rest.

Characters that need escaping in a character set:

\  ->  so use it like [\\]

]  ->  so use it like [\]]

-  ->  Whether it needs escaping depends on its position. If it sits between other characters in the character set, it needs to be escaped so it doesn’t define a range: [a-z] is a range, but [a\-z] is just three literal characters. So use it like [\w\-\s]. If you put it at the start or the end of the character set, no escaping is needed, so you may use it like [-\w\s] or [\w\s-].

^  ->  so use it like [\^] if you want it as the first character of the character set, where it would otherwise negate the set. Anywhere else, no escaping is needed.

Other characters can be used in [] as they are, without escaping, including the dot, white space, @ etc. (A few flavors, Java for one, also want a literal [ escaped inside a character set.)
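You can check these rules with a small Python sketch. The helper chars_matched is a hypothetical function of mine that reports which characters from a candidate list a given character set accepts. The results below are for Python's re module; other flavors can differ in small ways.

```python
import re


def chars_matched(char_set, candidates):
    """Return the candidate characters that the character set matches."""
    regex = re.compile(char_set)
    return [c for c in candidates if regex.fullmatch(c)]


candidates = ["\\", "]", "-", "^", ".", "@", " ", "a", "b", "c"]

print(chars_matched(r"[\\]", candidates))    # ['\\']  (a single backslash)
print(chars_matched(r"[\]]", candidates))    # [']']

print(chars_matched(r"[a-c]", candidates))   # ['a', 'b', 'c']  (a range)
print(chars_matched(r"[a\-c]", candidates))  # ['-', 'a', 'c']  (escaped hyphen)
print(chars_matched(r"[-ac]", candidates))   # ['-', 'a', 'c']  (hyphen first)
print(chars_matched(r"[ac-]", candidates))   # ['-', 'a', 'c']  (hyphen last)

print(chars_matched(r"[\^a]", candidates))   # ['^', 'a']  (escaped at the start)
print(chars_matched(r"[a^]", candidates))    # ['^', 'a']  (literal when not first)
print(chars_matched(r"[^a]", candidates))    # everything except 'a' (negation)

print(chars_matched(r"[.@ ]", candidates))   # ['.', '@', ' ']  (no escaping needed)
```

Notice that the dot inside [ ] matches only a literal dot, not "any character", which is why [.] needs no backslash.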

 

 
