# Regular Expressions

The name might look heavy or suggest something complex, but trust me, it’s not. Regular expressions are easy to understand, and working with them is interesting. First, let me introduce the basics: what are they, why do we need them, and how effective are they?

## So, “what is it?”

A regular expression is a pattern that is matched against text. Let’s take that one word at a time. To understand regular expressions, we first need to understand what a **pattern** is. When you find things happening repeatedly in a fixed order, that is a pattern. For example, if every third glass in a row of 100 glasses is blue, the blue glasses follow a pattern that continues until the row ends. One example we usually overlook, but learned subconsciously, is a line of text: it is made up of words with spaces between them and ends with some punctuation, so there is a pattern in it. There are many more such examples, such as URLs, phone numbers, and the working hours of a company’s employees. So, if we can find a pattern in some given text, we can write an expression that extracts the matching data from the whole text, and that expression is what we call a regular expression.

## Now that we know what it is, “why do we need to use it?”

Well, to answer that, I need to reshape the question into “What are the applications of regular expressions?” The main application is writing patterns that capture something out of data, so a wide range of computer applications that depend on extracting specific values from stored data rely on them. That includes network intrusion detection and prevention systems. A lot of the work done by cyber security companies depends on regular expressions. Almost every high-level language supports these expressions as one of its important tools for various tasks. So you can imagine how many of the software applications on the market use regular expressions in their code.

## How effective is it?

Well, it’s the best tool we have until someone out there comes up with another option. It has helped me extract almost every match I wanted, though a few matches demanded more power than these expressions offer.

Let’s learn how to use them:

Conceptually, the simplest regular expressions are literal characters. The pattern `N` matches the character ‘N’.

Regular expressions next to each other match sequences. For example, the pattern `Nick` matches the sequence ‘N’ followed by ‘i’ followed by ‘c’ followed by ‘k’.
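To see literal matching in action, here is a minimal sketch using Python’s standard `re` module. The sample sentence is made up for illustration.

```python
import re

text = "My name is Nick."

# A literal pattern matches the same characters, in the same order.
match = re.search(r"Nick", text)
found = match.group(0) if match else None   # the matched text
start = match.start() if match else -1      # index where the match begins

# Literal matching is case-sensitive by default, so this finds nothing.
no_match = re.search(r"nick", text)

print(found, start, no_match)
```

`re.search` scans the whole string for the first place the pattern fits and returns `None` when there is no such place.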

If you’ve ever used `grep` on Unix, even if only to search for ordinary-looking strings, you’ve already been using regular expressions! (The `re` in `grep` refers to regular expressions.)

## It’s all about the sequence

Adding just a little complexity, you can match either ‘Nick’ or ‘nick’ with the pattern `[Nn]ick`. The part in square brackets is a *character class*, which means it matches exactly one of the enclosed characters. You can also use ranges in character classes, so `[a-c]` matches either ‘a’ or ‘b’ or ‘c’.

The pattern `.` is special: rather than matching a literal dot only, it matches *any* character (in many engines, any character except a newline by default). It’s the same conceptually as the really big character class `[-.?+%$A-Za-z0-9...]`.

Think of character classes as menus: pick just one.
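Here is a small sketch of character classes and the dot, again with Python’s `re` module. The word lists are made-up test inputs.

```python
import re

# [Nn] matches exactly one character: either 'N' or 'n'.
names = ["Nick", "nick", "Rick", "NNick"]
class_hits = [s for s in names if re.fullmatch(r"[Nn]ick", s)]

# [a-c] is a range: exactly one of 'a', 'b', or 'c'.
range_hits = re.findall(r"[a-c]", "abcdef")

# '.' matches any single character (except a newline, by default in Python).
dot_hits = re.findall(r"N.ck", "Nick Neck N-ck Nck")

print(class_hits)
print(range_hits)
print(dot_hits)
```

Note that ‘NNick’ is rejected: the class picks one character from the menu, not two, and ‘Nck’ is rejected because `.` must consume exactly one character.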

## Some Cool shortcuts

Using `.` can save you lots of typing, and there are other shortcuts for common patterns. Say you want to match non-negative integers: one way to write that is `[0-9]+`. Digits are a frequent match target, so you could instead use `\d+` to match non-negative integers. Others are `\s` (whitespace) and `\w` (word characters: alphanumerics or underscore).

The uppercased variants are their complements, so `\S` matches any *non*-whitespace character, for example.
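The shortcuts can be tried out with a quick sketch like the one below, using Python’s `re` module. The sample string is invented for the example.

```python
import re

text = "Order 66 shipped to bay_7 at 0930"

# \d+ and [0-9]+ find the same runs of digits in this ASCII-only input.
digits = re.findall(r"\d+", text)
digits_class = re.findall(r"[0-9]+", text)

# \w+ grabs runs of letters, digits, and underscores.
words = re.findall(r"\w+", text)

# \S+ grabs runs of anything that is not whitespace (spaces, tabs, ...).
non_space = re.findall(r"\S+", "a  b\tc")

print(digits, digits_class)
print(words)
print(non_space)
```

One caveat worth knowing: on Python 3 strings, `\d` also matches digits from other scripts, while `[0-9]` is strictly the ten ASCII digits.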

## Sometimes Once is not enough

From there, you can repeat parts of your pattern with *quantifiers*. For example, the pattern `ab?c` matches ‘abc’ or ‘ac’ because the `?` quantifier makes the subpattern it modifies optional. Other quantifiers are

- `*` (zero or more times)
- `+` (one or more times)
- `{n}` (exactly *n* times)
- `{n,}` (at least *n* times)
- `{n,m}` (at least *n* times but no more than *m* times)

Putting some of these blocks together, the pattern `[Nn]*ick` matches all of

- ick
- Nick
- nick
- Nnick
- nNick
- nnick
*(and so on)*

The first match demonstrates an important lesson: `*` *always succeeds!* Any pattern followed by `*` can match zero times.
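The quantifiers above can be checked one by one with a short sketch in Python’s `re` module. The candidate strings are made up, and `re.fullmatch` is used so that the whole string has to fit the pattern.

```python
import re

# '?' makes the 'b' optional: zero or one occurrence.
optional = [s for s in ["ac", "abc", "abbc"] if re.fullmatch(r"ab?c", s)]

# {n}, {n,} and {n,m} count repetitions explicitly.
exact = bool(re.fullmatch(r"a{3}", "aaa"))
at_least = bool(re.fullmatch(r"a{2,}", "aaaaa"))
bounded = [s for s in ["a", "aa", "aaa", "aaaa"] if re.fullmatch(r"a{2,3}", s)]

# '*' allows zero repetitions, so plain "ick" is accepted too.
star = [s for s in ["ick", "Nick", "nNick", "Rick"]
        if re.fullmatch(r"[Nn]*ick", s)]

print(optional, exact, at_least, bounded, star)
```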

## Get the Group

A quantifier modifies the pattern to its immediate left. You might expect `0abc+0` to match ‘0abc0’, ‘0abcabc0’, and so forth, but the pattern *immediately* to the left of the plus quantifier is `c`. This means `0abc+0` matches ‘0abc0’, ‘0abcc0’, ‘0abccc0’, and so on.

To match one or more sequences of ‘abc’ with zeros on the ends, use `0(abc)+0`. The parentheses denote a subpattern that can be quantified as a unit. It’s also common for regular expression engines to save or “capture” the portion of the input text that matches a parenthesized group. Extracting bits this way is much more flexible and less error-prone than counting indices and `substr`.
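Here is a sketch of both ideas, quantifying a group and capturing from it, with Python’s `re` module. The “pages 12-34” string and its pattern are hypothetical, chosen only to show capturing.

```python
import re

inputs = ["0abc0", "0abcc0", "0abcabc0"]

# Without parentheses, '+' applies only to the 'c'.
ungrouped = [s for s in inputs if re.fullmatch(r"0abc+0", s)]

# With parentheses, '+' applies to the whole 'abc' unit.
grouped = [s for s in inputs if re.fullmatch(r"0(abc)+0", s)]

# Groups also capture: each pair of parentheses saves the text it matched.
m = re.search(r"(\d+)-(\d+)", "pages 12-34")
first_page, last_page = m.group(1), m.group(2)

print(ungrouped, grouped, first_page, last_page)
```

Groups are numbered from 1 by the position of their opening parenthesis, and `group(0)` is the entire match.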

## Alternation

Earlier, we saw one way to match either ‘Nick’ or ‘nick’. Another is with alternation as in `Nick|nick`. Remember that alternation includes everything to its left and everything to its right. Use grouping parentheses to limit the scope of `|`, *e.g.*, `(Nick|nick)`.

For another example, you could equivalently write `[a-c]` as `a|b|c`, but this is likely to be suboptimal because many implementations assume alternatives will have lengths greater than 1.
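The scope of `|` is easiest to see with something in front of the alternatives. In this sketch (Python’s `re` module), the “Hi ” prefix is my own addition for illustration.

```python
import re

# Without grouping, '|' splits the whole pattern: "Hi Nick" OR just "nick".
loose = r"Hi Nick|nick"

# With grouping, the choice is limited: "Hi " followed by "Nick" or "nick".
tight = r"Hi (Nick|nick)"

loose_accepts_bare = bool(re.fullmatch(loose, "nick"))
tight_accepts_bare = bool(re.fullmatch(tight, "nick"))
tight_greetings = [bool(re.fullmatch(tight, s)) for s in ["Hi Nick", "Hi nick"]]

# For single characters, alternation and a character class agree.
same_result = re.findall(r"a|b|c", "abcdef") == re.findall(r"[a-c]", "abcdef")

print(loose_accepts_bare, tight_accepts_bare, tight_greetings, same_result)
```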

## Escaping

Although some characters match themselves, others have special meanings. The pattern `\d+` doesn’t match a backslash followed by a lowercase ‘d’ followed by a plus sign: to get that, we’d use `\\d\+`. A backslash removes the special meaning from the following character.
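A short sketch of escaping with Python’s `re` module follows. Raw strings (`r"..."`) are used so that Python itself leaves the backslashes alone, and `re.escape` is the standard-library helper that adds the backslashes for you. The sample strings are made up.

```python
import re

# The three-character text: backslash, 'd', '+'.
backslash_d_plus = "\\d+"

digits_pattern = r"\d+"      # one or more digits
literal_pattern = r"\\d\+"   # literal backslash, 'd', literal plus

digits_hit = bool(re.fullmatch(digits_pattern, "2024"))
digits_vs_literal = bool(re.fullmatch(digits_pattern, backslash_d_plus))
literal_hit = bool(re.fullmatch(literal_pattern, backslash_d_plus))

# re.escape() backslash-escapes special characters in arbitrary text.
escaped = re.escape("1+1=2")
escaped_hit = bool(re.fullmatch(escaped, "1+1=2"))
unescaped_hit = bool(re.fullmatch("1+1=2", "1+1=2"))  # '+' is a quantifier here

print(digits_hit, digits_vs_literal, literal_hit, escaped_hit, unescaped_hit)
```

The last line is the classic trap: used as a pattern, `1+1=2` means “one or more 1s, then 1, then =2”, so it does not match its own text.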

## Greediness

Regular expression quantifiers are greedy. This means they match as much text as they possibly can while allowing the entire pattern to match successfully.

For example, say the input is

`"Hello," she said, "How are you?"`

You might expect `".+"` to match only *"Hello,"* and then be surprised to see that it matched from *"Hello* all the way through *you?"*.

To switch from greedy to what you might think of as cautious, add an extra `?` to the quantifier. Now you understand how `".+?"` works. It matches a literal double quote, followed by one or more characters (as few as possible), and terminated by the next double quote.
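The difference between greedy and cautious matching shows up clearly on the quoted sentence from this section. This is a minimal sketch with Python’s `re` module, using straight double quotes in the input.

```python
import re

text = '"Hello," she said, "How are you?"'

# Greedy: .+ runs to the LAST quote that still lets the pattern succeed.
greedy = re.findall(r'".+"', text)

# Lazy: .+? stops at the FIRST quote that lets the pattern succeed.
lazy = re.findall(r'".+?"', text)

# Adding a group returns just the text between the quotes.
inner = re.findall(r'"(.+?)"', text)

print(greedy)
print(lazy)
print(inner)
```

The greedy version returns a single match covering the whole sentence, while the lazy version returns the two quoted pieces separately.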

## Anchors

Use the special pattern `^` to match only at the beginning of your input and `$` to match only at the end. Making “bookends” with your patterns where you say, “I know what’s at the front and back, but give me everything between” is a useful technique.

Say you want to match comments of the form

`-- This is a comment --`

You’d write `^--\s+(.+)\s+--$`.
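Here is the comment pattern tried out in Python’s `re` module, as a rough sketch. The extra input strings are made up to show what the anchors reject.

```python
import re

comment = re.compile(r"^--\s+(.+)\s+--$")

# The bookends match, and the group captures what sits between them.
m = comment.search("-- This is a comment --")
body = m.group(1) if m else None

# ^ and $ reject input with anything before or after the bookends.
leading_junk = comment.search("x -- This is a comment --")
trailing_junk = comment.search("-- This is a comment -- x")

# \s+ tolerates extra spacing, but the greedy (.+) keeps all but the last
# trailing space, so strip the capture if that matters.
spaced = comment.search("--   spaced   --")
spaced_body = spaced.group(1).strip() if spaced else None

print(body, leading_junk, trailing_junk, spaced_body)
```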

First of all, I would like to say that it’s a really nice blog. Secondly, I was confused by the last example about comments. If you want to match a single space, then why use `\s+` at the end?

Hi Neel, `\s+` means at least one whitespace character is there. The point of using `+` here is to make the pattern generic, so it still catches the string even with small variations.

Hey, this is an awesome post!!

Just one question. In the example with the regex `\((.+?)\)`, why were the backslashes included? Shouldn’t it be `((.+?))`?

Thanks Jimmy. The post has been updated with your suggested fix.