A Guide to Regular Expressions in JavaScript

What is a regular expression

Regular expressions (RegExp) are a very efficient way to work with strings.

By constructing a regular expression using special syntax, you can:

  • search for text in a string
  • replace substrings in a string
  • extract information from a string

Almost all programming languages have regular expressions. There are slight differences in implementation, but the general concepts apply almost everywhere.

Regular expressions date back to the 1950s, when they were formalized as a conceptual search pattern for string processing algorithms.

Regular expressions were implemented in UNIX tools such as grep and sed and in popular text editors. As they gained popularity, they were added to the Perl programming language and later to many other languages.

JavaScript, along with Perl, is one of the programming languages that has support for regular expressions built directly into the language.

Difficult, but useful

Regular expressions can seem like absolute nonsense to beginners, and often even to professional developers, if you don't invest the time necessary to understand them.

Regular expressions are difficult to write, difficult to read, and difficult to maintain or change.

But sometimes regular expressions are the only reasonable way to perform some string manipulation, so they are a very valuable tool.

This tutorial aims to give you some understanding of regular expressions in JavaScript in the simplest possible way and provide information on how to read and create regular expressions.

The rule of thumb is that simple regular expressions are easy to read and write, while complex regular expressions can quickly become a mess unless you have a deep understanding of the basics.

What regular expressions look like

In JavaScript, a regular expression is an object that can be defined in two ways.

The first way is to create a new RegExp object using the constructor:

const re1 = new RegExp("hey")

The second way is to use a regular expression literal:

const re1 = /hey/

You know that JavaScript has object literals and array literals? It also has regexp literals.

In the example above, hey is called the pattern. In the literal form it is delimited by forward slashes, while in the constructor form it is not.

This is the first important difference between the two ways of defining regular expressions; we'll see the rest later.

How do they work?

The regular expression we defined above as re1 is very simple. It searches for the string hey without any restrictions: the string can contain a lot of text with hey somewhere in the middle, and the regular expression will match. The string can also contain nothing but hey, and the regular expression will match again.

It's pretty simple.

You can try testing the regular expression using the RegExp.test(String) method, which returns a boolean value:

re1.test("hey") // ✅
re1.test("blablabla hey blablabla") // ✅
re1.test("he") // ❌
re1.test("blablabla") // ❌

In the example above, we simply checked whether each string matches the regular expression pattern stored in re1.

It's a piece of cake, but you already know a lot about regular expressions.

Anchoring

/hey/

matches no matter where hey appears inside the string.

If you want to match strings that start with hey, use the ^ anchor:

/^hey/.test("hey") // ✅
/^hey/.test("bla hey") // ❌

If you want to match strings that end with hey, use the $ anchor:

/hey$/.test("hey") // ✅
/hey$/.test("bla hey") // ✅
/hey$/.test("hey you") // ❌

By combining the two anchors you can match a string that is exactly hey:

/^hey$/.test("hey") // ✅

To match a string that starts with one substring and ends with another, you can use .* , which matches any character repeated 0 or more times:

/^hey.*joe$/.test("hey joe") // ✅
/^hey.*joe$/.test("heyjoe") // ✅
/^hey.*joe$/.test("hey how are you joe") // ✅
/^hey.*joe$/.test("hey joe!") // ❌

Finding elements by range

Instead of searching for a specific string, you can specify a range of characters, like this:

/[a-z]/ // a, b, c, ... , x, y, z
/[A-Z]/ // A, B, C, ... , X, Y, Z
/[a-c]/ // a, b, c
/[0-9]/ // 0, 1, 2, 3, ... , 8, 9

These regular expressions look for strings that contain at least one character from the selected range:

/[a-c]/.test("a") // ✅
/[a-c]/.test("1") // ❌
/[a-c]/.test("A") // ❌
/[a-c]/.test("d") // ❌
/[a-c]/.test("dc") // ✅

Ranges can be combined:

/[A-Za-z0-9]/
/[A-Za-z0-9]/.test("a") // ✅
/[A-Za-z0-9]/.test("1") // ✅
/[A-Za-z0-9]/.test("A") // ✅

Finding multiple matches of a range element

You can check whether a string consists of exactly one character from a range by anchoring the pattern with ^ and $:

/^[A-Za-z0-9]$/
/^[A-Za-z0-9]$/.test("A") // ✅
/^[A-Za-z0-9]$/.test("Ab") // ❌

Pattern inversion

The ^ character at the beginning of a pattern anchors it to the beginning of the string.

Using this character at the start of a range inverts the range, so:

/[^A-Za-z0-9]/.test("a") // ❌
/[^A-Za-z0-9]/.test("1") // ❌
/[^A-Za-z0-9]/.test("A") // ❌
/[^A-Za-z0-9]/.test("@") // ✅

Metacharacters

  • \d matches any digit, equivalent to [0-9]
  • \D matches any character that is not a digit, equivalent to [^0-9]
  • \w matches any alphanumeric character or underscore, equivalent to [A-Za-z0-9_]
  • \W matches any character that is not alphanumeric or an underscore, equivalent to [^A-Za-z0-9_]
  • \s matches any whitespace character: space, tab, newline, and Unicode spaces
  • \S matches any character that is not whitespace
  • \0 matches the NUL character
  • \n matches a newline character
  • \t matches a tab character
  • \uXXXX matches the UTF-16 code unit with hexadecimal code XXXX; the longer \u{XXXXX} form requires the u flag
  • . matches any character except a line terminator such as \n (unless you use the s flag, explained later)
  • [^] matches any character, including newlines. Useful when working with multiline strings
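
The classes listed above can be tried out on a few small strings. This is only a sketch; the sample inputs and variable names are made up for illustration:

```javascript
// Shorthand character classes applied to hypothetical sample inputs.
const digitsOnly = /^\d+$/.test("2024");            // true: every character is a digit
const wordChars = "a_1-b".match(/\w+/g);            // ["a_1", "b"]: \w includes the underscore
const parts = "one  two\tthree".split(/\s+/);       // ["one", "two", "three"]
const dotStopsAtNewline = /^.+$/.test("a\nb");      // false: . does not match \n
const anyIncludingNewline = /^[^]+$/.test("a\nb");  // true: [^] also matches \n

console.log(digitsOnly, wordChars, parts, dotStopsAtNewline, anyIncludingNewline);
```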

Selection in regular expressions

If you want to match one string or another, use the | operator.

/hey|ho/.test("hey") // ✅
/hey|ho/.test("ho") // ✅

Quantifiers

Imagine that you have a regular expression that checks a string to make sure it contains only one digit:

/^\d$/

You can use the ? quantifier, which makes the preceding item optional. In our case, the digit must appear 0 or 1 times:

/^\d?$/

But what if we want the regular expression to match multiple digits?

You can do this in 4 ways, using + , * , {n} and {n,m} .

+

Matches one or more (>=1) elements:

/^\d+$/
/^\d+$/.test("12") // ✅
/^\d+$/.test("14") // ✅
/^\d+$/.test("144343") // ✅
/^\d+$/.test("") // ❌
/^\d+$/.test("1a") // ❌

*

Matches 0 or more (>=0) elements:

/^\d*$/
/^\d*$/.test("12") // ✅
/^\d*$/.test("14") // ✅
/^\d*$/.test("144343") // ✅
/^\d*$/.test("") // ✅
/^\d*$/.test("1a") // ❌

{n}

Matches exactly n elements:

/^\d{3}$/
/^\d{3}$/.test("123") // ✅
/^\d{3}$/.test("12") // ❌
/^\d{3}$/.test("1234") // ❌
/^[A-Za-z0-9]{3}$/.test("Abc") // ✅

{n,m}

Matches between n and m elements:

/^\d{3,5}$/
/^\d{3,5}$/.test("123") // ✅
/^\d{3,5}$/.test("1234") // ✅
/^\d{3,5}$/.test("12345") // ✅
/^\d{3,5}$/.test("123456") // ❌

m can be omitted to leave the upper limit open, so that at least n elements are required:

/^\d{3,}$/
/^\d{3,}$/.test("12") // ❌
/^\d{3,}$/.test("123") // ✅
/^\d{3,}$/.test("12345") // ✅
/^\d{3,}$/.test("123456789") // ✅

Optional elements

Following an element with ? makes it optional:

/^\d{3}\w?$/
/^\d{3}\w?$/.test("123") // ✅
/^\d{3}\w?$/.test("123a") // ✅
/^\d{3}\w?$/.test("123ab") // ❌

Groups

Using parentheses, you can create groups of characters: (...) .

The example below looks for exactly 3 digits followed by one or more alphanumeric characters:

/^(\d{3})(\w+)$/
/^(\d{3})(\w+)$/.test("123") // ❌
/^(\d{3})(\w+)$/.test("123s") // ✅
/^(\d{3})(\w+)$/.test("123something") // ✅
/^(\d{3})(\w+)$/.test("1234") // ✅

A quantifier placed after the closing parenthesis of a group applies to the entire group:

/^(\d{2})+$/
/^(\d{2})+$/.test("12") // ✅
/^(\d{2})+$/.test("123") // ❌
/^(\d{2})+$/.test("1234") // ✅

Capturing groups

So far we have seen how to test strings and check if they contain a certain pattern.

The cool thing about regular expressions is that you can capture specific parts of a string and put them into an array.

You can do this using groups, or rather using capturing groups.

By default, a group is a capturing group. Now, instead of using RegExp.test(String), which simply returns a boolean value, we will use one of the following methods:

  • String.match(RegExp)
  • RegExp.exec(String)

For a regular expression without the g flag they behave the same: both return an array with the whole matched string as the first element, followed by the text matched by each group.

If no match is found, they return null.

"123s".match(/^(\d{3})(\w+)$/) //Array [ "123s", "123", "s" ]
/^(\d{3})(\w+)$/.exec("123s") //Array [ "123s", "123", "s" ]
"hey".match(/(hey|ho)/) //Array [ "hey", "hey" ]
/(hey|ho)/.exec("hey") //Array [ "hey", "hey" ]
/(hey|ho)/.exec("ha!") //null

When a group matches multiple times, only the last value found will be added to the returned array.

"123456789".match(/(\d)+/) //Array [ "123456789", "9" ]

Optional groups

Capturing groups can be made optional using (...)? . If the group does not match, its slot in the returned array is undefined:

/^(\d{3})(\s)?(\w+)$/.exec("123 s") //Array [ "123 s", "123", " ", "s" ]
/^(\d{3})(\s)?(\w+)$/.exec("123s") //Array [ "123s", "123", undefined, "s" ]

Referencing matched groups

Each matched group is assigned a number. $1 refers to the first group, $2 to the second, and so on. This is useful when we talk about replacing part of a string.
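
As a small illustration of numbered references, here is a sketch that reorders a date with String.replace; the input string and variable names are made up for the example:

```javascript
// Reorder a hypothetical ISO date (YYYY-MM-DD) into DD/MM/YYYY.
// $1, $2 and $3 in the replacement refer to the three capturing groups.
const isoDate = "2015-01-02";
const reordered = isoDate.replace(/^(\d{4})-(\d{2})-(\d{2})$/, "$3/$2/$1");

console.log(reordered); // "02/01/2015"
```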

Named Group Capture

This is a new feature in ES2018.

The group can be assigned a name rather than just a slot in the returned array:

const re = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/
const result = re.exec("2015-01-02")
// result.groups.year === "2015";
// result.groups.month === "01";
// result.groups.day === "02";
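
Named groups can also be destructured and referenced in a replacement string with the $<name> syntax. A minimal sketch, with the variable names chosen only for illustration:

```javascript
// Named groups: read them via destructuring, or reuse them in replace().
const dateRe = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/;

const { year, month, day } = dateRe.exec("2015-01-02").groups;

// In a replacement string, $<name> inserts the text captured by that group.
const european = "2015-01-02".replace(dateRe, "$<day>.$<month>.$<year>");

console.log(year, month, day); // 2015 01 02
console.log(european);         // "02.01.2015"
```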

Using match and exec without groups

When the pattern contains no groups, the array returned by match and exec holds only the matched text. With groups, the captured values follow it:

/hey|ho/.exec("hey") // [ "hey" ]
/(hey).(ho)/.exec("hey ho") // [ "hey ho", "hey", "ho" ]

Non-capturing groups

Since groups capture by default, we need a way to leave some groups out of the returned array. This is possible using non-capturing groups, which are written (?:...) .

"123s".match(/^(\d{3})(?:\s)(\w+)$/) // null
"123 s".match(/^(\d{3})(?:\s)(\w+)$/) // Array [ "123 s", "123", "s" ]

Flags

You can use the following flags on any regular expression:

  • g: matches the pattern multiple times (globally)
  • i: makes the regular expression case insensitive
  • m: enables multiline mode. In this mode, ^ and $ match the beginning and end of each line. Without this flag, they match only the beginning and end of the entire string, even when it contains several lines.
  • u: enables Unicode support (added in ES6/ES2015)
  • s: (new in ES2018) short for "single line", it allows . to match newline characters

Flags can be combined, and they are added at the end of a regex literal:

/hey/ig.test("HEy") // ✅

or passed as the second parameter to the RegExp constructor:

new RegExp("hey", "ig").test("HEy") // ✅
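
To see what the m, s, g and i flags change in practice, here is a short sketch. The two-line string is a made-up input used only for illustration:

```javascript
// Hypothetical two-line input.
const poem = "roses are red\nviolets are blue";

const withoutM = /^violets/.test(poem);    // false: ^ only matches at the start of the string
const withM = /^violets/m.test(poem);      // true: with m, ^ also matches after a newline
const withoutS = /red.violets/.test(poem); // false: . does not match the newline
const withS = /red.violets/s.test(poem);   // true: with s, . matches the newline too
const allAre = poem.match(/ARE/gi);        // ["are", "are"]: g finds every match, i ignores case

console.log(withoutM, withM, withoutS, withS, allAre);
```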

Regular Expression Inspection

You can inspect regular expression properties:

  • source - the pattern string
  • multiline - true if the m flag is set
  • global - true if the g flag is set
  • ignoreCase - true if the i flag is set
  • lastIndex - the index at which the next match attempt starts (relevant with the g flag)

/^(\w{3})$/i.source //"^(\\w{3})$"
/^(\w{3})$/i.multiline //false
/^(\w{3})$/i.lastIndex //0
/^(\w{3})$/i.ignoreCase //true
/^(\w{3})$/i.global //false
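
The lastIndex property matters mostly with the g flag, where a regular expression object remembers where the previous match ended. The sketch below, with made-up variable names, shows how repeated test() calls on the same object move through the string:

```javascript
// A global regex keeps state between calls via lastIndex.
const globalRe = /a/g;
const input = "aba";

const first = globalRe.test(input);      // true: match at index 0
const afterFirst = globalRe.lastIndex;   // 1
const second = globalRe.test(input);     // true: match at index 2
const afterSecond = globalRe.lastIndex;  // 3
const third = globalRe.test(input);      // false: nothing left to match
const afterThird = globalRe.lastIndex;   // 0: reset after a failed match

// The flags property lists the active flags as a string.
const flags = /hey/ig.flags;             // "gi"

console.log(first, second, third, afterFirst, afterSecond, afterThird, flags);
```

This statefulness is a common source of surprises when one global regex object is reused for several test() calls.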

Escaping

Special symbols:

\ / [ ] ( ) { } ? + * | . ^ $

These are special characters because they act as control characters when composing regular expression patterns, so if you want to match them literally inside a pattern, you need to escape them with a backslash:

/^\\$/
/^\^$/ // /^\^$/.test("^") ✅
/^\$$/ // /^\$$/.test("$") ✅
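
When a pattern is built from arbitrary text, the special characters in that text have to be escaped first. The helper below is a hypothetical utility (escapeRegExp is not a built-in function), sketched under the assumption that the text should be matched literally:

```javascript
// Hypothetical helper: put a backslash in front of every special character.
// In the replacement string, $& stands for the matched character itself.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

const label = "$9.99 (sale)";
const haystack = "Now only $9.99 (sale)!";

const escapedSource = escapeRegExp(label);              // \$9\.99 \(sale\)
const literalMatch = new RegExp(escapedSource).test(haystack); // true
const unescapedMatch = new RegExp(label).test(haystack);       // false: $ acts as an anchor

console.log(escapedSource, literalMatch, unescapedMatch);
```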

Word boundaries

\b and \B let you check whether a match is at the beginning or end of a word:

  • \b matches at a position that is the beginning or end of a word
  • \B matches at a position that is not the beginning or end of a word

"I saw a bear".match(/\bbear/) //Array ["bear"]
"I saw a beard".match(/\bbear/) //Array ["bear"]
"I saw a beard".match(/\bbear\b/) //null
"cool_bear".match(/\bbear\b/) //null

Replacing with Regular Expressions

We have already seen how to check strings for matching a pattern.

We also saw how you can extract part of the strings corresponding to the pattern into an array.

Now let's look at how to replace parts of a string based on a pattern.

The String object in JavaScript has a replace() method that can be used without regular expressions to perform a single replacement in a string:

"Hello world!".replace("world", "dog") //Hello dog!
"My dog is a good dog!".replace("dog", "cat") //My cat is a good dog!

This method can also take a regular expression as an argument:

"Hello world!".replace(/world/, "dog") //Hello dog!

With a regular expression, the g flag replaces every occurrence of a substring (for plain string patterns, String.prototype.replaceAll, added in ES2021, does the same):

"My dog is a good dog!".replace(/dog/g, "cat") //My cat is a good cat!

Groups allow us to do more fancy things, swapping parts of strings:

"Hello, world!".replace(/(\w+), (\w+)!/, "$2: $1!!!") // "world: Hello!!!"

Instead of a string, you can use a function to do even more interesting things. It receives the same values that String.match(RegExp) or RegExp.exec(String) would return, passed as separate arguments: the matched string first, then one argument per group:

"Hello, world!".replace(/(\w+), (\w+)!/, (matchedString, first, second) => {
  console.log(first);
  console.log(second);
  return `${second.toUpperCase()}: ${first}!!!`
}) //"WORLD: Hello!!!"

Greed

Regular expressions are greedy by default.

What does it mean?

Take for example this regular expression:

/\$(.+)\s?/

We are supposed to extract the dollar amount from a string:

/\$(.+)\s?/.exec("This costs $100")[1] //100

But if there are more words after the number, the result goes wrong:

/\$(.+)\s?/.exec("This costs $100 and it is less than $200")[1] //100 and it is less than $200

Why? Because after the $ sign the regular expression matches any character with .+ and does not stop until it reaches the end of the string. It finishes there because \s? makes the trailing space optional.

To fix this, we need to tell the regular expression to be lazy and match as little as possible. We can do this with the ? symbol after the quantifier:

/\$(.+?)\s/.exec("This costs $100 and it is less than $200")[1] //100

So the ? symbol can mean different things depending on its position: it can be both a quantifier and a lazy-mode indicator.
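
The greedy and lazy variants can be compared side by side. The sketch below also shows a third option that is not part of the text above: describing the amount more precisely with \d+, which avoids relying on a trailing space. Variable names are made up for illustration:

```javascript
const text = "This costs $100 and it is less than $200";

const greedy = /\$(.+)\s?/.exec(text)[1];  // "100 and it is less than $200"
const lazy = /\$(.+?)\s/.exec(text)[1];    // "100"

// The lazy pattern requires whitespace after the amount, so it finds
// nothing when the amount is the last thing in the string.
const lazyAtEnd = /\$(.+?)\s/.exec("This costs $100"); // null

// Alternative sketch: match digits only, then collect every amount with g.
const amounts = [...text.matchAll(/\$(\d+)/g)].map((m) => m[1]); // ["100", "200"]

console.log(greedy, lazy, lazyAtEnd, amounts);
```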

Lookahead: match a string depending on what follows it

Use ?= to match a string that is followed by a specific substring:

/Roger(?=Waters)/
/Roger(?= Waters)/.test("Roger is my dog") //false
/Roger(?= Waters)/.test("Roger is my dog and Roger Waters is a famous musician") //true

?! performs the inverse operation and matches a string that is not followed by a specific substring:

/Roger(?!Waters)/
/Roger(?! Waters)/.test("Roger is my dog") //true
/Roger(?! Waters)/.test("Roger Waters is a famous musician") //false
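
A common practical use of lookahead is inserting thousands separators, because the lookahead checks what follows without consuming it. The function name below is hypothetical and the sketch assumes the input is a string of digits:

```javascript
// Insert a comma at every position that is followed by one or more
// complete groups of three digits and then no further digit.
// \B keeps a comma from being inserted at the very start.
function addThousandsSeparators(digits) {
  return digits.replace(/\B(?=(\d{3})+(?!\d))/g, ",");
}

console.log(addThousandsSeparators("1234567")); // "1,234,567"
console.log(addThousandsSeparators("1000"));    // "1,000"
console.log(addThousandsSeparators("123"));     // "123"
```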

Lookbehind: match a string depending on what precedes it

This is a new feature in ES2018.

Lookahead uses ?= . Lookbehind uses ?<= :

/(?<=Roger) Waters/
/(?<=Roger) Waters/.test("Pink Waters is my dog") //false
/(?<=Roger) Waters/.test("Roger is my dog and Roger Waters is a famous musician") //true

A negative lookbehind uses ?<! :

/(?<!Roger) Waters/
/(?<!Roger) Waters/.test("Pink Waters is my dog") //true
/(?<!Roger) Waters/.test("Roger is my dog and Roger Waters is a famous musician") //false
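
Lookbehind is handy for picking out values by their prefix without including the prefix in the match. A sketch with a made-up price list:

```javascript
// Hypothetical input mixing two currencies.
const priceText = "Widget: $42, gadget: €17, gizmo: $7";

// Digits that come directly after a dollar sign.
const dollarAmounts = priceText.match(/(?<=\$)\d+/g);   // ["42", "7"]

// Digits preceded by neither a dollar sign nor another digit
// (the digit check stops the match from starting in the middle of "42").
const otherAmounts = priceText.match(/(?<![$\d])\d+/g); // ["17"]

console.log(dollarAmounts, otherAmounts);
```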

Regular expressions and Unicode

The u flag is required when working with Unicode strings, particularly when you may need to handle characters in the astral planes, which lie outside the first 65,536 Unicode code points.

Emoji are one example, but not the only one.

/^.$/.test("a") // ✅
/^.$/.test("🐶") // ❌
/^.$/u.test("🐶") // ✅

Therefore, always use the u flag.

Unicode characters, like ordinary characters, can be used in ranges:

/[a-z]/.test("a") // ✅
/[1-9]/.test("1") // ✅
/[🐶-🦊]/u.test("🐺") // ✅
/[🐶-🦊]/u.test("🐛") // ❌

JavaScript compares the internal code points, so 🐶 < 🐺 < 🦊 because \u{1F436} < \u{1F43A} < \u{1F98A} . Check the full emoji list to see the codes and find out their order.

Unicode property escapes

As we said above, in a regular expression pattern you can use \d to match any digit, \s to match any whitespace character, \w to match any alphanumeric character, and so on.

Unicode property escapes are an ES2018 feature that extends this concept to all Unicode characters by adding \p{} and its negation \P{} .

Any Unicode character has a set of properties. For example, Script determines the language family, ASCII is a boolean that is true for ASCII characters, and so on. You can put such a property in curly braces and the regular expression will check that it is true:

/^\p{ASCII}+$/u.test("abc") // ✅
/^\p{ASCII}+$/u.test("ABC@") // ✅
/^\p{ASCII}+$/u.test("ABC🙃") // ❌

ASCII_Hex_Digit is another boolean property, which checks whether a string contains only valid hexadecimal digits:

/^\p{ASCII_Hex_Digit}+$/u.test("0123456789ABCDEF") // ✅
/^\p{ASCII_Hex_Digit}+$/u.test("h") // ❌

There are many other boolean properties that you can check simply by adding their name in curly braces, including Uppercase , Lowercase , White_Space , Alphabetic , Emoji and others:

/^\p{Lowercase}$/u.test("h") // ✅
/^\p{Uppercase}$/u.test("H") // ✅
/^\p{Emoji}+$/u.test("H") // ❌
/^\p{Emoji}+$/u.test("🙃🙃") // ✅

In addition to these binary properties, you can test any Unicode character property to match a specific value. In the example below, I check whether a string is written in Greek or Latin alphabet:

/^\p{Script=Greek}+$/u.test("ελληνικά") // ✅
/^\p{Script=Latin}+$/u.test("hey") // ✅
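
Property escapes are also useful for extracting text rather than just testing it. The sketch below uses \p{L} (any letter) and compares it with \w, which only knows ASCII letters; the sample strings are made up, and accented letters are written as escapes to keep the example unambiguous:

```javascript
// "naïve café 123" with the accented letters written as \u escapes.
const sample = "na\u00EFve caf\u00E9 123";

const letterWords = sample.match(/\p{L}+/gu); // ["naïve", "café"]
const asciiWords = "na\u00EFve".match(/\w+/g); // ["na", "ve"]: \w stops at ï

// Pull the Greek part out of a mixed string.
const greekPart = "abc αβγ".match(/\p{Script=Greek}+/u)[0]; // "αβγ"

console.log(letterWords, asciiWords, greekPart);
```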

Examples

Extracting a number from a string

Let's assume that there is a string containing only one number that needs to be extracted. /\d+/ should do this:

"Test 123123329".match(/\d+/) // Array [ "123123329" ]

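If the string may contain several numbers, adding the g flag makes match return all of them. A minimal sketch with a made-up input; the fallback to an empty array covers the case where match returns null:

```javascript
// Collect every run of digits and convert each one to a number.
const order = "Order 66 shipped 3 items for 1200 credits";
const numbers = (order.match(/\d+/g) || []).map(Number); // [66, 3, 1200]

// With no digits at all, match returns null, hence the || [] fallback.
const none = ("no digits here".match(/\d+/g) || []).map(Number); // []

console.log(numbers, none);
```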
Search for an email address

The simplest approach is to check for non-whitespace characters before and after the @ sign, using \S:

/(\S+)@(\S+)\.(\S+)/
/(\S+)@(\S+)\.(\S+)/.exec("copesc@gmail.com") //["copesc@gmail.com", "copesc", "gmail", "com"]

However, this is a simplified example, since it also accepts many invalid email addresses.

Capture text between double quotes

Let's imagine that you have a string that contains text enclosed in double quotes and you need to extract this text.

The best way to do this is to use a capturing group: we know the match must start and end with the " character, so the pattern is easy to write, and the group lets us leave the quotes out of the result.

We will find what we need in result[1]:

const hello = 'Hello "nice flower"'
const result = /"([^"]*)"/.exec(hello) //Array [ "\"nice flower\"", "nice flower" ]

Getting content from an HTML tag

For example, get the content of a span tag, while allowing any number of attributes on the tag:

/<span\b[^>]*>(.*?)<\/span>/
/<span\b[^>]*>(.*?)<\/span>/.exec("test") // null
/<span\b[^>]*>(.*?)<\/span>/.exec("<span>test</span>") // ["<span>test</span>", "test"]
/<span\b[^>]*>(.*?)<\/span>/.exec('<span class="x">test</span>') // ['<span class="x">test</span>', "test"]

The cheat sheet is a general guide to regular expression patterns without taking into account the specifics of any language. It is presented in the form of a table that fits on one printed sheet of A4 size. Created under a Creative Commons license based on a cheat sheet authored by Dave Child.

Remember that different programming languages support regular expressions to varying degrees, so you may encounter a situation where some of the features shown do not work. For those who are just getting acquainted with regular expressions, this translation of the author's comments to the cheat sheet is offered. It will introduce you to some of the techniques used in building regular expression patterns.

Anchors in regular expressions indicate the beginning or end of something, for example of a line or a word. They are represented by certain symbols. For example, a pattern matching a string that starts with a digit would look like this:

^[0-9]+

Here the ^ character denotes the beginning of the line. Without it, the pattern would match any string containing a digit.

Character classes in regular expressions match a certain set of characters at once. For example, \d matches any digit from 0 to 9 inclusive, \w matches letters and digits, and \W matches all characters other than letters and digits. A pattern matching letters, digits and whitespace looks like this:

[\w\s]

POSIX

POSIX is a relatively new addition to the regular expression family. The idea, as with character classes, is to use shortcuts that represent some group of characters.

Almost everyone has trouble understanding assertions at first, but as you become more familiar with them, you'll find yourself using them quite often. Assertions provide a way to say, "I want to find every word in this document that includes the letter q and is not followed by werty."

[^\s]*q(?!werty)[^\s]*

The above code starts by searching for any characters other than whitespace ([^\s]*) followed by q . The parser then reaches a lookahead assertion. This automatically makes the preceding element (character, group, or character class) conditional: it will match the pattern only if the assertion is true. In our case, the assertion is negative (?!), that is, it will be true if what it looks for is not found.

So, the parser checks the next few characters against the proposed pattern (werty). If they are found, the assertion is false, which means the character q will be "ignored", that is, it will not match the pattern. If werty is not found, the assertion is true, and everything is in order with q. Then the search continues for any characters other than whitespace ([^\s]*).
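
The behaviour described above can be checked in JavaScript, which supports this kind of negative lookahead. The sample sentence is made up for illustration:

```javascript
// Words containing a q that is not immediately followed by "werty".
const qWord = /[^\s]*q(?!werty)[^\s]*/g;

const found = "quick qwerty antique".match(qWord); // ["quick", "antique"]

console.log(found);
```

The word qwerty is skipped because its only q is followed by werty, so the assertion fails at that position.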

This group contains sample patterns. With their help, you can see how regular expressions can be used in daily practice. However, note that they will not necessarily work in every programming language, as each has its own unique features and varying levels of regular expression support.

Quantifiers allow you to define a part of a pattern that must be repeated several times in a row. For example, if you want to find out whether a document contains a run of 10 to 20 (inclusive) letters "a", you can use this pattern:

a{10,20}

By default, quantifiers are “greedy”. Therefore, the quantifier +, meaning “one or more times,” will correspond to the maximum possible value. Sometimes this causes problems, in which case you can tell the quantifier to stop being greedy (become "lazy") by using a special modifier. Look at this code:

".*"

This pattern matches text enclosed in double quotes. However, your source line could be something like this:

<a href="helloworld.htm" title="Hello World">Hello World</a>

The above pattern will find the following substring in this line:

"helloworld.htm" title="Hello World"

It turned out to be too greedy, grabbing the largest piece of text it could.

".*?"

This pattern also matches any characters enclosed in double quotes. But the lazy version (notice the ? modifier) looks for the smallest possible match, and will therefore find each double-quoted substring individually:

"helloworld.htm" "Hello World"

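The difference between the greedy and lazy patterns can be reproduced in JavaScript. The sketch assumes the source line is an anchor tag with href and title attributes, which is an illustrative input:

```javascript
// Assumed sample line with two double-quoted attribute values.
const html = '<a href="helloworld.htm" title="Hello World">Hello World</a>';

// Greedy: runs from the first quote to the last quote in the line.
const greedyMatch = html.match(/".*"/)[0];
// '"helloworld.htm" title="Hello World"'

// Lazy with the g flag: each quoted value is found on its own.
const lazyMatches = html.match(/".*?"/g);
// ['"helloworld.htm"', '"Hello World"']

console.log(greedyMatch, lazyMatches);
```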
Regular expressions use certain characters to represent different parts of a pattern. However, a problem arises if you need to find one of these characters in a string as an ordinary character. A dot, for example, in a regular expression means "any character other than a line break." If you need to find a dot in a string, you can't just use "." as the pattern, because it would match almost anything. So, you need to tell the parser that this dot should be treated as a literal dot and not as "any character". This is done using an escape character.

An escape character preceding a character such as a dot causes the parser to ignore its special function and treat it as a normal character. There are several characters that require such escaping in most patterns and languages. You can find them in the lower right corner of the cheat sheet ("Meta Symbols").

The pattern for finding a dot is:

\.

Other special characters in regular expressions match unusual elements in text. Line breaks and tabs, for example, can be typed on the keyboard but are likely to confuse programming languages. The escape character is used here to tell the parser to treat the next character as a special character rather than a regular letter or number.

String substitution is described in detail in the next paragraph, “Groups and Ranges,” but the existence of “passive” groups should be mentioned here. These are groups that are ignored during substitution, which is very useful if you want to use an "or" condition in a pattern, but do not want that group to take part in the substitution.

Groups and ranges are very, very useful. It's probably easier to start with ranges. They allow you to specify a set of suitable characters. For example, to check whether a string contains hexadecimal digits (0 to 9 and A to F), you would use the following range:

[A-Fa-f0-9]

To check the opposite, use a negative range, which in our case fits any character except numbers from 0 to 9 and letters from A to F:

[^A-Fa-f0-9]

Groups are most often used when an "or" condition is needed in a pattern; when you need to refer to part of a template from another part of it; and also when substituting strings.

Using "or" is very simple: the following pattern looks for "ab" or "bc":

(ab|bc)

If a regular expression needs to refer to one of the previous groups, use \n , where n is the number of the desired group. You may want a pattern that matches the letters "aaa" or "bbb" followed by a number and then the same three letters. This pattern is implemented using groups:

(aaa|bbb)[0-9]+\1

The first part of the pattern looks for "aaa" or "bbb", combining the letters found into a group. This is followed by a search for one or more digits ([0-9]+), and finally \1. The last part of the pattern references the first group and looks for the same thing. It looks for the text already matched by the first group, not for anything that would match the group's pattern. So "aaa123bbb" will not satisfy the above pattern, since \1 will look for "aaa" after the number.
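
The backreference behaviour can be verified directly in JavaScript, where \1 works the same way. The test strings are chosen only for illustration:

```javascript
// \1 must repeat the exact text captured by the first group.
const repeated = /(aaa|bbb)[0-9]+\1/;

const sameLetters = repeated.test("aaa123aaa");  // true
const mixedLetters = repeated.test("aaa123bbb"); // false: \1 expects "aaa" again
const otherBranch = repeated.test("bbb7bbb");    // true

console.log(sameLetters, mixedLetters, otherBranch);
```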

One of the most useful tools in regular expressions is string substitution. When replacing text, you can reference a matched group using $n . Let's say you want to highlight every occurrence of the word "wish" in a text in bold. To do this, you would use a regular expression replace function, which might look like this:

replace(pattern, replacement, subject)

The first parameter will be something like this (you may need a few extra characters for this particular function):

([^A-Za-z0-9])(wish)([^A-Za-z0-9])

It will find any occurrence of the word "wish" along with the preceding and following characters, as long as they are not letters or digits. Then your substitution could be like this:

$1<b>$2</b>$3

It will replace the entire string matched by the pattern. The replacement starts with the first character found (the one that is not a letter or a digit), referenced as $1 . Without this, we would simply remove that character from the text. The same goes for the end of the substitution ($3). In the middle we've added an HTML tag for bold (of course, you can use CSS or <strong> instead), wrapped around the second group found by the pattern ($2).
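
In JavaScript the same substitution can be written with String.replace. The sketch below uses a made-up sentence and also shows a word-boundary variant, which is an alternative to the three-group pattern rather than the cheat sheet's own approach:

```javascript
const sentence = "I wish you a wish, not wishful thinking.";

// Three groups: the character before, the word, the character after.
const bolded = sentence.replace(
  /([^A-Za-z0-9])(wish)([^A-Za-z0-9])/g,
  "$1<b>$2</b>$3"
);

// Alternative sketch: \b matches the word edges without consuming the
// neighbouring characters, and $& inserts the whole match.
const boldedWithBoundaries = sentence.replace(/\bwish\b/g, "<b>$&</b>");

console.log(bolded);
// "I <b>wish</b> you a <b>wish</b>, not wishful thinking."
```

Both versions leave "wishful" untouched. The boundary variant also handles a "wish" at the very start or end of the text, where there is no neighbouring character for the first or third group to match.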

Pattern modifiers are used in several languages, most notably Perl. They allow you to change how the parser works. For example, the i modifier causes the parser to ignore case.

Regular expressions in Perl are surrounded by the same character at the beginning and at the end. This can be any character (most often “/” is used), and it looks like this:

/pattern/

Modifiers are added to the end of this line, like this:

/pattern/i

Finally, the last part of the table contains meta characters. These are characters that have special meaning in regular expressions. So if you want to use one of them as a regular character, it needs to be escaped. To check for the presence of an opening parenthesis in the text, use the following pattern:

\(


Some people, when faced with a problem, think: “Oh, I’ll use regular expressions.” Now they have two problems.
Jamie Zawinski

Yuan-Ma said, “It takes a lot of force to cut wood across the grain of the wood. It takes a lot of code to program across the problem structure.”
Master Yuan-Ma, “Book of Programming”

Programming tools and techniques survive and spread in a chaotic evolutionary manner. Sometimes it is not the beautiful and brilliant that survive, but simply those that work well enough in their field - for example, if they are integrated into another successful technology.

In this chapter, we will discuss one such tool: regular expressions. They are a way to describe patterns in string data. They form a small, separate language that is part of JavaScript and of many other languages and tools.

Regular expressions are both very strange and extremely useful. Their syntax is cryptic, and the programming interface JavaScript provides for them is clunky. But they are a powerful tool for inspecting and processing strings. Once you understand them, you will become a more effective programmer.

Creating a regular expression

A regular expression is a type of object. It can be created by calling the RegExp constructor, or by writing the pattern as a literal enclosed in forward slashes.

var re1 = new RegExp("abc");
var re2 = /abc/;

Both of these regular expressions represent the same pattern: the character “a” followed by the character “b” followed by the character “c”.

If you use the RegExp constructor, the pattern is written as a normal string, so the usual string rules for backslashes apply.

The second notation, where the pattern appears between slashes, treats backslashes differently. First, since a forward slash ends the pattern, we need to put a backslash before any forward slash that we want to be part of the pattern. In addition, backslashes that are not part of special character codes like \n are preserved (rather than ignored, as they are in strings) and change the meaning of the pattern. Some characters, such as the question mark and the plus sign, have a special meaning in regular expressions and must be preceded by a backslash if they are meant to represent the character itself.

var eighteenPlus = /eighteen\+/;

To know which characters must be preceded by a backslash, you need to know the full list of characters that have a special meaning in regular expressions. For now that may not be realistic, so when in doubt, just put a backslash before any character that is not a letter, digit, or whitespace.
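To make the difference between the two notations concrete, here is a minimal sketch that builds the same pattern both ways. The variable names are made up for this illustration.

```javascript
// The same pattern (digits followed by a literal plus sign), written twice.
var literal = /\d+\+/;
// Inside a string, every backslash must itself be escaped to reach the pattern.
var constructed = new RegExp("\\d+\\+");

console.log(literal.source === constructed.source); // → true
console.log(constructed.test("18+"));               // → true

// With a single backslash, the string "\d" is just "d", so the pattern changes.
var broken = new RegExp("\d+");
console.log(broken.source);       // → d+
console.log(broken.test("1992")); // → false
```

The last two lines show the typical mistake: the string literal swallows the backslash before the RegExp constructor ever sees it.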

Checking for matches

Regular expression objects have several methods. The simplest one is test. If you pass it a string, it returns a Boolean telling you whether the string contains a match of the pattern.

console.log(/abc/.test("abcde"));
// → true
console.log(/abc/.test("abxde"));
// → false

A regular expression consisting only of non-special characters simply represents that sequence of characters. If abc occurs anywhere in the string we are testing (not just at the start), test returns true.

Looking for a set of characters

You could also find out whether a string contains abc with a call to indexOf. Regular expressions allow us to go further and express more complicated patterns.

Say we want to match any digit. Putting a set of characters between square brackets in a regular expression makes that part of the expression match any of the characters between the brackets.

Both of the following expressions match all strings that contain a digit.

console.log(/[0123456789]/.test("in 1992"));
// → true
console.log(/[0-9]/.test("in 1992"));
// → true

Within square brackets, a dash between two characters indicates a range of characters, where the ordering is determined by the characters' Unicode numbers. The characters 0 to 9 sit right next to each other in this ordering (codes 48 to 57), so [0-9] covers all of them and matches any digit.

Several character groups have their own built-in abbreviations.

\d Any digit character
\w An alphanumeric character (“word character”)
\s Any whitespace character (space, tab, newline, and similar)
\D A character that is not a digit
\W A nonalphanumeric character
\S A nonwhitespace character
. Any character except for newline
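The shorthand classes above can be tried out directly. The following is a small sketch with an invented sample string, not an exhaustive test.

```javascript
var sample = "Room 42\tis open";

var hasDigit = /\d/.test(sample);          // "4" is a digit
var hasSpace = /\s/.test(sample);          // there is a space and a tab
var hasNonWord = /\W/.test("abc_123");     // letters, digits and _ all count as \w
var dotMatchesNewline = /./.test("\n");    // the dot skips newlines
var hasNonDigit = /\D/.test("2003");       // every character is a digit

console.log(hasDigit);          // → true
console.log(hasSpace);          // → true
console.log(hasNonWord);        // → false
console.log(dotMatchesNewline); // → false
console.log(hasNonDigit);       // → false
```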

So you could match a date and time format like 30-01-2003 15:20 with the following expression:

var dateTime = /\d\d-\d\d-\d\d\d\d \d\d:\d\d/;
console.log(dateTime.test("30-01-2003 15:20"));
// → true
console.log(dateTime.test("30-Jan-2003 15:20"));
// → false

Looks terrible, doesn't it? There are too many backslashes, which makes the pattern difficult to understand. We'll improve it slightly later.

These backslash codes can also be used inside square brackets. For example, [\d.] means any digit or a period character. Notice that the period inside the square brackets loses its special meaning and becomes simply a period. The same applies to other special characters, such as +.

You can invert a set of characters - that is, say that you need to find any character except those that are in the set - by placing a ^ sign immediately after the opening square bracket.

var notBinary = /[^01]/;
console.log(notBinary.test("1100100010100110"));
// → false
console.log(notBinary.test("1100100010200110"));
// → true

Repeating parts of the pattern

We know how to match a single digit. What if we want to match a whole number, that is, a sequence of one or more digits?

When you put a plus sign (+) after something in a regular expression, it indicates that the element may be repeated more than once. Thus, /\d+/ matches one or more digit characters.

console.log(/'\d+'/.test("'123'"));
// → true
console.log(/'\d+'/.test("''"));
// → false
console.log(/'\d*'/.test("'123'"));
// → true
console.log(/'\d*'/.test("''"));
// → true

The asterisk * has almost the same meaning, but it allows the pattern to occur zero times. If something is followed by an asterisk, then it never prevents the pattern from being in the line - it just appears there zero times.

A question mark makes a part of a pattern optional, meaning it may occur zero times or one time. In the following example, the character u is allowed to occur, but the pattern also matches when it is missing.

var neighbor = /neighbou?r/;
console.log(neighbor.test("neighbour"));
// → true
console.log(neighbor.test("neighbor"));
// → true

Curly braces are used to indicate that a pattern should occur a precise number of times. Putting {4} after an element requires it to occur exactly four times. You can also specify a range this way: {2,4} means the element must occur at least twice and at most four times.

Here is another version of the date and time pattern that allows single-digit as well as double-digit days, months, and hours. It is also a little more readable.

var dateTime = /\d{1,2}-\d{1,2}-\d{4} \d{1,2}:\d{2}/;
console.log(dateTime.test("30-1-2003 8:45"));
// → true

You can also specify open-ended ranges by omitting the number after the comma: {5,} means five or more times. JavaScript does not treat a range with the first number omitted as a repetition count, so write {0,5} to allow zero to five occurrences.
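Open-ended ranges are easy to check in a JavaScript engine. The sketch below uses made-up test strings and assumes patterns written without the u flag. Note that a brace expression with the first number left out is not read as a repetition count in JavaScript, so {0,5} is the form to use for "at most five".

```javascript
// {5,} means five or more repetitions.
var fourAs = /^a{5,}$/.test("aaaa");
var sevenAs = /^a{5,}$/.test("aaaaaaa");
console.log(fourAs);  // → false
console.log(sevenAs); // → true

// For "at most five", give the lower bound explicitly.
var none = /^a{0,5}$/.test("");
var sixAs = /^a{0,5}$/.test("aaaaaa");
console.log(none);  // → true
console.log(sixAs); // → false

// {,5} is not a quantifier here: it matches the literal text "{,5}".
var asCount = /^a{,5}$/.test("aaa");
var asText = /^a{,5}$/.test("a{,5}");
console.log(asCount); // → false
console.log(asText);  // → true
```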

Grouping Subexpressions

To use an operator like * or + on more than one element at a time, you can use parentheses. A part of a regular expression that is enclosed in parentheses counts as a single element as far as the operators following it are concerned.

var cartoonCrying = /boo+(hoo+)+/i;
console.log(cartoonCrying.test("Boohoooohoohoooo"));
// → true

The first and second pluses only apply to the second o in the words boo and hoo. The third + refers to the whole group (hoo+), finding one or more such sequences.

The letter i at the end of the expression makes the regular expression case-insensitive - so that B matches b.

Matches and Groups

The test method is the simplest way to match a regular expression. It only tells you whether a match was found or not. Regular expressions also have an exec method, which returns null if no match was found and otherwise returns an object with information about the match.

var match = /\d+/.exec("one two 100");
console.log(match);
// → ["100"]
console.log(match.index);
// → 8

The object returned by exec has an index property that tells us where in the string the successful match begins. Other than that, the object looks like (and in fact is) an array of strings, whose first element is the string that was matched. In the previous example, this is the sequence of digits we were looking for.

Strings have a match method that behaves similarly.

console.log("one two 100".match(/\d+/));
// → ["100"]

When a regular expression contains subexpressions grouped by parentheses, the text that matches these groups will also appear in the array. The first element is always a complete match. The second is the part that matched the first group (the one whose parentheses occurred first), then the second group, and so on.

var quotedText = /'([^']*)'/;
console.log(quotedText.exec("she said 'hello'"));
// → ["'hello'", "hello"]

When a group is not found at all (for example, if it is followed by a question mark), its position in the array is undefined. If a group matches several times, then only the last match will be in the array.

console.log(/bad(ly)?/.exec("bad"));
// → ["bad", undefined]
console.log(/(\d)+/.exec("123"));
// → ["123", "3"]

Groups are useful for retrieving parts of strings. If we don't just want to check whether a string has a date, but extract it and create an object representing the date, we can enclose the sequences of numbers in parentheses and select the date from the result of exec.

But first, a little digression in which we will learn the preferred way to store date and time in JavaScript.

Date type

JavaScript has a standard object type for dates—more specifically, moments in time. It's called Date. If you simply create a date object using new, you will get the current date and time.

console.log(new Date());
// → Sun Nov 09 2014 00:07:57 GMT+0300

The exact text printed depends on the time zone of the machine running the code. You can also create an object for a specific time.

console.log(new Date(2015, 9, 21));
// → Wed Oct 21 2015 00:00:00 GMT+0300
console.log(new Date(2009, 11, 9, 12, 59, 59, 999));
// → Wed Dec 09 2009 12:59:59 GMT+0300

JavaScript uses a convention where month numbers start with a zero and day numbers start with a one. This is stupid and ridiculous. Be careful.

The last four arguments (hours, minutes, seconds and milliseconds) are optional and are set to zero if missing.

Timestamps are stored as the number of milliseconds that have passed since the beginning of 1970. For times before 1970, negative numbers are used (this follows the Unix time convention, which was created around that time). The date object's getTime method returns this number. As you can imagine, it is big.

console.log(new Date(2013, 11, 19).getTime());
// → 1387407600000
console.log(new Date(1387407600000));
// → Thu Dec 19 2013 00:00:00 GMT+0100 (CET)

If you give the Date constructor one argument, it is treated as this number of milliseconds. You can get the current millisecond value by creating a Date object and calling the getTime method, or by calling the Date.now function.

The Date object has the methods getFullYear, getMonth, getDate, getHours, getMinutes, and getSeconds to retrieve its components. There is also a getYear method, which returns the rather useless year minus 1900 (93 for 1993, 114 for 2014) and should be avoided.
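A short sketch of the accessor methods and of the timestamp round trip follows. The variable names are invented for this example; the getters used here work in local time, just like the multi-argument constructor, so the results do not depend on the time zone.

```javascript
// Months are zero-based, so 11 means December.
var moment = new Date(2009, 11, 9, 12, 59, 59, 999);

console.log(moment.getFullYear()); // → 2009
console.log(moment.getMonth());    // → 11
console.log(moment.getDate());     // → 9
console.log(moment.getHours());    // → 12

// A timestamp round-trips through the one-argument constructor.
var copy = new Date(moment.getTime());
console.log(copy.getTime() === moment.getTime()); // → true

// Date.now() gives the current timestamp without creating a Date object.
var nowType = typeof Date.now();
console.log(nowType); // → number
```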

By putting parentheses around the parts of the pattern that we are interested in, we can create a date object directly from a string.

function findDate(string) {
  var dateTime = /(\d{1,2})-(\d{1,2})-(\d{4})/;
  var match = dateTime.exec(string);
  return new Date(Number(match[3]),
                  Number(match[2]) - 1,
                  Number(match[1]));
}
console.log(findDate("30-1-2003"));
// → Thu Jan 30 2003 00:00:00 GMT+0100 (CET)

Word and line boundaries

Unfortunately, findDate will just as happily extract the meaningless date 00-1-3000 from the string "100-1-30000". The match can happen anywhere in the string, so in this case it will simply start at the second character and end at the second to last character.

If we want to enforce that the match must span the whole string, we can add the markers ^ and $. The caret matches the start of the input string, and the dollar sign matches the end. So /^\d+$/ matches a string consisting entirely of one or more digits, /^!/ matches any string that starts with an exclamation mark, and /x^/ does not match any string (there cannot be an x before the start of the string).

If, on the other hand, we just want to make sure that the date starts and ends on a word boundary, we can use the marker \b. A word boundary can be the start or end of the string, or any point in the string that has a word character (as in \w) on one side and a nonword character on the other.

console.log(/cat/.test("concatenate"));
// → true
console.log(/\bcat\b/.test("concatenate"));
// → false

Note that a boundary marker does not represent an actual character. It simply enforces that the regular expression matches only when a certain condition holds at the place where it appears in the pattern.
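As an illustration of how word boundaries address the "100-1-30000" problem, here is a hedged sketch of a stricter date finder. The name findDateStrict and the choice to return null when nothing matches are assumptions made for this example, not part of the original function.

```javascript
// Hypothetical variant of findDate: \b on both sides, null when nothing matches.
function findDateStrict(string) {
  var dateTime = /\b(\d{1,2})-(\d{1,2})-(\d{4})\b/;
  var match = dateTime.exec(string);
  if (match === null) return null;
  // Groups are day, month, year; months are zero-based in the Date constructor.
  return new Date(Number(match[3]), Number(match[2]) - 1, Number(match[1]));
}

console.log(findDateStrict("100-1-30000"));
// → null
console.log(findDateStrict("due on 30-1-2003").getFullYear());
// → 2003
```

With the boundaries in place, the three-digit "100" and the five-digit "30000" can no longer be trimmed down to fit the pattern.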

Choice patterns

Let's say you need to find out whether the text contains not just a number, but a number followed by pig, cow, or chicken in the singular or plural.

We could write three regular expressions and test them in turn, but there is a better way. The pipe character (|) denotes a choice between the pattern to its left and the pattern to its right. So we can say this:

var animalCount = /\b\d+ (pig|cow|chicken)s?\b/;
console.log(animalCount.test("15 pigs"));
// → true
console.log(animalCount.test("15 pigchickens"));
// → false

Parentheses delimit the portion of the pattern to which | is applied, and many such operators can be placed one after the other to indicate a choice from more than two options.

The mechanics of matching

Regular expressions can be thought of as flowcharts. The following diagram describes the livestock example above.

An expression matches a string if it is possible to find a path from the left side of the diagram to the right. We keep a current position in the string, and every time we move through a box, we check that the part of the string right after our current position matches the contents of that box.

So matching our regular expression against the string “the 3 pigs” goes through the flowchart like this:

- At position 4 there is a word boundary, so we can move past the first box.
- Still at position 4, we find a digit, so we can also move past the second box.
- At position 5, one path loops back to before the second (digit) box, while the other moves forward through the box that holds a single space character. There is a space here, not a digit, so we take the second path.
- We are now at position 6, the start of “pigs”, and at the three-way branch in the diagram. There is no “cow” or “chicken” here, but there is “pig”, so we take that branch.
- At position 9, after the three-way branch, one path skips the “s” box and goes straight to the final word boundary, while the other goes through “s”. There is an “s” here, so we go through it.
- At position 10 we are at the end of the string, and only a word boundary can match. The end of a string counts as a word boundary, so we pass through the last box and have successfully matched the pattern.

Conceptually, a regular expression engine looks for a match as follows: it starts at the beginning of the string and tries to match there. In our case there is a word boundary at that point, so it gets past the first box, but there is no digit, so it fails at the second box. Then it moves on to the second character of the string and tries to begin a new match there. And so on, until it either finds a match or reaches the end of the string and concludes that there is no match.
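The "try position 0, then position 1, and so on" strategy can be imitated from the outside. The following is only a sketch of the idea, not how an engine is actually implemented: the helper name scanForMatch is invented, it relies on the sticky flag (y, available since ES2015), and it ignores any flags on the pattern it is given.

```javascript
// Hypothetical helper: try the pattern at position 0, then 1, and so on.
// The sticky flag makes exec match only at exactly lastIndex.
function scanForMatch(pattern, string) {
  var anchored = new RegExp(pattern.source, "y");
  for (var start = 0; start <= string.length; start++) {
    anchored.lastIndex = start;
    var match = anchored.exec(string);
    if (match) return {index: start, text: match[0]};
  }
  return null; // reached the end without finding a match
}

var animalCount = /\b\d+ (pig|cow|chicken)s?\b/;
var found = scanForMatch(animalCount, "the 3 pigs");
console.log(found.index); // → 4
console.log(found.text);  // → 3 pigs
console.log(scanForMatch(animalCount, "no animals here")); // → null
```

Positions 0 through 3 fail for the reasons described above; the first attempt that succeeds starts at position 4.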

Backtracking

The regular expression /\b([01]+b|\d+|[\da-f]+h)\b/ matches either a binary number followed by a b, a decimal number without a suffix, or a hexadecimal number (the digits 0 to 9 or the letters a to f) followed by an h. The corresponding diagram:

When matching this expression, it will often happen that the top (binary) branch is entered even though the input does not actually contain a binary number. With the string “103”, for example, it becomes clear only at the 3 that we are on the wrong branch. The string does match the expression, just not the branch we are currently in.

So the matcher backtracks. When entering a branch, it remembers its current position (in this case, the start of the string, just past the first word boundary) so that it can go back and try another branch if the current one does not work out. For the string “103”, after encountering the 3, it goes back and tries the decimal branch. That one works, so a match is reported.

The matcher stops as soon as it finds a full match. This means that if several branches could potentially match a string, only the first one (in the order in which the branches appear in the regular expression) is used.
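The effect of branch order can be observed directly. This sketch assumes the number pattern is written with the character classes [01] and [\da-f]; the variable name and test strings are invented.

```javascript
var numberPattern = /\b([01]+b|\d+|[\da-f]+h)\b/;

// "103" fails the binary branch at the 3, then succeeds as a decimal number.
var decimal = numberPattern.exec("103")[0];
// "101b" succeeds on the first branch, so the others are never tried.
var binary = numberPattern.exec("101b")[0];
console.log(decimal); // → 103
console.log(binary);  // → 101b

// Order matters when several branches could match at the same position.
var shortFirst = /a|ab/.exec("ab")[0];
var longFirst = /ab|a/.exec("ab")[0];
console.log(shortFirst); // → a
console.log(longFirst);  // → ab
```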

Backtracking also happens for repetition operators like + and *. If you match /^.*x/ against "abcxe", the .* part first tries to consume the whole string. The engine then realizes that it also needs an x to match the pattern. Since there is no x past the end of the string, the star operator tries to match one character less. But there is no x after abcx either, so it backtracks again, matching the star to just abc. Now it finds an x where it needs it and reports a successful match from positions 0 to 4.

It is possible to write regular expressions that do a lot of backtracking. This problem occurs when a pattern can match a piece of input in many different ways. For example, if we get confused while writing a binary-number regular expression, we might accidentally write something like /([01]+)+b/.

If such a pattern is matched against a long series of zeros and ones with no trailing b, the matcher first goes through the inner loop until it runs out of digits. Then it notices that there is no b, so it backtracks one position, goes through the outer loop once, and gives up again, trying to backtrack out of the inner loop once more. It continues to try every possible route through these two loops. This means the amount of work doubles with each additional character. Even for a few dozen characters, the match will take a very long time.
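One way out of this trap is to remove the nesting. The sketch below compares the accidental pattern with a flattened one on a handful of short, made-up inputs; it only checks that both give the same answers and deliberately keeps the inputs short, so it does not measure the slowdown itself.

```javascript
var nested = /([01]+)+b/; // nested repetition, prone to heavy backtracking
var flat = /[01]+b/;      // a single loop that accepts the same strings

var inputs = ["0110b", "b", "0110", "01210b", "1b"];
var agree = inputs.every(function(input) {
  return nested.test(input) === flat.test(input);
});
console.log(agree); // → true
```

Because the flat version has only one way to split a run of digits, a failed match is discovered after a small amount of work instead of after trying every combination.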

replace method

Strings have a replace method that can be used to replace part of the string with another string.

console.log("papa".replace("p", "m"));
// → mapa

The first argument can also be a regular expression, in which case the first match of the regular expression is replaced. When the g option (for global) is added to the regular expression, all matches in the string are replaced, not just the first.

console.log("Borobudur".replace(/[ou]/, "a"));
// → Barobudur
console.log("Borobudur".replace(/[ou]/g, "a"));
// → Barabadar

It would have been sensible if the choice between replacing one match or all matches were made through an additional argument to replace or through a separate method such as replaceAll. Unfortunately, the choice relies on a property of the regular expression itself. (ECMAScript 2021 did later add a replaceAll method to strings.)

The real power of using regular expressions with replace comes from the fact that we can refer back to matched groups in the replacement string. For example, say we have a big string containing people's names, one name per line, in the format "Last Name, First Name". If we want to swap these names and remove the comma to get a simple “First Name Last Name” format, we can use the following code:

console.log(
  "Hopper, Grace\nMcCarthy, John\nRitchie, Dennis"
    .replace(/([\w ]+), ([\w ]+)/g, "$2 $1"));
// → Grace Hopper
//   John McCarthy
//   Dennis Ritchie

The $1 and $2 in the replacement string refer to the parenthesized groups in the pattern. $1 is replaced by the text that matched the first group, $2 by the second, and so on, up to $9. The whole match can be referred to with $&.
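A brief sketch of the replacement codes in action, using invented sample strings:

```javascript
// $& inserts the whole match, here wrapping every number in brackets.
var bracketed = "3 pigs and 12 cows".replace(/\d+/g, "[$&]");
console.log(bracketed);
// → [3] pigs and [12] cows

// $1, $2 and $3 can be reordered freely in the replacement string.
var reordered = "2003-01-30".replace(/(\d+)-(\d+)-(\d+)/, "$3/$2/$1");
console.log(reordered);
// → 30/01/2003
```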

You can also pass a function as the second argument. For each replacement, a function will be called whose arguments will be the found groups (and the entire matching part of the line), and its result will be inserted into a new line.

Simple example:

var s = "the cia and fbi";
console.log(s.replace(/\b(fbi|cia)\b/g, function(str) {
  return str.toUpperCase();
}));
// → the CIA and FBI

Here's a more interesting one:

var stock = "1 lemon, 2 cabbages, and 101 eggs";
function minusOne(match, amount, unit) {
  amount = Number(amount) - 1;
  if (amount == 1) // only one left, remove the "s" at the end
    unit = unit.slice(0, unit.length - 1);
  else if (amount == 0)
    amount = "no";
  return amount + " " + unit;
}
console.log(stock.replace(/(\d+) (\w+)/g, minusOne));
// → no lemon, 1 cabbage, and 100 eggs

The code takes a string, finds all occurrences of numbers followed by a word, and returns a string with each number reduced by one.

The group (\d+) goes into the amount argument, and (\w+) goes into the unit argument. The function converts amount to a number - and this always works, because our pattern is \d+. And then makes changes to the word, in case there is only 1 item left.

Greed

It's easy to use replace to write a function that removes all comments from JavaScript code. Here's the first try:

function stripComments(code) {
  return code.replace(/\/\/.*|\/\*[^]*\*\//g, "");
}
console.log(stripComments("1 + /* 2 */3"));
// → 1 + 3
console.log(stripComments("x = 10;// ten!"));
// → x = 10;
console.log(stripComments("1 /* a */+/* b */ 1"));
// → 1  1

The part before the or operator matches two slash characters followed by any number of non-newline characters. The part for multiline comments is more involved. We use [^] (any character that is not in the empty set of characters) as a way to match any character. We cannot use a period here because block comments can continue on a new line, and a period does not match the newline character.

But the output of the previous example is incorrect. Why?

The [^]* part of the expression first matches as much as it can. If that causes the next part of the pattern to fail, the matcher moves back one character and tries again from there. In the example, the matcher first tries to match the whole rest of the string and then moves back from there. After going back four characters, it finds an occurrence of */ and matches that. This is not what we wanted: the intention was to match a single comment, not to go all the way to the end of the code and find the end of the last block comment.

Because of this behavior, we say that the repetition operators (+, *, ?, and {}) are greedy, meaning they match as much as they can and backtrack from there. If you put a question mark after them (+?, *?, ??, {}?), they become nongreedy and start by matching as little as possible, matching more only when the remaining pattern does not fit the smaller match.

And that's what we need. By forcing the asterisk to find matches in the minimum possible number of characters in a line, we consume only one block of comments, and no more.

function stripComments(code) {
  return code.replace(/\/\/.*|\/\*[^]*?\*\//g, "");
}
console.log(stripComments("1 /* a */+/* b */ 1"));
// → 1 + 1

A lot of bugs in regular expression programs can be traced to unintentionally using a greedy operator where a nongreedy one would work better. When using a repetition operator, consider the nongreedy variant first.

Dynamically creating RegExp objects

In some cases, the exact pattern is not known at the time the code is written. Say you want to look for the user's name in a piece of text and enclose it in underscore characters to make it stand out. Since you will know the name only once the program is running, you cannot use the slash-based notation.

But you can build up a string and use the RegExp constructor on it. Here's an example:

var name = "harry";
var text = "And Harry has a scar on his forehead.";
var regexp = new RegExp("\\b(" + name + ")\\b", "gi");
console.log(text.replace(regexp, "_$1_"));
// → And _Harry_ has a scar on his forehead.

When creating the \b boundary markers, we have to use two backslashes because we are writing them in a normal string, not in a slash-enclosed regular expression. The second argument to the RegExp constructor contains the options for the regular expression, in this case "gi" for global and case-insensitive.

But what if the name is “dea+hlrd” (because our user is a wannabe hacker)? That would result in a nonsensical regular expression that does not actually match the user's name.

To work around this, we can add backslashes before any character that we do not trust. Adding backslashes before alphabetic characters is a bad idea because things like \b and \n have a special meaning. But escaping everything that is not alphanumeric or whitespace is safe.

var name = "dea+hlrd";
var text = "This dea+hlrd is annoying everyone.";
var escaped = name.replace(/[^\w\s]/g, "\\$&");
var regexp = new RegExp("\\b(" + escaped + ")\\b", "gi");
console.log(text.replace(regexp, "_$1_"));
// → This _dea+hlrd_ is annoying everyone.
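The escaping step is a natural candidate for a small reusable function. The following is a sketch under the same rule as above (escape everything that is neither alphanumeric nor whitespace); the name escapeForRegExp is made up for this example.

```javascript
// Hypothetical helper: put a backslash before every character that is
// neither alphanumeric nor whitespace, so that it is matched literally.
function escapeForRegExp(text) {
  return text.replace(/[^\w\s]/g, "\\$&");
}

console.log(escapeForRegExp("dea+hlrd")); // → dea\+hlrd
console.log(escapeForRegExp("1.5*x"));    // → 1\.5\*x

var literalPlus = new RegExp(escapeForRegExp("a+b"));
console.log(literalPlus.test("a+b")); // → true
console.log(literalPlus.test("aab")); // → false
```

Without the escaping, the pattern a+b would match "aab" and fail on the literal text "a+b".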

search method

The indexOf method on strings cannot be called with a regular expression. But there is another method, search, that does expect a regular expression. Like indexOf, it returns the first index at which the expression was found, or -1 when it was not found.

console.log("  word".search(/\S/));
// → 2
console.log("    ".search(/\S/));
// → -1

Unfortunately, there is no way to tell the method to look for a match starting at a specific offset (as you can do with indexOf). That would be helpful.
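A common workaround is to search in the tail of the string and shift the result. This is only a sketch: the helper name searchFrom is invented, and because the pattern sees a cut-off string, markers such as ^ and \b may behave differently than they would at the same position in the full string.

```javascript
// Hypothetical helper: search in string.slice(offset) and add the offset back.
function searchFrom(string, pattern, offset) {
  var found = string.slice(offset).search(pattern);
  return found === -1 ? -1 : found + offset;
}

console.log(searchFrom("one 1 two 2", /\d/, 0));  // → 4
console.log(searchFrom("one 1 two 2", /\d/, 5));  // → 10
console.log(searchFrom("one 1 two 2", /\d/, 11)); // → -1
```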

lastIndex property

The exec method also does not provide a convenient way to start the search from a given position in the string. But it gives an inconvenient way.

Regular expression objects have properties. One of them is source, which contains the string that the expression was created from. Another is lastIndex, which controls, in some limited circumstances, where the next match will start.

Those circumstances are that the regular expression must have the global (g) option enabled, and the match must happen through the exec method. A more sensible solution would have been to simply allow an extra argument to be passed to exec, but sensibleness is not a defining characteristic of JavaScript's regular expression interface.

var pattern = /y/g;
pattern.lastIndex = 3;
var match = pattern.exec("xyzzy");
console.log(match.index);
// → 4
console.log(pattern.lastIndex);
// → 5

If the search was successful, the exec call updates the lastIndex property to point to the position after the found occurrence. If there was no success, lastIndex is set to zero - just like the lastIndex of the newly created object.

When using a global regular expression value for multiple exec calls, these automatic updates to lastIndex can cause problems. Your regular expression might accidentally start at an index that was left over from a previous call.

var digit = /\d/g;
console.log(digit.exec("here it is: 1"));
// → ["1"]
console.log(digit.exec("and now: 1"));
// → null

Another interesting effect of the g option is that it changes how the match method works. When called with this option, instead of returning an array similar to the result of exec, it finds all occurrences of the pattern in the string and returns an array of the found substrings.

console.log("Banana".match(/an/g));
// → ["an", "an"]

So be cautious with global regular expressions. The cases where they are necessary (calls to replace and places where you want to explicitly use lastIndex) are typically the only places where you want to use them.
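The same leftover-position effect shows up with the test method, which also reads and updates lastIndex on a global expression. The sketch below, with invented strings, shows the surprise and one way around it: resetting lastIndex before reusing the object.

```javascript
var digit = /\d/g;

var first = digit.test("a1");   // finds "1" and leaves lastIndex at 2
var leftover = digit.lastIndex;
var second = digit.test("b2");  // starts at index 2, past the end of "b2"
console.log(first);    // → true
console.log(leftover); // → 2
console.log(second);   // → false

// Resetting lastIndex before each reuse avoids the problem.
digit.lastIndex = 0;
var third = digit.test("b2");
console.log(third); // → true
```

Leaving off the g option when it is not needed sidesteps the issue entirely.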

Looping over matches

A common task is to scan through all occurrences of a pattern in a string in a way that gives access to the match object in the loop body. This can be done with lastIndex and exec.

var input = "A string with 3 numbers in it... 42 and 88.";
var number = /\b(\d+)\b/g;
var match;
while (match = number.exec(input))
  console.log("Found", match[1], "at", match.index);
// → Found 3 at 14
//   Found 42 at 33
//   Found 88 at 40

This makes use of the fact that the value of an assignment expression (=) is the assigned value. By using match = number.exec(input) as the condition in the while statement, we perform the match at the start of each iteration, save its result in a variable, and stop looping when no more matches are found.
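The loop can be wrapped in a helper that gathers the matches into an array. This is a sketch with an invented name, collectMatches; it insists on a global expression, because without the g option exec would return the first match forever.

```javascript
// Hypothetical helper: collect every match of a global pattern.
function collectMatches(pattern, input) {
  if (!pattern.global) throw new Error("pattern needs the g option");
  var results = [];
  var match;
  pattern.lastIndex = 0; // start from the beginning, whatever happened before
  while ((match = pattern.exec(input)) !== null) {
    results.push({text: match[0], index: match.index});
    // Guard against an endless loop on empty matches such as /\d*/g.
    if (match[0] === "") pattern.lastIndex++;
  }
  return results;
}

var numbers = collectMatches(/\b\d+\b/g,
                             "A string with 3 numbers in it... 42 and 88.");
console.log(numbers.length);   // → 3
console.log(numbers[1].text);  // → 42
console.log(numbers[1].index); // → 33
```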

Parsing INI files

To conclude the chapter, let's look at a problem using regular expressions. Imagine that we are writing a program that collects information about our enemies via the Internet automatically. (We won’t write the entire program, just the part that reads the settings file. Sorry.) The file looks like this:

searchengine=http://www.google.com/search?q=$1
spitefulness=9.7

; a semicolon is placed before comments
; each section refers to a different enemy
[larry]
fullname=Larry Doe
type=kindergarten bully
website=http://www.geocities.com/CapeCanaveral/11451

[gargamel]
fullname=Gargamel
type=evil wizard
outputdir=/home/marijn/enemies/gargamel

The exact file format (which is quite widely used, and is usually called INI) is as follows:

- Blank lines and lines starting with a semicolon are ignored.
- Lines enclosed in square brackets start a new section.
- Lines containing an alphanumeric identifier followed by an = character add a setting to the current section.

Everything else is incorrect data.

Our task is to convert such a string into an array of objects, each with a name property and an array of settings. One object is needed for each section, and another one is needed for global settings on top of the file.

Since the file needs to be parsed line by line, it's a good idea to start by breaking the file into lines. To do this, we used string.split("\n") in Chapter 6. Some operating systems use not one \n character for line breaks, but two - \r\n. Since the split method takes regular expressions as an argument, we can split lines using the expression /\r?\n/, allowing both single \n and \r\n between lines.
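The effect of splitting on /\r?\n/ rather than on "\n" can be seen in a short sketch; the sample strings are invented.

```javascript
var unixText = "first\nsecond\nthird";
var windowsText = "first\r\nsecond\r\nthird";

var unixLines = unixText.split(/\r?\n/);
var windowsLines = windowsText.split(/\r?\n/);
console.log(unixLines);    // → ["first", "second", "third"]
console.log(windowsLines); // → ["first", "second", "third"]

// Splitting on "\n" alone leaves a stray \r at the end of each line.
var naive = windowsText.split("\n");
console.log(naive[0] === "first\r"); // → true
```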

function parseINI(string) {
  // Start with an object that holds the top-level settings
  var currentSection = {name: null, fields: []};
  var categories = [currentSection];

  string.split(/\r?\n/).forEach(function(line) {
    var match;
    if (/^\s*(;.*)?$/.test(line)) {
      return;
    } else if (match = line.match(/^\[(.*)\]$/)) {
      currentSection = {name: match[1], fields: []};
      categories.push(currentSection);
    } else if (match = line.match(/^(\w+)=(.*)$/)) {
      currentSection.fields.push({name: match[1],
                                  value: match[2]});
    } else {
      throw new Error("The line '" + line + "' contains invalid data.");
    }
  });

  return categories;
}

The code goes over every line in the file, updating the current section object, currentSection, as it goes along. First, it checks whether the line can be ignored, using the expression /^\s*(;.*)?$/. Do you see how it works? The part between the parentheses matches comments, and the ? after it makes sure the expression also matches lines that contain only whitespace.

If the line is not a comment, the code checks to see if it starts a new section. If yes, it creates a new object for the current section, to which subsequent settings are added.

The last meaningful possibility is that the string is a normal setting, in which case it is added to the current object.

If none of the options work, the function throws an error.

Notice how the frequent use of ^ and $ ensures that the expression matches the entire string rather than just part of it. If you don't use them, the code will generally work, but will sometimes produce strange results and the error will be difficult to track down.

The if (match = string.match(...)) construct is similar to the trick of using assignment as a condition in a while loop. Often you don't know that the match call will succeed, so you can only access the result object inside an if block that checks for it. In order not to break the beautiful chain of if checks, we assign the search result to a variable and immediately use this assignment as a check.
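To see the parser at work, here is a usage sketch. The function is repeated so that the block runs on its own, and the sample input is invented for this example.

```javascript
function parseINI(string) {
  var currentSection = {name: null, fields: []};
  var categories = [currentSection];
  string.split(/\r?\n/).forEach(function(line) {
    var match;
    if (/^\s*(;.*)?$/.test(line)) {
      return; // blank line or comment
    } else if (match = line.match(/^\[(.*)\]$/)) {
      currentSection = {name: match[1], fields: []};
      categories.push(currentSection);
    } else if (match = line.match(/^(\w+)=(.*)$/)) {
      currentSection.fields.push({name: match[1], value: match[2]});
    } else {
      throw new Error("The line '" + line + "' contains invalid data.");
    }
  });
  return categories;
}

// Invented sample input: one global setting, a comment, and one section.
var sample = "spitefulness=9.7\n; a comment\n[gargamel]\n" +
             "fullname=Gargamel\ntype=evil wizard";
var parsed = parseINI(sample);

console.log(parsed.length);             // → 2
console.log(parsed[0].name);            // → null
console.log(parsed[0].fields[0].value); // → 9.7
console.log(parsed[1].name);            // → gargamel
console.log(parsed[1].fields[1].name);  // → type
```

All values come back as strings; converting "9.7" to a number would be up to the code that uses the result.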

International characters

Because of JavaScript's initially simplistic implementation, and the fact that this simplistic approach was later set in stone as standard behavior, JavaScript regular expressions are rather dumb about characters that do not appear in English. For example, as far as JavaScript regular expressions are concerned, a “word character” is only one of the 26 letters of the Latin alphabet (uppercase or lowercase), a decimal digit, or, for some reason, the underscore. Letters like é or β, which most definitely are letters, do not match \w (and do match \W, the nonword category).

By a strange historical accident, \s (whitespace) does not have this problem and matches all characters that the Unicode standard considers whitespace, including things like the non-breaking space and, in older engines, the Mongolian vowel separator.

Some regular expression implementations in other languages have special syntax for matching specific categories of Unicode characters, such as “all uppercase letters”, “all punctuation”, or “control characters”. When this text was written, such categories were only planned for JavaScript; ECMAScript 2018 has since added them as \p{...} property escapes, which require the u option.
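The behavior described here can be checked with a few one-liners. The second half of this sketch assumes an engine that implements ECMAScript 2018 Unicode property escapes (\p{...} together with the u option); on older engines those lines are a syntax error.

```javascript
var eAcute = "\u00e9"; // é
var beta = "\u03b2";   // β

var wordE = /\w/.test(eAcute);
var nonWordBeta = /\W/.test(beta);
var nbspIsSpace = /\s/.test("\u00a0"); // non-breaking space
console.log(wordE);       // → false
console.log(nonWordBeta); // → true
console.log(nbspIsSpace); // → true

// Unicode property escapes: L = letter, Lu = uppercase letter, P = punctuation.
var letterE = /\p{L}/u.test(eAcute);
var upperE = /\p{Lu}/u.test(eAcute);
var punct = /\p{P}/u.test("!");
console.log(letterE); // → true
console.log(upperE);  // → false
console.log(punct);   // → true
```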

Summary

Regular expressions are objects that represent patterns in strings. They use their own syntax to express these patterns.

/abc/ A sequence of characters
/[abc]/ Any character from a set of characters
/[^abc]/ Any character not in a set of characters
/[0-9]/ Any character in a range of characters
/x+/ One or more occurrences of the pattern x
/x+?/ One or more occurrences, nongreedy
/x*/ Zero or more occurrences
/x?/ Zero or one occurrence
/x{2,4}/ Between two and four occurrences
/(abc)/ A group
/a|b|c/ Any one of several patterns
/\d/ Any digit character
/\w/ An alphanumeric character (“word character”)
/\s/ Any whitespace character
/./ Any character except newlines
/\b/ A word boundary
/^/ Start of input
/$/ End of input

The regex has a test method to check whether the pattern is in the string. There is an exec method that returns an array containing all the groups found. The array has an index property, which contains the number of the character from which the match occurred.

Strings have a match method to match them against a regular expression, and a search method that returns only the starting position of the match. The replace method can replace matches of a pattern with another string. Alternatively, you can pass a function to replace, which builds the replacement string from the matched text and the matched groups.
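A brief sketch of these methods side by side (the sample string is arbitrary):

```javascript
const text = "one two 100";

console.log(/\d+/.test(text));           // true

const found = /\d+/.exec(text);
console.log(found[0], found.index);      // 100 8

console.log(text.search(/two/));         // 4
console.log(text.match(/t(w)o/)[1]);     // w

// A function passed to replace receives the matched text
// and returns the replacement.
const shouted = text.replace(/\b[a-z]+\b/g, function (word) {
  return word.toUpperCase();
});
console.log(shouted);                    // ONE TWO 100
```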

Regular expressions can have options, which are written after the closing slash. The i option makes the match case-insensitive, and the g option makes the expression global, which, among other things, causes the replace method to replace all matches instead of just the first one.
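The effect of the two options on replace can be seen in a small sketch (the sample phrase is arbitrary):

```javascript
const phrase = "cat Cat cat";

// No options: only the first match is replaced.
const firstOnly = phrase.replace(/cat/, "dog");
// g: every lowercase "cat" is replaced.
const everyLower = phrase.replace(/cat/g, "dog");
// g and i together: case is ignored as well.
const everyCase = phrase.replace(/cat/gi, "dog");

console.log(firstOnly);   // dog Cat cat
console.log(everyLower);  // dog Cat dog
console.log(everyCase);   // dog dog dog
```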

The RegExp constructor can be used to create regular expressions from strings.

Regular expressions are a sharp tool with an awkward handle. They simplify some tasks tremendously, but can become unmanageable when applied to other, more complex problems. Part of learning to use them is resisting the temptation to force them onto a task they are not suited for.

Exercises

Inevitably, while working on these exercises, you will run into confusing cases and may sometimes despair at the unpredictable behavior of some regular expressions. Sometimes it helps to study the expression in an online tool such as debuggex.com, where you can see its visualization and compare it with the intended effect.
Regexp golf
"Golf" in code is a game where you try to express a given program in as few characters as possible. Regexp golf is the practical exercise of writing the smallest possible regular expression that matches a given pattern, and only that pattern.

For each of the following items, write a regular expression that tests whether any of the given substrings occur in a string. The regular expression should match only strings containing one of the described substrings. Don't worry about word boundaries unless they are specifically mentioned. When you have a working expression, try to make it smaller.

- car and cat
- pop and prop
- ferret, ferry, and ferrari
- Any word ending in ious
- A whitespace character followed by a period, comma, colon or semicolon
- A word longer than six letters
- A word without the letter e

// Enter your regular expressions
verify(/.../, ["my car", "bad cats"], ["camper", "high art"]);
verify(/.../, ["pop culture", "mad props"], ["plop"]);
verify(/.../, ["ferret", "ferry", "ferrari"], ["ferrum", "transfer A"]);
verify(/.../, ["how delicious", "spacious room"], ["ruinous", "consciousness"]);
verify(/.../, ["bad punctuation ."], ["escape the dot"]);
verify(/.../, ["hottenottententen"], ["no", "hotten totten tenten"]);
verify(/.../, ["red platypus", "wobbling nest"], ["earth bed", "learning ape"]);

function verify(regexp, yes, no) {
  // Ignore unfinished exercises
  if (regexp.source == "...") return;
  yes.forEach(function(s) {
    if (!regexp.test(s))
      console.log("Not found '" + s + "'");
  });
  no.forEach(function(s) {
    if (regexp.test(s))
      console.log("Unexpected occurrence of '" + s + "'");
  });
}

Quotes in text
Let's say you wrote a story and used single quotes throughout to indicate dialogue. Now you want to replace the dialogue quotes with double quotes, and leave the single quotes in abbreviations for words like aren’t.

Come up with a pattern that distinguishes between these two uses of quotes, and write a call to the replace method that does the replacement.

Numbers again
Sequences of digits can be found with the simple regular expression /\d+/.

Write an expression that matches only numbers written in JavaScript style. It must support an optional minus or plus sign in front of the number, a decimal dot, and exponent notation (5e-3 or 1E10), again with an optional plus or minus in front of the exponent. Also note that digits are not required both before and after the dot, but the number cannot consist of a dot alone. That is, .5 and 5. are valid numbers, but a lone dot is not.

// Enter the regular expression here.
var number = /^...$/;

// Tests:
["1", "-1", "+15", "1.55", ".5", "5.", "1.3e2", "1E-4", "1e+12"].forEach(function(s) {
  if (!number.test(s))
    console.log("Did not find '" + s + "'");
});
["1a", "+-1", "1.2.3", "1+1", "1e4.5", ".5.", "1f5", "."].forEach(function(s) {
  if (number.test(s))
    console.log("Incorrectly accepted '" + s + "'");
});

Modifiers

The minus symbol (-) placed next to a modifier (except for U) creates its negation.

Special characters

Symbol  Analogue        Description
()                      subpattern, nested expression
[]                      character class (any one of the listed characters)
{a,b}                   number of occurrences, from "a" to "b"
|                       logical "or"; for single-character alternatives use []
\                       escapes a special character
.                       any character except a line feed
\d      [0-9]           decimal digit
\D      [^\d]           any character other than a decimal digit
\f                      form feed (page break)
\n                      line feed
\pL                     a letter in UTF-8 encoding when the u modifier is used
\r                      carriage return
\s      [ \t\v\r\n\f]   whitespace character
\S      [^\s]           any character except whitespace
\t                      tab
\w      [0-9A-Za-z_]    any digit, letter or underscore
\W      [^\w]           any character other than a digit, letter or underscore
\v                      vertical tab

Special characters within a character class

Position within a line

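Symbol  Example  Sample text  Description
^       ^a       aaa aaa      start of line (matches the first "a")
$       a$       aaa aaa      end of line (matches the last "a")
\A      \Aa      aaa aaa      beginning of the text (matches only the first "a" of the first line)
                 aaa aaa
\z      a\z      aaa aaa      end of the text (matches only the last "a" of the last line)
                 aaa aaa
\b      a\b      aaa aaa      word boundary, an assertion: the previous character is a word character
        \ba      aaa aaa      and the next one is not, or vice versa (a\b matches the last "a" of each
                              word, \ba the first "a" of each word)
\B      \Ba\B    aaa aaa      not a word boundary (matches the middle "a" of each word)
\G      \Ga      aaa aaa      end of the previous successful match; the search stops at the 4th
                              position, where "a" is not found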

Anchors

Anchors in regular expressions mark the beginning or end of something, for example of a line or a word. They are represented by particular symbols. For example, a pattern matching a string that starts with a digit would look like this:

^[0-9]+

Here the ^ character denotes the beginning of the line. Without it, the pattern would match any string containing a digit.
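The effect of the anchor can be checked in JavaScript with a small sketch (the sample strings are arbitrary):

```javascript
const startsWithDigit = /^[0-9]+/;
const containsDigit = /[0-9]+/;

console.log(startsWithDigit.test("42 apples"));  // true
console.log(startsWithDigit.test("apples 42"));  // false
console.log(containsDigit.test("apples 42"));    // true
```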

Character classes

Character classes in regular expressions match a whole set of characters at once. For example, \d matches any digit from 0 to 9 inclusive, \w matches letters and digits, and \W matches every character that is not a letter or a digit. A pattern that matches a single letter, digit or whitespace character can be written like this:

[\w\s]

POSIX

POSIX is a relatively new addition to the regular expression family. The idea, as with character classes, is to use shortcuts that represent some group of characters.

Assertions

Almost everyone has trouble understanding assertions at first, but as you become more familiar with them, you'll find yourself using them quite often. Assertions provide a way to say, "I want to find every word in this document that includes the letter "q" that is not followed by "werty"."

[^\s]*q(?!werty)[^\s]*

The above pattern starts by searching for any characters other than whitespace ([^\s]*), followed by q. The parser then reaches a lookahead assertion. This automatically makes the preceding element (character, group, or character class) conditional: it matches only if the assertion is true. In our case the assertion is negative (?!), that is, it is true if the text it looks for is not found.

So the parser checks the next few characters against the proposed pattern (werty). If they are found, the assertion is false, which means the character q is "ignored", that is, it does not match the pattern. If werty is not found, the assertion is true and the q is accepted. The search then continues with any characters other than whitespace ([^\s]*).
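JavaScript supports this kind of negative lookahead, so the walkthrough can be tried directly. A minimal sketch with an arbitrary sample sentence:

```javascript
// Words containing a "q" that is not followed by "werty".
const pattern = /[^\s]*q(?!werty)[^\s]*/g;
const words = "qwerty quick aqua".match(pattern);

console.log(words);  // [ 'quick', 'aqua' ]
```

The word "qwerty" is skipped because its only q is followed by "werty", so the assertion fails there.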

Quantifiers

Quantifiers allow you to define a part of a pattern that must be repeated several times in a row. For example, if you want to find out whether a document contains a string of 10 to 20 (inclusive) letters "a", then you can use this pattern:

a{10,20}
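A quick way to check the bounds in JavaScript is sketched below. The ^ and $ anchors are an addition made for this example: without them, a run of 25 letters would still contain a valid run of 20 and the pattern would match.

```javascript
// Exactly 10 to 20 letters "a" and nothing else.
const tenToTwenty = /^a{10,20}$/;

console.log(tenToTwenty.test("a".repeat(9)));   // false
console.log(tenToTwenty.test("a".repeat(10)));  // true
console.log(tenToTwenty.test("a".repeat(20)));  // true
console.log(tenToTwenty.test("a".repeat(21)));  // false
```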

By default, quantifiers are "greedy". The quantifier +, meaning "one or more times", will therefore match as much text as it possibly can. Sometimes this causes problems, in which case you can tell the quantifier to stop being greedy (to become "lazy") by using a special modifier. Consider this pattern:

".*"

This pattern matches text enclosed in double quotes. However, your source line could be something like this:

<a href="helloworld.htm" title="Hello World">Hello World</a>

The above pattern will find the following substring in this line:

"helloworld.htm" title="Hello World"

It turned out to be too greedy, grabbing the largest piece of text it could.
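The greedy behavior can be reproduced in JavaScript. The sketch below assumes the source line is a link tag with href and title attributes, and it also runs the lazy form .*? for comparison:

```javascript
// Assumed sample markup for the example.
const link = '<a href="helloworld.htm" title="Hello World">Hello World</a>';

// Greedy: runs from the first quote to the last one.
const greedy = link.match(/".*"/)[0];
// Lazy: stops at the nearest closing quote, once per quoted value.
const lazy = link.match(/".*?"/g);

console.log(greedy);  // "helloworld.htm" title="Hello World"
console.log(lazy);    // [ '"helloworld.htm"', '"Hello World"' ]
```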

".*?"

This pattern also matches any characters enclosed in double quotes. But the lazy version (note the ? modifier) looks for the smallest possible match, and will therefore find each double-quoted substring individually:

"helloworld.htm" "Hello World"

Escaping in regular expressions

Regular expressions use certain characters to represent different parts of a pattern. A problem arises, however, if you need to find one of these characters in a string as an ordinary character. A dot, for example, means "any character other than a line break" in a regular expression. If you need to find a dot in a string, you can't just use "." as the pattern, because it would match almost anything. So you need to tell the parser that this dot should be treated as a literal dot and not as "any character". This is done with an escape character.

An escape character (a backslash) placed before a character such as a dot makes the parser ignore its special function and treat it as an ordinary character. Several characters require such escaping in most patterns and languages. You can find them in the lower right corner of the cheat sheet ("Meta Symbols").

The pattern for finding a dot is:

\.

Other special characters in regular expressions match unusual elements in text. Line breaks and tabs, for example, can be typed on the keyboard but are likely to confuse programming languages. The escape character is used here to tell the parser to treat the next character as a special character rather than a regular letter or number.
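Both uses of the escape character can be sketched in JavaScript (the sample strings are arbitrary):

```javascript
const anyChar = /./;       // unescaped dot: any character
const literalDot = /\./;   // escaped dot: only a real "."

console.log(anyChar.test("abc"));     // true
console.log(literalDot.test("abc"));  // false
console.log(literalDot.test("a.c"));  // true

// The reverse case: \t turns the ordinary letter "t" into a tab.
const fields = "name\tage\tcity".split(/\t/);
console.log(fields);                  // [ 'name', 'age', 'city' ]
```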

Special escape characters in regular expressions

String substitution

String substitution is described in detail in the next section, "Groups and Ranges", but the existence of "passive" (non-capturing) groups should be mentioned here. These groups are ignored during substitution, which is very useful if you want an "or" condition in a pattern without that group taking part in the substitution.

Groups and Ranges

Groups and ranges are very, very useful. It's probably easier to start with ranges. They allow you to specify a set of suitable characters. For example, to check whether a string contains hexadecimal digits (0 to 9 and A to F), you would use the following range:

[A-Fa-f0-9]

To check the opposite, use a negative range, which in our case fits any character except numbers from 0 to 9 and letters from A to F:

[^A-Fa-f0-9]
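A hedged sketch of how the two ranges might be used in JavaScript. The anchors and the + quantifier in isHex are additions for the example, so that the whole string is checked rather than a single character:

```javascript
const isHex = /^[A-Fa-f0-9]+$/;   // only hexadecimal digits
const hasNonHex = /[^A-Fa-f0-9]/; // at least one other character

console.log(isHex.test("1aF9"));      // true
console.log(isHex.test("xyz"));       // false
console.log(hasNonHex.test("12G4"));  // true
console.log(hasNonHex.test("BEEF"));  // false
```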

Groups are most often used when an "or" condition is needed in a pattern, when you need to refer to one part of a pattern from another part of it, and when substituting strings.

Using "or" is very simple: the following pattern looks for "ab" or "bc":

(ab|bc)

If a regular expression needs to refer to one of the earlier groups, use \n, where n is the number of the desired group. You may want a pattern that matches the letters "aaa" or "bbb", followed by a number and then the same three letters. This pattern is implemented using groups:

(aaa|bbb)[0-9]+\1

The first part of the pattern looks for "aaa" or "bbb" and combines the letters found into a group. It is followed by a search for one or more digits ([0-9]+), and finally by \1. The last part of the pattern refers to the first group and looks for the same thing: it matches the text already found by the first part of the pattern, not the pattern itself. So "aaa123bbb" does not satisfy the above pattern, since \1 will look for "aaa" after the number.
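The backreference behaves the same way in JavaScript. A minimal sketch, writing the digit part explicitly as [0-9]+:

```javascript
const sameLetters = /(aaa|bbb)[0-9]+\1/;

console.log(sameLetters.test("aaa123aaa"));  // true
console.log(sameLetters.test("bbb7bbb"));    // true
console.log(sameLetters.test("aaa123bbb"));  // false
```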

One of the most useful tools in regular expressions is string substitution. When replacing text, you can refer to a found group with $n. Let's say you want to highlight every occurrence of the word "wish" in a text in bold. To do this, you would use a regular expression replace function, which might look like this:

replace(pattern, replacement, subject)

The first parameter will be something like this (you may need a few extra characters for this particular function):

([^A-Za-z0-9])(wish)([^A-Za-z0-9])

It will find any occurrence of the word "wish" together with the preceding and following characters, as long as they are not letters or digits. Your substitution could then look like this:

$1<b>$2</b>$3

It will replace the entire string found by the pattern. We start the replacement with the first character found (the one that is not a letter or a digit), referenced as $1. Without it, that character would simply be removed from the text. The same goes for the end of the substitution ($3). In the middle we add an HTML tag for bold (of course, you could use CSS or <strong> instead), wrapping the second group found by the pattern ($2).
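In JavaScript the same substitution is written with the string's replace method. A sketch under the assumption that the bold tag is <b>; the sample sentence is arbitrary, and the g flag is added so that every occurrence is handled:

```javascript
const sentence = "I wish you were here, wish me luck.";

const bolded = sentence.replace(
  /([^A-Za-z0-9])(wish)([^A-Za-z0-9])/g,
  "$1<b>$2</b>$3"
);
console.log(bolded);
// I <b>wish</b> you were here, <b>wish</b> me luck.

// A shorter variant using word boundaries, which consume no characters.
const boldedAlt = sentence.replace(/\b(wish)\b/g, "<b>$1</b>");
```

The three-group version needs a character on each side of the word, so it would miss a "wish" at the very start or end of the text; the \b variant does not have that limitation.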

Pattern modifiers

Pattern modifiers are used in several languages, most notably Perl. They allow you to change how the parser works. For example, the i modifier causes the parser to ignore case.

Regular expressions in Perl are surrounded by the same character at the beginning and at the end. This can be any character (most often “/” is used), and it looks like this:

/pattern/

Modifiers are added to the end of this line, like this:

/pattern/i

Meta characters

Finally, the last part of the table contains meta characters. These are characters that have special meaning in regular expressions. So if you want to use one of them as an ordinary character, it needs to be escaped. To check for the presence of an opening parenthesis in the text, use the following pattern:

\(

The cheat sheet is a general guide to regular expression patterns without taking into account the specifics of any language. It is presented in the form of a table that fits on one printed sheet of A4 size. Created under a Creative Commons license based on a cheat sheet authored by Dave Child.


new RegExp(pattern[, flags])

If the regular expression is known IN ADVANCE, the literal syntax is preferred (/test/i).

If the regular expression is not known in advance, build it from a character string with the constructor (new RegExp).

Note, however, that because the backslash \ acts as an escape character inside string literals, it has to be written twice in the string passed to new RegExp: \\
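The doubling rule can be sketched as follows. The word variable stands for a value that only becomes known at run time and is an assumption made for the example:

```javascript
// A literal and the equivalent constructor call.
const literal = /\d+\.\d+/;
const built = new RegExp("\\d+\\.\\d+");
console.log(literal.source === built.source);  // true

// With a single backslash the string literal silently drops it.
console.log(new RegExp("\d+").source);         // d+

// Building a pattern from a run-time value (hypothetical input).
const word = "testik";
const dynamic = new RegExp("\\b" + word + "\\b", "i");
console.log(dynamic.test("My TESTIK works"));  // true
```

If the run-time value may itself contain special characters, they would need to be escaped before being inserted into the pattern.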

Flags

i  ignore case when matching

g  global matching: unlike the default local matching, which finds only the first instance of the pattern, it finds all instances of the pattern

Operators

What  How                       Description                                          Usage
i     flag                      makes the regular expression case-insensitive        /testik/i
g     flag                      global search                                        /testik/g
m     flag                      multiline matching, for text that consists of
                                several lines, such as the value of a textarea
[]    character class operator  matches a set of characters                          [a-z] - any character in the range from a to z
^     caret operator            negates a character class                            [^a-z] - any character EXCEPT those in the range from a to z
-     hyphen operator           specifies an inclusive range of values               [a-z] - any character in the range from a to z
\     escape operator           escapes the character that follows it                \\
^     start-of-match operator   the match must occur at the beginning                /^testik/g
$     end-of-match operator     the match must occur at the end                      /testik$/g
?     ? operator                makes the character optional                         /t?est/g
+     + operator                the character must occur one or more times           /t+est/g
*     * operator                the character may occur any number of times,         /t*est/g
                                or be absent altogether
{}    {} operator               sets a fixed number of repetitions                   /t{4}est/g
{,}   {,} operator              sets the number of repetitions within given limits   /t{4,9}est/g

Predefined Character Classes

Predefined class  Matches
\t    horizontal tab
\n    line feed
.     any character other than a line feed
\d    any decimal digit, equivalent to [0-9]
\D    any character other than a decimal digit, equivalent to [^0-9]
\w    any word character (digits, letters and the underscore), equivalent to [A-Za-z0-9_]
\W    any character other than digits, letters and the underscore, equivalent to [^A-Za-z0-9_]
\s    any whitespace character
\S    any character except whitespace
\b    word boundary
\B    NOT a word boundary, but a position inside a word
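The two boundary classes are the least intuitive entries in the list, so here is a small sketch of how they differ (the sample string is arbitrary):

```javascript
const sample = "cat concat cats";

// \bcat\b: "cat" as a whole word only.
const wholeWord = sample.match(/\bcat\b/g);
// \Bcat: "cat" that does not start at a word boundary (the one in "concat").
const insideWord = sample.match(/\Bcat/g);

console.log(wholeWord.length);   // 1
console.log(insideWord.length);  // 1

// \d and \D side by side: one digit followed by one non-digit.
console.log(/^\d\D$/.test("7x")); // true
console.log(/^\d\D$/.test("77")); // false
```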

Grouping()

If you want to apply an operator such as + to a group of items, wrap them in parentheses (): /(abcd)+/.

Captures

The part of the regular expression enclosed in parentheses () is called a capture.

Consider the following example:

/^([abc])k\1/

\1 is not just any character from a, b, c.
\1 is whatever character matched the first group. That is, the character that \1 must match is not known until the regular expression is evaluated.

Non-capturing groups

Parentheses () are used in two cases: for grouping and for marking captures. But there are situations when we need () only for grouping and have no use for the captures; in addition, removing unnecessary captures makes the regular expression engine's work easier.

So, to prevent capturing, put ?: right after the opening parenthesis:
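A minimal sketch of the difference, using an arbitrary sample tag: the same pattern is run once with capturing groups and once with (?:...) groups, and only the length of the result array changes.

```javascript
const tag = '<div class="test">';

const capturing = tag.match(/<(\/?)(\w+)([^>]*?)>/);
const nonCapturing = tag.match(/<(?:\/?)(?:\w+)(?:[^>]*?)>/);

console.log(capturing.length);     // 4 (full match plus three captures)
console.log(capturing[2]);         // div
console.log(nonCapturing.length);  // 1 (full match only)
```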

var str = "<div><p>Hello world!</p></div>";
var found = str.match(/<(?:\/?)(?:\w+)(?:[^>]*?)>/i);
console.log("found without captures: ", found); // [ "<div>" ]

test function

regexp.test(str)

The test function checks whether the regular expression matches the string (str). It returns either true or false.

Usage example:

Javascript

function codeF(str) {
  return /^\d{5}-\d{2}/.test(str);
}
//console.log(codeF("12345-12ss")); // true
//console.log(codeF("1245-12ss")); // false

match function

str.match(regexp)

The match function returns an array of matches, or null if nothing was found. Note: if the regular expression does not have the g flag (global search), match returns only the first match in the string and, as the example shows, the array also contains the CAPTURES (the text matched by the parts of the regular expression enclosed in parentheses).

Javascript

var str = "For information, please refer to: Chapter 3.4.5.1";
var re = /chapter (\d+(\.\d)*)/i; // with captures (without the global flag)
var found = str.match(re);
console.log(found); // ["Chapter 3.4.5.1", "3.4.5.1", ".1"]

If you call match() with a global regular expression (with the g flag), an array is also returned, but it contains ALL the matches. The captures are not returned in this case.

Javascript

var str = "For information, refer to: Chapter 3.4.5.1, Chapter 7.5";
var re = /chapter (\d+(\.\d)*)/ig; // global, so no captures
var found = str.match(re);
console.log(found); // ["Chapter 3.4.5.1", "Chapter 7.5"]

exec function

regexp.exec(str)

The exec function checks whether a regular expression matches a string (str). It returns an array of results (with captures) or null. When the g flag is set, exec stores the position where the last match ended in the regexp's lastIndex property, so each subsequent call (for example, in a while loop) moves on to the next match.

Javascript

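var html = '<div class="test"><b>BAM!</b> <em>BUM!</em></div>';
var reg = /<(\/?)(\w+)([^>]*?)>/g;
//console.log(reg.exec(html)); // ['<div class="test">', "", "div", ' class="test"']
var match;
while ((match = reg.exec(html)) !== null) {
  // Log the match obtained in the loop condition; calling exec again
  // here would skip every other tag.
  console.log(match);
}
/*
['<div class="test">', "", "div", ' class="test"']
["<b>", "", "b", ""]
["</b>", "/", "b", ""]
["<em>", "", "em", ""]
["</em>", "/", "em", ""]
["</div>", "/", "div", ""]
*/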

Without the global flag, the match and exec methods work identically: both return an array with the first match and its captures.

Javascript

var html = '<div class="test"><b>BAM!</b> <em>BUM!</em></div>';
var reg = /<(\/?)(\w+)([^>]*?)>/; // without global

// match
console.log(html.match(reg)); // ['<div class="test">', "", "div", ' class="test"']

// exec
console.log(reg.exec(html)); // ['<div class="test">', "", "div", ' class="test"']

replace function

str.replace(regexp, newSubStr|function)
  • regexp - reg. expression;
  • newSubStr - the string to which the found expression in the text is changed;
  • function - called for each match found, with a variable list of parameters (recall that a global search finds all instances of the pattern in the string).

The return value of this function serves as the replacement.

Function parameters:

  • 1 - The complete matched substring.
  • 2 - The values of the parenthesized groups (captures), one parameter per group.
  • 3 - The index (position) of the match in the source string.
  • 4 - The source string.

The method does not change the calling string, but returns a new one after replacing the matches. To perform a global search and replace, use regexp with the g flag.

"GHGHGHGTTTT".replace(/[GHT]/g, "K"); // "KKKKKKKKKKK"

Javascript

function upLetter(allStr, letter) {
  return letter.toUpperCase();
}
var res = "border-top-width".replace(/-(\w)/g, upLetter);
console.log(res); // borderTopWidth
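The full parameter list of the replacement function can be observed with a sketch like the one below. The sample string and the parameter names (count, fruit, offset, source) are chosen for this example only:

```javascript
const seen = [];

const result = "2 apples and 10 pears".replace(
  /(\d+) (\w+)/g,
  function (match, count, fruit, offset, source) {
    // match: the whole matched substring; count, fruit: the two captures;
    // offset: the position of the match; source: the original string.
    seen.push({ match: match, offset: offset });
    return fruit + " x" + count;
  }
);

console.log(result);  // apples x2 and pears x10
console.log(seen);
// [ { match: '2 apples', offset: 0 }, { match: '10 pears', offset: 13 } ]
```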