|
1 | 1 | ---
|
2 |
| -title: "Chapter 06 Standard Library: Regular Expression" |
| 2 | +title: "Chapter 06 Regular Expression" |
3 | 3 | type: book-en-us
|
4 | 4 | order: 6
|
5 | 5 | ---
|
6 | 6 |
|
7 |
| -# Chapter 06 Standard Library: Regular Expression |
| 7 | +# Chapter 06 Regular Expression |
8 | 8 |
|
9 |
| -[Table of Content](./toc.md) | [Previous Chapter](./05-pointers.md) | [Next Chapter: Standard Library: Threads and Concurrency](./07-thread.md) |
| 9 | +[TOC] |
| 10 | + |
| 11 | +## 6.1 Introduction |
| 12 | + |
| 13 | +Regular expressions are not part of the C++ language itself, so we only |
| 14 | +introduce them briefly here. |
| 15 | + |
| 16 | +A regular expression describes a pattern for matching strings. |
| 17 | +Regular expressions are mainly used to meet |
| 18 | +the following three needs: |
| 19 | + |
| 20 | +1. Check whether a string contains a substring of a certain form; |
| 21 | +2. Replace the matching substrings; |
| 22 | +3. Extract the matching substrings from a string. |
| 23 | + |
| 24 | +Regular expressions are text patterns consisting of ordinary characters (such as a to z) |
| 25 | +and special characters. A pattern describes one or more strings to match when searching for text. |
| 26 | +Regular expressions act as a template to match a character pattern to the string being searched. |
| 27 | + |
| 28 | +### Ordinary characters |
| 29 | + |
| 30 | +Ordinary characters include all printable and unprintable characters that |
| 31 | +are not explicitly specified as metacharacters. This includes all uppercase |
| 32 | +and lowercase letters, all digits, all punctuation, and some other symbols. |
| 33 | + |
| 34 | +### Special characters |
| 35 | + |
| 36 | +Special characters have a special meaning in a regular expression |
| 37 | +and form the core of its matching syntax. See the table below: |
| 38 | + |
| 39 | +|Special characters|Description| |
| 40 | +|:---:|:------------------------------------------------------| |
| 41 | +|`$`| Matches the end position of the input string. | |
| 42 | +|`(`,`)`| Marks the start and end of a subexpression. Subexpressions can be obtained for later use. | |
| 43 | +|`*`| Matches the previous subexpression zero or more times. | |
| 44 | +|`+`| Matches the previous subexpression one or more times. | |
| 45 | +|`.`| Matches any single character except the newline character `\n`. | |
| 46 | +|`[`| Marks the beginning of a bracket expression. | |
| 47 | +|`?`| Matches the previous subexpression zero or one time, or indicates a non-greedy quantifier. | |
| 48 | +| `\`| Marks the next character as a special character, a literal character, a back reference, or an octal escape. For example, `n` matches the character `n`, while `\n` matches a newline character. The sequence `\\` matches the `'\'` character, while `\(` matches the `'('` character.| |
| 49 | +|`^`| Matches the beginning of the input string, unless it appears at the start of a bracket expression, where it negates the set so that the listed characters are not accepted. | |
| 50 | +|`{`| Marks the beginning of a quantifier expression. | |
| 51 | +|`\|`| Indicates a choice between two alternatives. | |
| 52 | + |
| 53 | +### Quantifiers |
| 54 | + |
| 55 | +A quantifier specifies how many times a given component of a regular expression must appear for the match to succeed. See the table below: |
| 56 | + |
| 57 | +|Character|Description| |
| 58 | +|:---:|:------------------------------------------------------| |
| 59 | +|`*`| Matches the previous subexpression zero or more times. For example, `foo*` matches `fo` and `foooo`. `*` is equivalent to `{0,}`. | |
| 60 | +|`+`| Matches the previous subexpression one or more times. For example, `foo+` matches `foo` and `foooo` but does not match `fo`. `+` is equivalent to `{1,}`. | |
| 61 | +|`?`| Matches the previous subexpression zero or one time. For example, `Your(s)?` matches both `Your` and `Yours`. `?` is equivalent to `{0,1}`. | |
| 62 | +|`{n}`| `n` is a non-negative integer. Matches exactly `n` times. For example, `o{2}` cannot match the single `o` in `for`, but matches the two `o`s in `foo`. | |
| 63 | +|`{n,}`| `n` is a non-negative integer. Matches at least `n` times. For example, `o{2,}` cannot match the single `o` in `for`, but matches all the `o`s in `foooooo`. `o{1,}` is equivalent to `o+`. `o{0,}` is equivalent to `o*`. | |
| 64 | +|`{n,m}`| `m` and `n` are non-negative integers, where `n` is less than or equal to `m`. Matches at least `n` and at most `m` times. For example, `o{1,3}` matches the first three `o`s in `foooooo`. `o{0,1}` is equivalent to `o?`. Note that no spaces are allowed between the comma and the two numbers. | |
| 65 | + |
| 66 | +With these two tables, we can usually read almost all regular expressions. |
| 67 | + |
| 68 | +## 6.2 `std::regex` and Related Components |
| 69 | + |
| 70 | +The most common way to match string content is to use regular expressions. Unfortunately, traditional C++ supported regular expressions neither at the language level nor in the standard library, even though C++ is a high-performance language and, in the development of backend services, using regular expressions to examine URL resource links is the most mature and common practice in industry. |
| 71 | + |
| 72 | +The usual solution was to use the regular expression library from `boost`. C++11 officially incorporated regular expression processing into the standard library, providing standard support and removing the reliance on third-party libraries. |
| 73 | + |
| 74 | +The regular expression library provided by C++11 operates on `std::string` objects: the pattern is held in a `std::regex` (essentially a `std::basic_regex`), matching is performed by `std::regex_match`, and the result is stored in a `std::smatch` (essentially a `std::match_results` object). |
| 75 | + |
| 76 | +We use a simple example to briefly introduce the use of this library. Consider the following regular expression: |
| 77 | + |
| 78 | +- `[a-z]+\.txt`: In this regular expression, `[a-z]` matches one lowercase letter and `+` matches the previous expression one or more times, so `[a-z]+` matches a string of one or more lowercase letters. In a regular expression, a `.` matches any character, while `\.` matches the literal character `.`, and the final `txt` matches exactly the three letters `txt`. So this regular expression matches the name of a text file consisting purely of lowercase letters. |
| 79 | + |
| 80 | +`std::regex_match` checks whether an entire string matches a regular expression, and it has many overloads. The simplest form takes a `std::string` and a `std::regex`. It returns `true` when the match succeeds and `false` otherwise. For example: |
| 81 | + |
| 82 | +```cpp |
| 83 | +#include <iostream> |
| 84 | +#include <string> |
| 85 | +#include <regex> |
| 86 | + |
| 87 | +int main() { |
| 88 | + std::string fnames[] = {"foo.txt", "bar.txt", "test", "a0.txt", "AAA.txt"}; |
| 89 | + // In C++, `\` is an escape character inside string literals. For `\.` to reach the regular expression, the `\` itself must be escaped, hence `\\.` |
| 90 | + std::regex txt_regex("[a-z]+\\.txt"); |
| 91 | + for (const auto &fname: fnames) |
| 92 | + std::cout << fname << ": " << std::regex_match(fname, txt_regex) << std::endl; |
| 93 | +} |
| 94 | +``` |
| 95 | + |
| 96 | +Another common form takes three arguments: a `std::string`, a `std::smatch`, and a `std::regex`. |
| 97 | +`std::smatch` is essentially a `std::match_results`. |
| 98 | +In the standard library, `std::smatch` is defined as `std::match_results<std::string::const_iterator>`, |
| 99 | +that is, a `match_results` over string iterators. |
| 100 | +With `std::smatch`, the matching results are easy to retrieve, for example: |
| 101 | + |
| 102 | +```cpp |
| 103 | +std::regex base_regex("([a-z]+)\\.txt"); |
| 104 | +std::smatch base_match; |
| 105 | +for(const auto &fname: fnames) { |
| 106 | + if (std::regex_match(fname, base_match, base_regex)) { |
| 107 | + // the first element of std::smatch holds the entire matched string |
| 108 | + // the second element of std::smatch holds the match of the first parenthesized subexpression |
| 109 | + if (base_match.size() == 2) { |
| 110 | + std::string base = base_match[1].str(); |
| 111 | + std::cout << "sub-match[0]: " << base_match[0].str() << std::endl; |
| 112 | + std::cout << fname << " sub-match[1]: " << base << std::endl; |
| 113 | + } |
| 114 | + } |
| 115 | +} |
| 116 | +``` |
| 117 | +
|
| 118 | +The output of the above two code snippets is: |
| 119 | +
|
| 120 | +``` |
| 121 | +foo.txt: 1 |
| 122 | +bar.txt: 1 |
| 123 | +test: 0 |
| 124 | +a0.txt: 0 |
| 125 | +AAA.txt: 0 |
| 126 | +sub-match[0]: foo.txt |
| 127 | +foo.txt sub-match[1]: foo |
| 128 | +sub-match[0]: bar.txt |
| 129 | +bar.txt sub-match[1]: bar |
| 130 | +``` |
| 131 | +
|
| 132 | +## Conclusion |
| 133 | +
|
| 134 | +This chapter briefly introduced regular expressions themselves, |
| 135 | +and then showed how to use the regular expression library |
| 136 | +through a practical example based on the main needs that |
| 137 | +regular expressions serve. |
| 138 | +
|
| 139 | +[Table of Content](./toc.md) | [Previous Chapter](./05-pointers.md) | [Next Chapter: Threads and Concurrency](./07-thread.md) |
10 | 140 |
|
11 | 141 | ## Further Readings
|
12 | 142 |
|
| 143 | +1. [Comments from `std::regex`'s author](http://zhihu.com/question/23070203/answer/84248248) |
| 144 | +2. [Library documentation for regular expressions](http://en.cppreference.com/w/cpp/regex) |
| 145 | +
|
13 | 146 | ## Licenses
|
14 | 147 |
|
15 | 148 | <a rel="license" href="http://creativecommons.org/licenses/by-nc-nd/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-nd/4.0/88x31.png" /></a><br />This work was written by [Ou Changkun](https://changkun.de) and licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-nd/4.0/">Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License</a>. The code of this repository is open sourced under the [MIT license](../../LICENSE).
|