Regex in Java
In computing, a regular expression, also referred to as regex or regexp, provides a concise and flexible means for matching strings of text, such as particular characters, words, or patterns of characters. A regular expression is written in a formal language that can be interpreted by a regular expression processor, a program that either serves as a parser generator or examines text and identifies parts that match the provided specification.
A regular expression, often called a pattern, is an expression that describes a set of strings. Patterns are usually used to give a concise description of a set without having to list all of its elements. For example, the set containing the three strings "Handel", "Händel", and "Haendel" can be described by the pattern H(ä|ae?)ndel (or alternatively, it is said that the pattern matches each of the three strings).
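As a quick check of the example pattern, here is a minimal sketch in Java. The class name HandelDemo and the helper isHandel are made up for illustration, and the letter ä is written as a Unicode escape only to keep the source file ASCII-safe.

```java
import java.util.regex.Pattern;

final class HandelDemo {
    // The pattern from the text: 'ä', or 'a' optionally followed by 'e'.
    static final Pattern HANDEL = Pattern.compile("H(\u00E4|ae?)ndel");

    // matches() succeeds only if the whole input matches the pattern.
    static boolean isHandel(String s) {
        return HANDEL.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] candidates = {"Handel", "H\u00E4ndel", "Haendel", "Hendel"};
        for (String s : candidates) {
            System.out.println(s + " -> " + isHandel(s));
        }
    }
}
```

The first three candidates match and "Hendel" does not, because the group requires either ä or a.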
Characters
| Construct | Matches |
| x | The character x |
| \\ | The backslash character |
| \0n | The character with octal value 0n (0 <= n <= 7) |
| \0nn | The character with octal value 0nn (0 <= n <= 7) |
| \0mnn | The character with octal value 0mnn (0 <= m <= 3, 0 <= n <= 7) |
| \xhh | The character with hexadecimal value 0xhh |
| \uhhhh | The character with hexadecimal value 0xhhhh |
| \t | The tab character ('\u0009') |
| \n | The newline (line feed) character ('\u000A') |
| \r | The carriage-return character ('\u000D') |
| \f | The form-feed character ('\u000C') |
| \a | The alert (bell) character ('\u0007') |
| \e | The escape character ('\u001B') |
| \cx | The control character corresponding to x |
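One practical point when using these escapes from Java source: the regex is written as a string literal, so every backslash meant for the regex engine has to be doubled. The sketch below illustrates this. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

final class EscapeDemo {
    // The regex for one literal backslash is two backslashes,
    // which takes four backslashes in a Java string literal.
    static boolean isSingleBackslash(String s) {
        return Pattern.matches("\\\\", s);
    }

    // The regex engine itself interprets the tab escape and the hex escape (0x41 is 'A').
    static boolean isTabThenA(String s) {
        return Pattern.matches("\\t\\x41", s);
    }

    // Octal 101 is decimal 65, which is 'A'.
    static boolean isOctalA(String s) {
        return Pattern.matches("\\0101", s);
    }

    public static void main(String[] args) {
        System.out.println(isSingleBackslash("\\"));
        System.out.println(isTabThenA("\tA"));
        System.out.println(isOctalA("A"));
    }
}
```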
Character classes
| Construct | Matches |
| [abc] | a, b, or c (simple class) |
| [^abc] | Any character except a, b, or c (negation) |
| [a-zA-Z] | a through z or A through Z, inclusive (range) |
| [a-d[m-p]] | a through d, or m through p: [a-dm-p] (union) |
| [a-z&&[def]] | d, e, or f (intersection) |
| [a-z&&[^bc]] | a through z, except for b and c: [ad-z] (subtraction) |
| [a-z&&[^m-p]] | a through z, and not m through p: [a-lq-z] (subtraction) |
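The union, intersection, and subtraction forms are easiest to see on single characters. The following is a small sketch with a hypothetical helper that tests one character against a class.

```java
import java.util.regex.Pattern;

final class CharClassDemo {
    static final Pattern UNION = Pattern.compile("[a-d[m-p]]");
    static final Pattern INTERSECTION = Pattern.compile("[a-z&&[def]]");
    static final Pattern SUBTRACTION = Pattern.compile("[a-z&&[^m-p]]");

    // True if the single character c belongs to the class described by p.
    static boolean contains(Pattern p, char c) {
        return p.matcher(String.valueOf(c)).matches();
    }

    public static void main(String[] args) {
        System.out.println("union has n: " + contains(UNION, 'n'));
        System.out.println("union has e: " + contains(UNION, 'e'));
        System.out.println("intersection has e: " + contains(INTERSECTION, 'e'));
        System.out.println("subtraction has n: " + contains(SUBTRACTION, 'n'));
    }
}
```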
Predefined character classes
| Construct | Matches |
| . | Any character (may or may not match line terminators) |
| \d | A digit: [0-9] |
| \D | A non-digit: [^0-9] |
| \s | A whitespace character: [ \t\n\x0B\f\r] |
| \S | A non-whitespace character: [^\s] |
| \w | A word character: [a-zA-Z_0-9] |
| \W | A non-word character: [^\w] |
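The predefined classes are often used through the String convenience methods. A minimal sketch, with illustrative method names and sample inputs:

```java
final class PredefinedDemo {
    // Removes every non-digit (\D) from the input.
    static String digitsOnly(String s) {
        return s.replaceAll("\\D", "");
    }

    // Splits on runs of whitespace (\s+) after trimming the ends.
    static String[] words(String s) {
        return s.trim().split("\\s+");
    }

    public static void main(String[] args) {
        System.out.println(digitsOnly("+1 (555) 010-9999"));
        System.out.println(String.join(",", words("  a\tb \n c ")));
    }
}
```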
POSIX character classes (US-ASCII only)
| Construct | Matches |
| \p{Lower} | A lower-case alphabetic character: [a-z] |
| \p{Upper} | An upper-case alphabetic character: [A-Z] |
| \p{ASCII} | All ASCII: [\x00-\x7F] |
| \p{Alpha} | An alphabetic character: [\p{Lower}\p{Upper}] |
| \p{Digit} | A decimal digit: [0-9] |
| \p{Alnum} | An alphanumeric character: [\p{Alpha}\p{Digit}] |
| \p{Punct} | Punctuation: One of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ |
| \p{Graph} | A visible character: [\p{Alnum}\p{Punct}] |
| \p{Print} | A printable character: [\p{Graph}\x20] |
| \p{Blank} | A space or a tab: [ \t] |
| \p{Cntrl} | A control character: [\x00-\x1F\x7F] |
| \p{XDigit} | A hexadecimal digit: [0-9a-fA-F] |
| \p{Space} | A whitespace character: [ \t\n\x0B\f\r] |
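A short sketch of the POSIX classes in use, assuming default flags (so the classes stay US-ASCII only). The helper names are hypothetical.

```java
final class PosixDemo {
    // True if the input consists only of hexadecimal digits.
    static boolean isHex(String s) {
        return s.matches("\\p{XDigit}+");
    }

    // Removes ASCII punctuation characters.
    static String stripPunct(String s) {
        return s.replaceAll("\\p{Punct}", "");
    }

    // True if the input is a single ASCII lower-case letter.
    static boolean isAsciiLower(String s) {
        return s.matches("\\p{Lower}");
    }

    public static void main(String[] args) {
        System.out.println(isHex("CAFE01"));
        System.out.println(stripPunct("Hello, world!"));
        // An accented letter is outside US-ASCII, so this prints false.
        System.out.println(isAsciiLower("\u00E9"));
    }
}
```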
Classes for Unicode blocks and categories
| Construct | Matches |
| \p{InGreek} | A character in the Greek block (simple block) |
| \p{Lu} | An uppercase letter (simple category) |
| \p{Sc} | A currency symbol |
| \P{InGreek} | Any character except one in the Greek block (negation) |
| [\p{L}&&[^\p{Lu}]] | Any letter except an uppercase letter (subtraction) |
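The block and category classes can be tried out as follows. This is only a sketch: the method names are illustrative, and non-ASCII sample characters are written as Unicode escapes.

```java
final class UnicodeClassDemo {
    // True if every character lies in the Greek block.
    static boolean isGreek(String s) {
        return s.matches("\\p{InGreek}+");
    }

    // True if the input is a single currency symbol.
    static boolean isCurrencySymbol(String s) {
        return s.matches("\\p{Sc}");
    }

    // True if the input is a single letter that is not an uppercase letter.
    static boolean isNonUpperLetter(String s) {
        return s.matches("[\\p{L}&&[^\\p{Lu}]]");
    }

    public static void main(String[] args) {
        System.out.println(isGreek("\u03B1\u03B2\u03B3")); // alpha, beta, gamma
        System.out.println(isCurrencySymbol("\u20AC"));    // euro sign
        System.out.println(isNonUpperLetter("a"));
    }
}
```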
Boundary matchers
| Construct | Matches |
| ^ | The beginning of a line |
| $ | The end of a line |
| \b | A word boundary |
| \B | A non-word boundary |
| \A | The beginning of the input |
| \G | The end of the previous match |
| \Z | The end of the input but for the final terminator, if any |
| \z | The end of the input |
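Two of the boundary matchers are sketched below: \b for whole-word searches, and the difference between \Z and \z when the input ends with a line terminator. The class and method names are assumptions made for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BoundaryDemo {
    // Counts occurrences of word that stand alone, using \b on both sides.
    static int countWholeWord(String text, String word) {
        Matcher m = Pattern.compile("\\b" + Pattern.quote(word) + "\\b").matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    // \Z tolerates one final line terminator; \z requires the true end of input.
    static boolean endsWithAbc(String input, boolean allowFinalTerminator) {
        String regex = allowFinalTerminator ? "abc\\Z" : "abc\\z";
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(countWholeWord("cat concat cat's category", "cat"));
        System.out.println(endsWithAbc("abc\n", true));
        System.out.println(endsWithAbc("abc\n", false));
    }
}
```

In the sample sentence only the first "cat" and the one in "cat's" count as whole words.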
Greedy quantifiers
| Construct | Matches |
| X? | X, once or not at all |
| X* | X, zero or more times |
| X+ | X, one or more times |
| X{n} | X, exactly n times |
| X{n,} | X, at least n times |
| X{n,m} | X, at least n but not more than m times |
Reluctant quantifiers
| Construct | Matches |
| X?? | X, once or not at all |
| X*? | X, zero or more times |
| X+? | X, one or more times |
| X{n}? | X, exactly n times |
| X{n,}? | X, at least n times |
| X{n,m}? | X, at least n but not more than m times |
Possessive quantifiers
| Construct | Matches |
| X?+ | X, once or not at all |
| X*+ | X, zero or more times |
| X++ | X, one or more times |
| X{n}+ | X, exactly n times |
| X{n,}+ | X, at least n times |
| X{n,m}+ | X, at least n but not more than m times |
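The three quantifier families share the same descriptions but differ in how much input they consume and whether they give it back. Here is a small sketch on one sample input. The helper firstMatch is hypothetical and returns null when nothing matches.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class QuantifierDemo {
    // Returns the first substring matched by regex, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "xfooxxxxxxfoo";
        // Greedy: takes everything, then backs off until "foo" fits.
        System.out.println(firstMatch(".*foo", input));
        // Reluctant: takes as little as possible before "foo".
        System.out.println(firstMatch(".*?foo", input));
        // Possessive: takes everything and never backs off, so "foo" cannot match.
        System.out.println(firstMatch(".*+foo", input));
    }
}
```

The greedy form matches the whole input, the reluctant form stops at the first "xfoo", and the possessive form finds no match at all.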
Logical operators
| Construct | Matches |
| XY | X followed by Y |
| X|Y | Either X or Y |
| (X) | X, as a capturing group |
Back references
| Construct | Matches |
| \n | Whatever the nth capturing group matched |
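A common use of a back reference is spotting a doubled word. In the sketch below, \1 must repeat whatever the first capturing group matched. The method name is made up for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BackReferenceDemo {
    // A word, whitespace, then the same word again as a whole word.
    private static final Pattern DOUBLED = Pattern.compile("\\b(\\w+)\\s+\\1\\b");

    // Returns the first doubled word, or null if the text has none.
    static String firstDoubledWord(String text) {
        Matcher m = DOUBLED.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(firstDoubledWord("this is is a test"));
        System.out.println(firstDoubledWord("this is a test"));
    }
}
```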
Quotation
| Construct | Matches |
| \ | Nothing, but quotes the following character |
| \Q | Nothing, but quotes all characters until \E |
| \E | Nothing, but ends quoting started by \Q |
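Quoting matters when the text to search for contains metacharacters. The sketch below uses Pattern.quote, which wraps its argument so that it is matched literally. The helper name containsLiteral is an assumption.

```java
import java.util.regex.Pattern;

final class QuoteDemo {
    // True if text contains literal exactly as written, metacharacters included.
    static boolean containsLiteral(String text, String literal) {
        return Pattern.compile(Pattern.quote(literal)).matcher(text).find();
    }

    // The same idea written by hand with \Q and \E for the fixed literal "1+1".
    static boolean containsOnePlusOne(String text) {
        return Pattern.compile("\\Q1+1\\E").matcher(text).find();
    }

    public static void main(String[] args) {
        // Unquoted, "1+1" means "one or more 1s followed by 1" and matches "11".
        System.out.println(Pattern.compile("1+1").matcher("11").find());
        System.out.println(containsLiteral("11", "1+1"));
        System.out.println(containsLiteral("1+1=2", "1+1"));
    }
}
```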
Special constructs (non-capturing)
| Construct | Matches |
| (?:X) | X, as a non-capturing group |
| (?idmsux-idmsux) | Nothing, but turns match flags on - off |
| (?idmsux-idmsux:X) | X, as a non-capturing group with the given flags on - off |
| (?=X) | X, via zero-width positive lookahead |
| (?!X) | X, via zero-width negative lookahead |
| (?<=X) | X, via zero-width positive lookbehind |
| (?<!X) | X, via zero-width negative lookbehind |
| (?>X) | X, as an independent, non-capturing group |
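Lookarounds assert something about the neighbouring text without consuming it. The sketch below shows a positive lookbehind, a negative lookahead, and an embedded flag. The method names and sample strings are illustrative only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LookaroundDemo {
    // Digits that directly follow a dollar sign; the sign is not part of the match.
    static List<String> dollarAmounts(String text) {
        Matcher m = Pattern.compile("(?<=\\$)\\d+").matcher(text);
        List<String> out = new ArrayList<>();
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    // A 'q' that is not followed by 'u'.
    static boolean hasQWithoutU(String text) {
        return Pattern.compile("q(?!u)").matcher(text).find();
    }

    // (?i) turns on case-insensitive matching from inside the pattern.
    static boolean isHelloAnyCase(String text) {
        return text.matches("(?i)hello");
    }

    public static void main(String[] args) {
        System.out.println(dollarAmounts("cost $15 or 20 euros, $7"));
        System.out.println(hasQWithoutU("Iraq"));
        System.out.println(isHelloAnyCase("HeLLo"));
    }
}
```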
Java Pattern class
A Pattern is a compiled representation of a regular expression. The Pattern class defines an alternate compile method that accepts a set of flags affecting the way the pattern is matched. The flags parameter is a bit mask that may include any of the following public static fields:
-
Pattern.CANON_EQ
Enables canonical equivalence. When this flag is specified, two characters will be considered to match if, and only if, their full canonical decompositions match. The expression "a\u030A", for example, will match the string "\u00E5" when this flag is specified. By default, matching does not take canonical equivalence into account. Specifying this flag may impose a performance penalty.
-
Pattern.CASE_INSENSITIVE
Enables case-insensitive matching. By default, case-insensitive matching assumes that only characters in the US-ASCII charset are being matched. Unicode-aware case-insensitive matching can be enabled by specifying the UNICODE_CASE flag in conjunction with this flag. Case-insensitive matching can also be enabled via the embedded flag expression (?i). Specifying this flag may impose a slight performance penalty.
-
Pattern.COMMENTS
Permits whitespace and comments in the pattern. In this mode, whitespace is ignored, and embedded comments starting with # are ignored until the end of a line. Comments mode can also be enabled via the embedded flag expression (?x).
-
Pattern.DOTALL
Enables dotall mode. In dotall mode, the expression . matches any character, including a line terminator. By default this expression does not match line terminators. Dotall mode can also be enabled via the embedded flag expression (?s). (The s is a mnemonic for "single-line" mode, which is what this is called in Perl.)
-
Pattern.LITERAL
Enables literal parsing of the pattern. When this flag is specified then the input string that specifies the pattern is treated as a sequence of literal characters. Metacharacters or escape sequences in the input sequence will be given no special meaning. The flags CASE_INSENSITIVE and UNICODE_CASE retain their impact on matching when used in conjunction with this flag. The other flags become superfluous. There is no embedded flag character for enabling literal parsing.
-
Pattern.MULTILINE
Enables multiline mode. In multiline mode the expressions ^ and $ match just after or just before, respectively, a line terminator or the end of the input sequence. By default these expressions only match at the beginning and the end of the entire input sequence. Multiline mode can also be enabled via the embedded flag expression (?m).
-
Pattern.UNICODE_CASE
Enables Unicode-aware case folding. When this flag is specified then case-insensitive matching, when enabled by the CASE_INSENSITIVE flag, is done in a manner consistent with the Unicode Standard. By default, case-insensitive matching assumes that only characters in the US-ASCII charset are being matched. Unicode-aware case folding can also be enabled via the embedded flag expression (?u). Specifying this flag may impose a performance penalty.
-
Pattern.UNIX_LINES
Enables Unix lines mode. In this mode, only the '\n' line terminator is recognized in the behavior of ., ^, and $. Unix lines mode can also be enabled via the embedded flag expression (?d).
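To make a few of these flags concrete, here is a minimal sketch covering DOTALL, LITERAL, and COMMENTS. The class name, method names, and sample patterns are assumptions chosen for illustration.

```java
import java.util.regex.Pattern;

final class FlagDemo {
    // Without DOTALL the dot refuses to match the newline in "a\nb".
    static boolean dotMatchesNewline(boolean dotAll) {
        int flags = dotAll ? Pattern.DOTALL : 0;
        return Pattern.compile("a.b", flags).matcher("a\nb").matches();
    }

    // With LITERAL the dot is an ordinary character, not a metacharacter.
    static boolean literalFinds(String input) {
        return Pattern.compile("a.b", Pattern.LITERAL).matcher(input).find();
    }

    // With COMMENTS, whitespace and #-comments inside the pattern are ignored.
    static boolean isShortPhoneNumber(String input) {
        Pattern p = Pattern.compile(
                "\\d{3}  # three digits\n"
              + "-       # a dash\n"
              + "\\d{4}  # four digits",
                Pattern.COMMENTS);
        return p.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(dotMatchesNewline(false));
        System.out.println(dotMatchesNewline(true));
        System.out.println(literalFinds("axb"));
        System.out.println(isShortPhoneNumber("555-1234"));
    }
}
```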
To compile a pattern with multiple flags, combine the flags using the bitwise OR operator "|". For example, the following code sample sets both the case-insensitive and multiline flags:
Pattern pattern = Pattern.compile("regexpression",Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);
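A fuller sketch of the same idea, counting lines that start with "error" in any letter case. The pattern "^error", the sample log text, and the method names are made up for the example. The second method uses the embedded flag expression (?im) instead of the flags argument.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CompileFlagsDemo {
    // Flags passed as a bit mask to Pattern.compile.
    static int countErrorLines(String log) {
        Pattern p = Pattern.compile("^error", Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);
        return count(p.matcher(log));
    }

    // The same flags switched on inside the pattern itself.
    static int countErrorLinesEmbedded(String log) {
        return count(Pattern.compile("(?im)^error").matcher(log));
    }

    private static int count(Matcher m) {
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String log = "Error: disk full\nok\nERROR: retry failed";
        System.out.println(countErrorLines(log));
        System.out.println(countErrorLinesEmbedded(log));
    }
}
```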
Groups and capturing
Capturing groups are numbered by counting their opening parentheses from left to right. In the expression
((Group A)(Group B(Group C)))
for example, there are four such groups:
1. ((Group A)(Group B(Group C)))
2. (Group A)
3. (Group B(Group C))
4. (Group C)
Group zero always stands for the entire expression.
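The numbering can be checked with Matcher.group(int). In this sketch the placeholder text "Group A", "Group B", and "Group C" is shortened to the single letters A, B, and C so that the pattern can be run against the input "ABC". The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GroupDemo {
    // Same nesting as the expression in the text, with one letter per group.
    private static final Pattern NESTED = Pattern.compile("((A)(B(C)))");

    // Returns groups 0..groupCount(), or an empty array if the input does not match.
    static String[] groups(String input) {
        Matcher m = NESTED.matcher(input);
        if (!m.matches()) {
            return new String[0];
        }
        String[] out = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            out[i] = m.group(i);
        }
        return out;
    }

    public static void main(String[] args) {
        // Index 0 is the whole match; indexes 1 to 4 follow the opening parentheses.
        System.out.println(Arrays.toString(groups("ABC")));
    }
}
```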
Note: the java.util.regex package supports named capturing groups, written (?<name>X), as of Java 7. Earlier versions support only numbered groups.