learnxinyminutes-docs/pcre.md at 912da583da39774e06731db39b046529e20ffa71

iskm/learnxinyminutes-docs

mirror of https://github.com/adambard/learnxinyminutes-docs.git synced 2024-12-23 09:41:36 +00:00

Boris Verkhovskiy 434f15cd10 .markdown -> .md

2024-12-08 20:03:01 -07:00

language: PCRE

filename: pcre.txt

contributors: Sachin Divekar (http://github.com/ssd532)

A regular expression (regex or regexp for short) is a special text string for describing a search pattern. For example, to extract the protocol from a URL string we can use `/^[a-z]+:/`, which matches `http:` in `http://github.com/`.
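As a rough illustration, the pattern above can be tried in Python. This is only a sketch: Python's built-in `re` module is not PCRE, but it accepts the same syntax for this simple pattern. The slashes in `/^[a-z]+:/` are delimiters and are not part of the pattern itself.

```python
import re

# The pattern from the text, without the surrounding / delimiters.
pattern = re.compile(r"^[a-z]+:")

url = "http://github.com/"
match = pattern.search(url)

# group(0) is the text matched by the whole pattern.
protocol = match.group(0) if match else None
print(protocol)  # http:
```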

PCRE (Perl Compatible Regular Expressions) is a C library implementing regex. It was written in 1997, when Perl was the de facto choice for complex text processing tasks. The pattern syntax used in PCRE closely resembles that of Perl. PCRE is used in many large projects, including PHP, Apache, and R.

There are two different sets of metacharacters:

Those that are recognized anywhere in the pattern except within square brackets

  \      general escape character with several uses
  ^      assert start of string (or line, in multiline mode)
  $      assert end of string (or line, in multiline mode)
  .      match any character except newline (by default)
  [      start character class definition
  |      start of alternative branch
  (      start subpattern
  )      end subpattern
  ?      extends the meaning of (
         also 0 or 1 quantifier
         also quantifier minimizer
  *      0 or more quantifier
  +      1 or more quantifier
         also "possessive quantifier"
  {      start min/max quantifier
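A few of the metacharacters listed above can be seen in action with a short sketch. It uses Python's `re` module as a stand-in, which is an assumption of this example rather than PCRE itself; the syntax shown here behaves the same way in both. The sample strings are made up, and possessive quantifiers are not shown.

```python
import re

sample = 'cat "one" dog "two" caaat'  # made-up sample text

# | separates alternatives; ( ) groups them into a subpattern.
animals = re.findall(r"(cat|dog)", sample)

# {m,n} is a min/max quantifier: here, two or three "a" characters.
a_runs = re.findall(r"ca{2,3}t", sample)

# A quantifier is greedy by default; a trailing ? minimizes it.
greedy = re.findall(r'".+"', sample)
lazy = re.findall(r'".+?"', sample)

# ^ and $ assert the start and end of the string.
starts_with_cat = bool(re.search(r"^cat", sample))
ends_with_cat = bool(re.search(r"cat$", sample))

# \ escapes a metacharacter so that it matches literally.
literal_dot = re.findall(r"\.", "a.b")

print(animals)          # ['cat', 'dog']
print(a_runs)           # ['caaat']
print(greedy)           # ['"one" dog "two"']
print(lazy)             # ['"one"', '"two"']
print(starts_with_cat)  # True
print(ends_with_cat)    # False
print(literal_dot)      # ['.']
```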

Those that are recognized within square brackets. The part of a pattern inside square brackets is called a character class.

  \      general escape character
  ^      negate the class, but only if the first character
  -      indicates character range
  [      POSIX character class (only if followed by POSIX syntax)
  ]      terminates the character class
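The behaviour of these metacharacters inside a class can be sketched as follows. Python's `re` module is used as a stand-in (an assumption of this example); it does not support POSIX classes such as `[[:digit:]]`, so those are left out. The sample string is made up.

```python
import re

sample = "a-1 b^2 c]3"  # made-up sample text

# - between two characters denotes a range.
digits = re.findall(r"[0-9]", sample)

# ^ as the first character negates the class.
not_digit_or_space = re.findall(r"[^0-9 ]", sample)

# ^ anywhere else in the class is a literal caret.
caret_or_a = re.findall(r"[a^]", sample)

# \ escapes characters that would otherwise be special, such as ] and -.
bracket_or_hyphen = re.findall(r"[\]\-]", sample)

print(digits)              # ['1', '2', '3']
print(not_digit_or_space)  # ['a', '-', 'b', '^', 'c', ']']
print(caret_or_a)          # ['a', '^']
print(bracket_or_hyphen)   # ['-', ']']
```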

PCRE also provides some generic character types, which act as shorthand for common character classes.

  \d     any decimal digit
  \D     any character that is not a decimal digit
  \h     any horizontal white space character
  \H     any character that is not a horizontal white space character
  \s     any white space character
  \S     any character that is not a white space character
  \v     any vertical white space character
  \V     any character that is not a vertical white space character
  \w     any "word" character
  \W     any "non-word" character
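A minimal sketch of the generic character types, again using Python's `re` module as a stand-in for PCRE (an assumption of this example). Python's `re` does not implement `\h` and `\H`, and its `\v` means only the vertical tab character, so those four types are omitted here. The sample string is made up.

```python
import re

sample = "id_7 = 42\tok"  # made-up sample text with two spaces and one tab

digits = re.findall(r"\d+", sample)      # runs of decimal digits
non_digits = re.findall(r"\D+", sample)  # runs of anything else
words = re.findall(r"\w+", sample)       # letters, digits and underscore
non_words = re.findall(r"\W", sample)    # single non-word characters
spaces = re.findall(r"\s", sample)       # single white space characters
non_spaces = re.findall(r"\S+", sample)  # runs of non-white-space characters

print(digits)      # ['7', '42']
print(non_digits)  # ['id_', ' = ', '\tok']
print(words)       # ['id_7', '42', 'ok']
print(non_words)   # [' ', '=', ' ', '\t']
print(spaces)      # [' ', ' ', '\t']
print(non_spaces)  # ['id_7', '=', '42', 'ok']
```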

Examples

We will test our examples on the following string:

66.249.64.13 - - [18/Sep/2004:11:07:48 +1000] "GET /robots.txt HTTP/1.0" 200 468 "-" "Googlebot/2.1"

It is a single entry from a standard Apache access log.

| Regex | Result | Comment |
| --- | --- | --- |
| `GET` | GET | `GET` matches the characters GET literally (case sensitive) |
| `\d+\.\d+\.\d+\.\d+` | 66.249.64.13 | `\d+` matches a digit [0-9] one or more times, as defined by the `+` quantifier; `\.` matches `.` literally |
| `(\d+\.){3}\d+` | 66.249.64.13 | `(\d+\.){3}` matches the group (`\d+\.`) exactly three times |
| `\[.+\]` | [18/Sep/2004:11:07:48 +1000] | `\[` and `\]` match the brackets literally; `.` is any character (except newline), and `.+` matches one or more of them |
| `^\S+` | 66.249.64.13 | `^` means start of the line; `\S+` matches one or more non-white-space characters |
| `\+[0-9]+` | +1000 | `\+` matches the character `+` literally. The character class `[0-9]` matches a single digit, and the following `+` repeats it one or more times. The same can be achieved using `\+\d+` |
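The patterns in the table can be checked against the log entry with a short script. This is a sketch that uses Python's `re` module rather than PCRE itself (an assumption of this example); all of these patterns use syntax the two share. The second pattern is written with escaped dots so that `\.` only matches a literal dot.

```python
import re

log = ('66.249.64.13 - - [18/Sep/2004:11:07:48 +1000] '
       '"GET /robots.txt HTTP/1.0" 200 468 "-" "Googlebot/2.1"')

patterns = [
    r"GET",
    r"\d+\.\d+\.\d+\.\d+",
    r"(\d+\.){3}\d+",
    r"\[.+\]",
    r"^\S+",
    r"\+[0-9]+",
]

# Map each pattern to the text of its first match in the log entry.
results = {p: re.search(p, log).group(0) for p in patterns}

for pattern, matched in results.items():
    print(f"{pattern:22} {matched}")
```

Note that `group(0)` returns the whole match, so `(\d+\.){3}\d+` yields the full address even though its capturing group only holds the last repetition.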
