Python 正则表达式特殊符号和字符

择一匹配符号： |

In [18]: bt = ‘bat|bet|bit’ #匹配 bat或bet或bit

In [19]: m = re.match(bt,’bat’)

In [20]: if m is not None: print m.group()
bat

In [21]: m = re.match(bt,’bot’) #匹配失败

In [22]: if m is not None: print m.group()

In [23]: m = re.search(bt,’he bit me’)

In [24]: if m is not None: print m.group()
bit

匹配任意单个字符（包括空格，除了换行符\n)：.

In [25]: anyend = ‘.end’

In [26]: m = re.match(anyend,’bend’) #匹配任意字符

In [28]: if m is not None:print m.group()
bend

In [29]: m = re.match(anyend,’end’) #不匹配无字符

In [30]: if m is not None:print m.group()

In [31]: m = re.match(anyend,’\nend’) #不匹配换行

In [32]: if m is not None:print m.group()

In [33]: m = re.match(anyend,’ end’) #匹配‘ ’

In [34]: if m is not None:print m.group()
end #end前有一个空格‘ ’

tip:若想搜索真正的点号可以通过 \ 转义

创建字符集: [ ]

例如：[abc]匹配a或b或c中任意一个字符

In [35]: m = re.match(‘[cr][23][dp][o2]‘,’r2d2’) #匹配成功

In [36]: if m is not None: print m.group()
r2d2

In [37]: m = re.match(‘[cr][23][dp][o2]‘,’c3do’)

In [38]: if m is not None: print m.group()
c3do

In [39]: m = re.match(‘[cr][23][dp][o2]‘,’c3eo’) #匹配失败

In [40]: if m is not None: print m.group()

分组

In [41]: m = re.match(‘(\w\w\w)-(\d\d\d)’,’abc-123’)

In [42]: m.group()
Out[42]: ‘abc-123’

In [43]: m.group(1) #子组1
Out[43]: ‘abc’

In [44]: m.group(2) #子组2
Out[44]: ‘123’

In [45]: m.groups() #全部子组，存放与元组中
Out[45]: (‘abc’, ‘123’)

group()通常用于显示所有匹配部分，但也可以用于取各个子组 groups()可以用来获取一个包含所有匹配字符串的元组

In [47]: m = re.match(‘ab’,’ab’) #不存在子组

In [48]: m.group()
Out[48]: ‘ab’

In [49]: m.groups() #不存在子组时无法匹配
Out[49]: ()

In [51]: m = re.match(‘(ab)’,’ab’) #一个子组

In [52]: m.group()
Out[52]: ‘ab’

In [53]: m.group(1)
Out[53]: ‘ab’

In [54]: m.groups()
Out[54]: (‘ab’,)

In [55]: m = re.match(‘(a(b))’,’ab’) #两个嵌套子组

In [56]: m.group()
Out[56]: ‘ab’

In [57]: m.group(1)
Out[57]: ‘ab’

In [58]: m.group(2)
Out[58]: ‘b’

In [59]: m.groups()
Out[59]: (‘ab’, ‘b’)

匹配起始结尾、边界单词

例：(tips该操作符多用于search而不是match)

^form 匹配以form作为起始的字符串
form$ 匹配以form为结尾的字符串
^form$ 等价于匹配单个form字符串
\b 匹配边界字符串
\B 匹配非边界字符串

In [81]: m = re.search(‘^is’,’is a dog’)

In [82]: if m is not None: print m.group()
is

In [83]: m = re.search(‘is$’,’is a dog’)

In [84]: if m is not None: print m.group()

In [85]: m = re.search(‘^dog$’,’is a dog’) # ^ $一起时必须完全匹配

In [86]: if m is not None: print m.group()

In [89]: m = re.search(‘^is a dog$’,’is a dog’)

In [90]: if m is not None: print m.group()
is a dog

前后出现空格或者换行时都属于\b的边界匹配

In [91]: m = re.search(r’\bdog’,’is a dog’) #匹配空格作为边界

In [92]: if m is not None: print m.group()
dog

In [93]: m = re.search(r’\bdog’,’is a\ndog’) #匹配换行符作为边界

In [94]: if m is not None: print m.group()
dog

In [95]: m = re.search(r’\bog’,’is a dog’) #匹配失败

In [96]: if m is not None: print m.group()

In [97]: m = re.search(r’\Bog’,’is a dog’) #\B匹配非边界字符串

In [98]: if m is not None: print m.group()
og

tips:使用r’xxx’的原始字符串避免正则匹配时的转义

脱字符 ^

直接使用表示匹配字符串的起始部分紧跟在左括号右边表示不匹配给定字符集，例如

[^\n] 不匹配换行符
[^aeiou]不匹配元音字符

In [11]: re.findall(‘[^aeiou]‘,’abcdefg’)
Out[11]: [‘b’, ‘c’, ‘d’, ‘f’, ‘g’]

拓展符号

`(?iLmsux)`用以标记并实现某些功能：

re.I/IGNORECASE (忽略大小写的匹配)

In [5]: re.findall(r’(?i)yes’,’Yes?yes.Yes!!’) #以(?i)形式表示
Out[5]: [‘Yes’, ‘yes’, ‘Yes’]

In [9]: re.findall(r’(?i)th\w+’,’The?tHoEs.tHat!!’) #匹配th开头且忽略大小写
Out[9]: [‘The’, ‘tHoEs’, ‘tHat’]

re.M/MULTILINE（进行跨行搜索）

In [10]: re.findall(r’(?m)(^th[\w ]+)’,””” #匹配th开头的段落
…: this is first line,
…: another line,
…: that line,it’s the best
…: “””)
Out[10]: [‘this is first line’, ‘that line’]

re.S/DOTALL (使点号 . 可以用来表示换行符\n)

In [12]: re.findall(r’(?s)(th.+)’,””” #成功匹配到\n
…: this is first line,
…: another line,
…: that line,it’s the best
…: “””)
Out[12]: [“this is first line,\nanother line,\nthat line,it’s the best\n”]

re.X/VERBOSE (抑制正则表达式中的空白符，以创建更易读的正则表达式)

tips:空格符可用[ ]等字符类代替，并且可以在正则中通过井号#来注释

In [68]: re.search(r’’’(?x)
…: $(\d{3})$ #区号
…: [ ] #空白符
…: (\d{3}) #前缀
…: - #横线
…: (\d{4}) #终点数字’’’,
…: ‘(800) 555-1212’).groups()
Out[68]: (‘800’, ‘555’, ‘1212’)

（?:）对正则表达式进行分组，但不保存改分组 常用于需要对改分组进行+*等操作，但又不需要将改分组提取的情况

In [11]: m=re.match(r’http://(?:\w+\.)*(\w+\.com)’,’http://code.google.com')

#使用(?:)时，groups中并未出现code

In [12]: m.groups()
Out[12]: (‘google.com’,)

In [13]: m=re.match(r’http://(\w+\.)*(\w+\.com)’,’http://code.google.com')

#未使用(?:)时，groups中出现code

In [14]: m.groups()
Out[14]: (‘code.’, ‘google.com’)

In [2]: re.findall(r’http://(?:\w+\.)*(\w+\.com)’,
…: ‘http://google.com http://www.google.com http://code.google.com') #仅提取域名
Out[2]: [‘google.com’, ‘google.com’, ‘google.com’]

`(?P<name>)` 使用自定义表示符，而并非1至N递增

（?P=name）用以引用前面的标记匹配。但必须是完全匹配，不是正则匹配。用\g来使用该分组

In [32]: print re.sub(r’(?P\d{3})-(?P\d{4})-(?P\d{4})’,
…: ‘\g-\g-\g‘,’111-2222-3333’) #分别用first,seond,third标记三个分组，并进行替换
2222-3333-111

In [35]: print re.sub(r’(?P\d{3})-(?P\d{4})-(?P=second)‘,
…: ‘\g-\g-\g‘,’111-2222-2222’) #还可以用（?P=name）用以引用前面的标记匹配
2222-2222-111

In [41]: re.search(r’(?P\d{3})-(?P\d{4})-(?P\d{4})’,’111-2222-3333’).groupdict()
Out[41]: {‘first’: ‘111’, ‘second’: ‘2222’, ‘third’: ‘3333’} #使用groupdict方法，返回一个字典

`（?=）`正向前视匹配断言

所谓的前视，就是往正则匹配方向的前方，我们所谓的后方进行判断。通过添加一些判断条件，使匹配更加精准。匹配后续内容等于（？=）中内容的字符串

In [53]: re.findall(r’\w+(?=@google.com)’,’’’ #匹配域名为google.com的用户名
…: host@baidu.com
…: root@google.com
…: user@qq.com
…: admin@google.com
…: ‘’’)
Out[53]: [‘root’, ‘admin’]

`(?!)`负向前视断言

匹配后续内容不等于（？！）中内容的字符串

In [73]: re.findall(r’(?m)^\w+@(?!google.com)’,’’’ #匹配域名不为google的用户名
…: host@baidu.com
…: root@google.com
…: user@qq.com
…: admin@google.com
…: ‘’’)
Out[73]: [‘host@’, ‘user@’]

非贪婪匹配？

贪婪匹配：正则表达式通过从左至右，试图尽可能多的获取匹配字符通过在 * + 后使用？进行非贪婪匹配

In [30]: data =’abcdefg123456789-6-8’ #例如我们仅想提取后面的数字串

In [31]: re.match(‘.+(\d+-\d+-\d+)’,data).group(1) #发现仅提取了最后的9，因为贪婪匹配导致前面几位数被.+所匹配
Out[31]: ‘9-6-8’

In [32]: re.match(‘.+?(\d+-\d+-\d+)’,data).group(1) #在加号 + 后加上问号？，表示对 + 进行非贪婪匹配
Out[32]: ‘123456789-6-8’

Python 正则表达式特殊符号和字符

择一匹配符号： |

匹配任意单个字符（包括空格，除了换行符\n)：.

创建字符集: [ ]

分组

匹配起始结尾、边界单词

脱字符 ^

拓展符号

`(?iLmsux)`用以标记并实现某些功能：

re.I/IGNORECASE (忽略大小写的匹配)

re.M/MULTILINE（进行跨行搜索）

re.S/DOTALL (使点号 . 可以用来表示换行符\n)

re.X/VERBOSE (抑制正则表达式中的空白符，以创建更易读的正则表达式)

`(?P<name>)` 使用自定义表示符，而并非1至N递增

`（?=）`正向前视匹配断言

`(?!)`负向前视断言

非贪婪匹配？

FEATURED TAGS

FRIENDS

择一匹配符号： |

匹配任意单个字符（包括空格，除了换行符\n)：.

创建字符集: [ ]

分组

匹配起始结尾、边界单词

脱字符 ^

拓展符号

**(?iLmsux)**用以标记并实现某些功能：

re.I/IGNORECASE (忽略大小写的匹配)

re.M/MULTILINE（进行跨行搜索）

re.S/DOTALL (使点号 . 可以用来表示换行符\n)

re.X/VERBOSE (抑制正则表达式中的空白符，以创建更易读的正则表达式)

(?P<name>) 使用自定义表示符，而并非1至N递增

（?=）正向前视匹配断言

(?!)负向前视断言

非贪婪匹配 ？

FEATURED TAGS

FRIENDS

`(?iLmsux)`用以标记并实现某些功能：

`(?P<name>)` 使用自定义表示符，而并非1至N递增

`（?=）`正向前视匹配断言

`(?!)`负向前视断言

非贪婪匹配？