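Advanced Python regular expressions: groups and capturing

Preface

First, what we want to achieve: match text with a regular expression, stitch together parts of what was matched, and replace the matched text with the result. Sounds confusing? From what I found online, this is called "grouping and capturing". An example should make it clearer: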

<link rel="pingback" href="https://www.sitstars.com/action/xmlrpc" />
<link rel="EditURI" type="application/rsd+xml" title="RSD" href="https://www.sitstars.com/action/xmlrpc?rsd" />
<link rel="wlwmanifest" type="application/wlwmanifest+xml" href="https://www.sitstars.com/action/xmlrpc?wlw" />
<link rel="alternate" type="application/rss+xml" title="python+百度云api写一个文字识别程序 » Shu's Garden » RSS 2.0" href="https://www.sitstars.com/feed/archives/39/" />
<link rel="alternate" type="application/rdf+xml" title="python+百度云api写一个文字识别程序 » Shu's Garden » RSS 1.0" href="https://www.sitstars.com/feed/rss/archives/39/" />
<link rel="alternate" type="application/atom+xml" title="python+百度云api写一个文字识别程序 » Shu's Garden » ATOM 1.0" href="https://www.sitstars.com/feed/atom/archives/39/" />
Other content...

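Suppose I want to extract the values of rel="#" and href="#" inside each <>, join the two # values with a comma, and replace the whole <...> tag with the result. In other words, the final text should look like this: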

pingback,https://www.sitstars.com/action/xmlrpc
EditURI,https://www.sitstars.com/action/xmlrpc?rsd
wlwmanifest,https://www.sitstars.com/action/xmlrpc?wlw
alternate,https://www.sitstars.com/feed/archives/39/
alternate,https://www.sitstars.com/feed/rss/archives/39/
alternate,https://www.sitstars.com/feed/atom/archives/39/
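Other content...

How should we do it?

Matching is easy: <\w+\srel="(\w+)".+href="(\S+)".+> will do (written off the cuff, regex experts please go easy on me). Note that the two parts wrapped in parentheses are the two groups we need.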
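
To make the two groups concrete, here is a minimal sketch that runs this pattern against the first sample line with re.search. The variable names are my own, chosen only for illustration.

```python
import re

# The pattern from the text, written as a raw string
pattern = re.compile(r'<\w+\srel="(\w+)".+href="(\S+)".+>')

# First line of the sample HTML
line = '<link rel="pingback" href="https://www.sitstars.com/action/xmlrpc" />'

match = pattern.search(line)
rel_value = match.group(1)    # text captured by the first pair of parentheses
href_value = match.group(2)   # text captured by the second pair of parentheses
whole_match = match.group(0)  # the entire matched text

print(rel_value)   # pingback
print(href_value)  # https://www.sitstars.com/action/xmlrpc
```

group(0) is the whole tag, which is exactly the text that needs to be replaced, while group(1) and group(2) are the pieces to join.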

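In some tools (for example the online regex tester on 站长工具), you can write $1,$2 in the replacement, where $1 stands for the text captured by the first group and $2 for the second. Most tools, though, do not seem to have this feature...

The question came up tonight while I was tidying my notes. After converting a document copied from Word to markdown, the footnotes were littered with odd symbols. I wanted to batch-replace them with a regex in vnote, only to find that vnote does not support groups. What about Word? Sadly, its regex support turned out to be very limited; at least, when I pasted my regular expression in, it found nothing at all. So what now? Naturally I thought of Python: surely this would be an easy replacement in Python?

The code was soon written and the batch replacement done (for the actual code, see my post on automatically replacing markdown image links). But then I fell to thinking: how can Python lack such an important feature, so that I need a whole block of code to get it indirectly? (The match objects returned by re.match and re.search do have a group() method for reading groups, but I really could not see how to do a batch replacement with it without a loop.) I also remembered that I had racked my brains for a while over the same code the last time I wrote a script. Why not turn it into a small program that I can simply call in future and save a lot of time? No sooner said than done: I opened Python and started writing code (read: slacking off).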

Code

import re

def regular_mode(regulation, sub_org, file_path):
    sub_list = []  # (group number, position) pairs, one per '!@N' placeholder
    sub_copy = sub_org
    while re.search(r'!@\d', sub_copy):
        sub_count = re.search(r'!@\d', sub_copy).group()  # next '!@N' placeholder
        sub_index = sub_copy.index(sub_count)  # position of that placeholder
        sub_list.append((sub_copy[sub_index + 2], sub_index))  # (digit, position)
        # Replace '!@N' with 'N' so the loop moves on to the next placeholder
        sub_copy = sub_copy.replace(sub_count, sub_copy[sub_index + 2], 1)

    pattern = re.compile(regulation)

    with open(file_path, 'r+', encoding="utf-8") as handler:
        content = handler.read()

        # Caveat: this loop never ends if the replacement itself matches the pattern
        while pattern.search(content):
            sub_pat = pattern.search(content)
            sub_str = ''
            start = 0
            count = 0
            for idx in sub_list:
                sub_str += sub_org[start:idx[1] + count] + sub_pat.group(int(idx[0]))
                # Positions were recorded after the earlier '!@' prefixes had been
                # stripped, so shift them back by 2 per placeholder already handled
                count += 2
                start = idx[1] + count + 1
            sub_str += sub_org[start:]  # keep any text after the last placeholder
            content = content.replace(sub_pat.group(), sub_str, 1)

        # Overwrite the file only once the new content is ready
        handler.seek(0)
        handler.truncate()
        handler.write(content)

def main():
    regulation = r'\[\\\[(\d+)\\\]\]\(#_ftnref\d+\)'  # the regex to match
    sub_org = '[^!@1]:'  # '!@N' refers to group N; no escaping needed here
    file_path = 'test.md'
    regular_mode(regulation, sub_org, file_path)

if __name__ == '__main__':
    main()

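With this we can group, match, and stitch replacements together however we like. Here !@ marks a group index, so the opening example can be handled with !@1,!@2 as the replacement. Yes! Perfect. I opened my blog, ready to churn out a post.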
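
For trying the !@N placeholder idea without touching a file, the sketch below works on plain strings. It is not the function above but a compact variant of the same idea; the helper names parse_template, build_replacement and replace_all are hypothetical. It walks over each match once, so it cannot loop forever when the replacement happens to match the pattern.

```python
import re

def parse_template(template):
    """Split a template such as '!@1,!@2' into literal text and group numbers."""
    parts = []
    last_end = 0
    for m in re.finditer(r'!@(\d)', template):
        parts.append(template[last_end:m.start()])  # literal text before the placeholder
        parts.append(int(m.group(1)))               # group number of the placeholder
        last_end = m.end()
    parts.append(template[last_end:])               # literal text after the last placeholder
    return parts

def build_replacement(parts, match):
    """Fill in the group numbers with the text captured by this match."""
    return ''.join(match.group(p) if isinstance(p, int) else p for p in parts)

def replace_all(regulation, template, text):
    """Replace every match of regulation in text using the '!@N' template."""
    parts = parse_template(template)
    pieces = []
    pos = 0
    for match in re.finditer(regulation, text):
        pieces.append(text[pos:match.start()])      # untouched text before the match
        pieces.append(build_replacement(parts, match))
        pos = match.end()
    pieces.append(text[pos:])                       # untouched text after the last match
    return ''.join(pieces)

sample = ('<link rel="pingback" href="https://www.sitstars.com/action/xmlrpc" />\n'
          'Other content')
link_result = replace_all(r'<\w+\srel="(\w+)".+href="(\S+)".+>', '!@1,!@2', sample)
print(link_result)

footnote_result = replace_all(r'\[\\\[(\d+)\\\]\]\(#_ftnref\d+\)', '[^!@1]:',
                              '[\\[1\\]](#_ftnref1) some note')
print(footnote_result)  # [^1]: some note
```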

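Simpler code

That's right, as you have probably noticed, this post does not open with a github link, nor does it explain my approach (even though it took me more than two hours to write). Here is what happened: just as I was about to write the post, it suddenly occurred to me that Cui Qingcai's web-scraping tutorial, which I studied a while back, also used regular expressions. Let's see how he handled this problem.

So I opened my notes.

And I froze.

Yes, there clearly was a simpler solution! Code first: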

import re

pattern = re.compile(r'<\w+\srel="(\w+)".+href="(\S+)".+>')

with open('test.txt', 'r', encoding="gbk") as handler:
    content = handler.read()

items = pattern.findall(content)  # one (rel, href) tuple per match
for item in items:
    sub_str = item[0] + ',' + item[1]
    # Replace the first remaining match, which corresponds to the current item
    content = content.replace(pattern.search(content).group(), sub_str, 1)
print(content)

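Here is how it works: once the pattern contains capturing parentheses, re.findall no longer returns the full matched strings. With a single group it returns a list of that group's text, and with several groups it returns a list of tuples, one tuple of groups per match. Taking the opening example again, re.findall gives: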

[('pingback', 'https://www.sitstars.com/action/xmlrpc'), 
('EditURI', 'https://www.sitstars.com/action/xmlrpc?rsd'), 
('wlwmanifest', 'https://www.sitstars.com/action/xmlrpc?wlw'),
('alternate', 'https://www.sitstars.com/feed/archives/39/'),
('alternate', 'https://www.sitstars.com/feed/rss/archives/39/'),
('alternate', 'https://www.sitstars.com/feed/atom/archives/39/')]

With that, combining and replacing becomes easy. A loop is still needed, but this approach is clearly far simpler and more efficient.
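
The way re.findall changes its return value depending on the number of groups is easy to trip over, so here is a small sketch on a made-up string (the sample text and variable names are assumptions for illustration):

```python
import re

text = 'rel="pingback" rel="EditURI"'

# No capturing group: a list of the whole matches
whole = re.findall(r'rel="\w+"', text)

# One capturing group: a list of that group's strings
one_group = re.findall(r'rel="(\w+)"', text)

# Two capturing groups: a list of tuples, one tuple per match
two_groups = re.findall(r'(\w+)="(\w+)"', text)

# re.finditer yields match objects, so the whole match and the groups
# are available together
both = [(m.group(0), m.group(1)) for m in re.finditer(r'rel="(\w+)"', text)]

print(whole)       # ['rel="pingback"', 'rel="EditURI"']
print(one_group)   # ['pingback', 'EditURI']
print(two_groups)  # [('rel', 'pingback'), ('rel', 'EditURI')]
print(both)        # [('rel="pingback"', 'pingback'), ('rel="EditURI"', 'EditURI')]
```

When both the matched text and its groups are needed, as in the replacement loop here, re.finditer avoids having to search for each match a second time.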

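To think I spent that long on this feature... my Python basics are clearly not up to scratch. Enough said, I'm going to go sulk.

Oh, right, and the notes are still not tidied up QAQ.

Update

I really was being dense. In the replacement string of Python's re.sub, \1 stands for the first group and \2 for the second, so the task above comes down to the following code: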

import re

pattern = re.compile(r'<\w+\srel="(\w+)".+href="(\S+)".+>')

with open('test.txt', 'r', encoding="gbk") as handler:
    content = handler.read()

content = pattern.sub(r'\1,\2', content)  # \1 and \2 refer to groups 1 and 2
print(content)
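
As a quick check of the backreference approach, the sketch below applies it to in-memory strings instead of a file. It also covers the footnote case from the Code section. The sample strings and variable names are assumptions for illustration; \g<1> is the re module's alternative spelling of \1, which helps when the reference is followed directly by a digit.

```python
import re

html = ('<link rel="pingback" href="https://www.sitstars.com/action/xmlrpc" />\n'
        '<link rel="EditURI" type="application/rsd+xml" title="RSD" '
        'href="https://www.sitstars.com/action/xmlrpc?rsd" />')

link_pattern = re.compile(r'<\w+\srel="(\w+)".+href="(\S+)".+>')

# Numbered backreferences in the replacement string
by_number = link_pattern.sub(r'\1,\2', html)

# The same replacement written with the \g<N> form
by_g_form = link_pattern.sub(r'\g<1>,\g<2>', html)

# The footnote clean-up: [\[1\]](#_ftnref1) becomes [^1]:
footnote_pattern = re.compile(r'\[\\\[(\d+)\\\]\]\(#_ftnref\d+\)')
footnote = footnote_pattern.sub(r'[^\1]:', '[\\[1\\]](#_ftnref1) some note')

print(by_number)
print(footnote)  # [^1]: some note
```

Always write the replacement as a raw string (r'...'); without the r prefix, Python turns '\1' into the control character with code 1 before re.sub ever sees it.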

Copyright: the author
Article link: https://www.sitstars.com/archives/42/
Please credit the source and include this notice when reposting
