How to improve a regular expression when fetching records dynamically

data-mining python regex
2022-03-16 21:17:41

The data looks like this:

COL1    COL2

 12    :402:agsh,hhjd,:45:hghgh,gruru,:12:fgh,ghgh,:22:hhhh
 57    :42:agshhhjd,:57:hghgh,gruru,:120:fghghgh,:12:hhhhhh

I am creating a third column, field_info, for example:

 COL1  COL2                                                      field_info

 12    :402:agsh,hhjd,:45:hghgh,gruru,:12:fgh,ghgh,:22:hhhh      fgh,ghg
 57    :42:agshhhjd,:57:hghgh,gruru,:120:fghghgh,:12:hhhhhh      hghgh,g

I am using the following regular-expression call:

df.loc[:,'field_info']=df.col2.replace(regex=r'.*'+ df.col1.astype('str') +':(.{15}).*',value="\\1")

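I have two columns, col1 and col2. For each row, I take the value in col1, search for it dynamically in col2, and extract the 15 characters that follow it. However, this takes a long time. Can anyone suggest a faster approach?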
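To make the intent concrete, here is a minimal single-row sketch of what the expression above is meant to do, using the standard `re` module rather than pandas. The helper name `extract_for_row` and its `width` parameter are hypothetical and exist only for illustration.

```python
import re

def extract_for_row(key, text, width=15):
    """Build the question's pattern from one row's key and apply it to that row's text."""
    # Same shape as the original pattern: anything, the key, a colon,
    # then `width` captured characters, then the rest of the string.
    pattern = ".*" + str(key) + ":(.{" + str(width) + "}).*"
    # Replace the whole match with the captured group.
    return re.sub(pattern, r"\1", text)

row1 = extract_for_row(12, ":402:agsh,hhjd,:45:hghgh,gruru,:12:fgh,ghgh,:22:hhhh")
row2 = extract_for_row(57, ":42:agshhhjd,:57:hghgh,gruru,:120:fghghgh,:12:hhhhhh")
short1 = extract_for_row(12, ":402:agsh,hhjd,:45:hghgh,gruru,:12:fgh,ghgh,:22:hhhh", width=7)
```

With `width=15` this returns 15 characters after the key, whereas the sample field_info values shown earlier are 7 characters long, so the width may need adjusting. Because of the leading greedy `.*`, the pattern settles on the last position where the key is followed by a colon and enough characters.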

2 Answers
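def extract_substring(string, num):
    # Build the ':<num>:' marker from this row's col1 value and locate it in col2
    marker = ':' + str(num) + ':'
    place = string.find(marker)
    if place == -1:
        return None  # marker not present in this row
    starting_point = place + len(marker)
    # Keep the 15 characters that follow the marker
    return string[starting_point:(starting_point + 15)]

df['field_info'] = df.apply(lambda row: extract_substring(row['col2'], row['col1']), axis=1)
# For comparison (IPython only), time the original regex version:
%timeit df.loc[:,'field_data']=df.col2.replace(regex=r'.*'+ df.col1.astype('str') +':(.{15}).*',value="\\1")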

This should also work, and it does not use a regular expression.
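Another regex-free sketch along the same lines: row-wise `df.apply(..., axis=1)` builds a Series for every row, so iterating over the two columns with `zip` is often quicker. The names `extract_after_marker` and `extract_all` are hypothetical, and the sketch assumes each key appears in the text as `:<key>:`.

```python
def extract_after_marker(text, key, width=15):
    """Return up to `width` characters following ':<key>:' in `text`, or None if absent."""
    marker = ":" + str(key) + ":"
    pos = text.find(marker)
    if pos == -1:
        return None
    start = pos + len(marker)
    return text[start:start + width]

def extract_all(keys, texts, width=15):
    """Pair each key with its own text and extract row by row."""
    return [extract_after_marker(t, k, width) for k, t in zip(keys, texts)]

# Plain lists stand in for the two DataFrame columns here.
col1 = [12, 57]
col2 = [
    ":402:agsh,hhjd,:45:hghgh,gruru,:12:fgh,ghgh,:22:hhhh",
    ":42:agshhhjd,:57:hghgh,gruru,:120:fghghgh,:12:hhhhhh",
]
field_info = extract_all(col1, col2)

# With a DataFrame, the same call would read:
# df["field_info"] = extract_all(df["col1"], df["col2"])
```

Requiring the colons on both sides of the key keeps `12` from matching inside `120` or `412`, which a bare `find('12')` would not guard against.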

Based on your sample data, which I replicated 50,000 times, the results are as follows:

>>> import pandas as pd
>>> base = pd.DataFrame({'COL1':[12 ,57],'COL2': [':402:agsh,hhjd,:45:hghgh,gruru,:12:fgh,ghgh,:22:hhhh',':42:agshhhjd,:57:hghgh,gruru,:120:fghghgh,:12:hhhhhh']})

>>> # DataFrame.append was removed in pandas 2.0, so repeat the two sample rows with pd.concat instead
>>> df = pd.concat([base] * 50001, ignore_index=True)

>>> df.shape
(100002, 2)

Then I defined a custom function and applied it to the columns:

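>>> def somefunc(x,y):
        res = []
        for i in range(len(x)):
            # position just past the COL1 value inside the COL2 string
            ix = y[i].find(x[i]) + len(x[i])
            # skip the ':' after the value and keep the next 7 characters
            res.append(y[i][ix+1:ix+8])
        return res

>>> df['col3'] = somefunc(df['COL1'].astype(str),df['COL2'])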
>>> df.head()
    COL1                                   COL2             col3 
0    12  :402:agsh,hhjd,:45:hghgh,gruru,:12:fgh,ghgh,:2...  fgh,ghg
1    57  :42:agshhhjd,:57:hghgh,gruru,:120:fghghgh,:12:...  hghgh,g
2    12  :402:agsh,hhjd,:45:hghgh,gruru,:12:fgh,ghgh,:2...  fgh,ghg
3    57  :42:agshhhjd,:57:hghgh,gruru,:120:fghghgh,:12:...  hghgh,g

I did not use a regular expression, and this function takes nearly 5 seconds to process 100,000 rows.
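If a regular expression is still preferred, one option is to compile one pattern per distinct key and reuse it, which may help when col1 holds only a few distinct values, as in the sample. The sketch below assumes the `:<key>:` delimiter format; `compiled_pattern` and `extract_with_regex` are hypothetical names, and no timing is claimed for it.

```python
import re
from functools import lru_cache

@lru_cache(maxsize=None)
def compiled_pattern(key, width):
    """Compile ':<key>:' followed by exactly `width` captured characters (cached per key)."""
    # re.escape guards against keys that contain regex metacharacters.
    return re.compile(":" + re.escape(str(key)) + ":(.{" + str(width) + "})")

def extract_with_regex(keys, texts, width=15):
    """Search each text with the pattern built from its own key; None when nothing matches."""
    out = []
    for key, text in zip(keys, texts):
        match = compiled_pattern(key, width).search(text)
        out.append(match.group(1) if match else None)
    return out

keys = [12, 57]
texts = [
    ":402:agsh,hhjd,:45:hghgh,gruru,:12:fgh,ghgh,:22:hhhh",
    ":42:agshhhjd,:57:hghgh,gruru,:120:fghghgh,:12:hhhhhh",
]
result = extract_with_regex(keys, texts)
```

Unlike the greedy pattern in the question, `search` returns the first occurrence of the marker, and a row with fewer than `width` characters after the marker yields None instead of the unchanged string.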