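Python libraries for computing address similarity and extracting province, city, and district information

Preface

At work I often need to compute the similarity between addresses or place names in geographic data, or to extract the province, city, and district from an address, so here I note down a few Python libraries that I have found useful.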

[MGeo application] Comparing address similarity with an AI model

# pip install cryptography
# pip install "modelscope[nlp]" -f https://modelscope.oss-cn-beijing.aliyuncs.com/releases/repo.html

from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

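# Sentence-pair similarity task, using the MGeo geographic entity alignment model
task = Tasks.sentence_similarity
model = 'damo/mgeo_geographic_entity_alignment_chinese_base'
# The two addresses to compare, passed as a tuple
inputs = ('紫萱路363号人力社保局', '紫萱路363号市人社局')
pipeline_ins = pipeline(task=task, model=model, model_revision='v1.2.0')
print(pipeline_ins(input=inputs))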

{'scores': [0.06451418250799179, 0.9217355251312256, 0.013750356622040272], 'labels': ['partial_match', 'exact_match', 'not_match']}
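The result holds two parallel lists, `scores` and `labels`, so the predicted relation is the label at the position of the highest score. A minimal sketch of reading it out, where `best_label` is a hypothetical helper name and the dictionary is copied from the output printed above:

```python
def best_label(result):
    """Return the (label, score) pair with the highest score."""
    pairs = zip(result['labels'], result['scores'])
    return max(pairs, key=lambda pair: pair[1])


# Output of the similarity pipeline, copied from the run above
result = {
    'scores': [0.06451418250799179, 0.9217355251312256, 0.013750356622040272],
    'labels': ['partial_match', 'exact_match', 'not_match'],
}

label, score = best_label(result)
print(label, round(score, 4))  # exact_match 0.9217
```

Pairing labels with scores, rather than assuming a fixed label order, keeps the helper correct even if the order of the two lists changes.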

[MGeo application] Splitting an address into province, city, district, and street with an AI model

from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

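# Token classification task, using the MGeo geographic element tagging model
task = Tasks.token_classification
model = 'damo/mgeo_geographic_elements_tagging_chinese_base'
inputs = '浙江省杭州市余杭区阿里巴巴西溪园区'
# No model_revision is passed here, so modelscope falls back to a default revision
pipeline_ins = pipeline(task=task, model=model)
print(pipeline_ins(input=inputs))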

2023-11-03 11:20:00,441 - modelscope - WARNING - Model revision not specified, use revision: v1.1.1

{'output': [{'type': 'prov', 'start': 0, 'end': 3, 'prob': 0.008959547, 'span': '浙江省'}, {'type': 'city', 'start': 3, 'end': 6, 'prob': 0.00046638038, 'span': '杭州市'}, {'type': 'district', 'start': 6, 'end': 9, 'prob': 0.0005808351, 'span': '余杭区'}, {'type': 'poi', 'start': 9, 'end': 17, 'prob': 0.0004953872, 'span': '阿里巴巴西溪园区'}]}
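Each item in `output` carries a `type` tag (`prov`, `city`, `district`, and `poi` in this run) and the matched `span`. A minimal sketch of collecting the administrative parts into a dictionary, where `extract_admin_parts` is a hypothetical helper name and the input is copied from the output printed above:

```python
def extract_admin_parts(result, wanted=('prov', 'city', 'district')):
    """Map each wanted tag type to the first span tagged with it."""
    parts = {}
    for item in result['output']:
        tag = item['type']
        # Keep only the first occurrence of each wanted tag
        if tag in wanted and tag not in parts:
            parts[tag] = item['span']
    return parts


# Output of the tagging pipeline, copied from the run above
result = {'output': [
    {'type': 'prov', 'start': 0, 'end': 3, 'prob': 0.008959547, 'span': '浙江省'},
    {'type': 'city', 'start': 3, 'end': 6, 'prob': 0.00046638038, 'span': '杭州市'},
    {'type': 'district', 'start': 6, 'end': 9, 'prob': 0.0005808351, 'span': '余杭区'},
    {'type': 'poi', 'start': 9, 'end': 17, 'prob': 0.0004953872, 'span': '阿里巴巴西溪园区'},
]}

print(extract_admin_parts(result))
# {'prov': '浙江省', 'city': '杭州市', 'district': '余杭区'}
```

Tags missing from an address are simply absent from the returned dictionary, so callers can use `parts.get('city')` to handle incomplete addresses.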

Extracting province, city, and district information from addresses with the cpca library

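# cpca (short for Chinese Province City Area) is a Python library that identifies the province, city, and district in Simplified Chinese strings, and also supports mapping, validation, and simple plotting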
# pip install cpca

import cpca 
location_str = ["徐汇区虹漕路461号58号楼5楼", "泉州市洛江区万安塘西工业区", "朝阳区北苑华贸城"]
df = cpca.transform(location_str)
df

	省	市	区	地址	adcode
0	上海市	市辖区	徐汇区	虹漕路461号58号楼5楼	310104
1	福建省	泉州市	洛江区	万安塘西工业区	350504
2	吉林省	长春市	朝阳区	北苑华贸城	220104
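In the table above, the columns 省, 市, 区, and 地址 hold the province, city, district, and the remaining address. An input that starts at the district level, such as "朝阳区北苑华贸城", can be ambiguous because more than one city has a 朝阳区; the `adcode` column shows which one was chosen (Changchun in Jilin in this run). The sketch below splits an adcode into its province-, city-, and district-level codes. It assumes the adcode follows the usual six-digit administrative division code layout (two digits each for province, city, and district), and `split_adcode` is a hypothetical helper name:

```python
def split_adcode(adcode):
    """Split a six-digit adcode into province-, city-, and district-level codes."""
    code = str(adcode)
    if len(code) != 6 or not code.isdigit():
        raise ValueError(f"expected a six-digit code, got {adcode!r}")
    # Assumption: digits 1-2 identify the province, 3-4 the city, 5-6 the district
    return code[:2] + '0000', code[:4] + '00', code


# adcode values taken from the table above
for adcode in ['310104', '350504', '220104']:
    print(split_adcode(adcode))
# ('310000', '310100', '310104')
# ('350000', '350500', '350504')
# ('220000', '220100', '220104')
```

Comparing the province-level code against the one you expect is a cheap way to flag rows where a bare district name was resolved to the wrong city.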

Computing similarity with the similarities library

from similarities import Similarity
simi = Similarity(model_name_or_path="/data/project/backup/model/text2vec-base-chinese")

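def calc_similarity(x, y):
    """Similarity between one sentence and another (one-to-one)."""
    return simi.similarity(x, y).item()


def calc_multi_similarities(X, Y):
    """Pairwise similarity between two lists of sentences (many-to-many)."""
    scores = simi.similarity(X, Y)
    scores = scores.numpy()
    return scores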

2023-11-03 14:38:41.139 | DEBUG | text2vec.sentence_model:__init__:80 - Use device: cpu

calc_similarity('方兴路10号', '方新路1号')

0.8061661720275879

X = ['方兴路10号', '方新路1号']
Y = [
    '龙池街道方新路1号(化工园地铁站1号口步行290米)',
    '方兴路10号金盛国际家居',
    '龙池街道方新路1号(化工园地铁站1号口步行290米)',
    '陶瓷二街金盛建材家具城',
    '综合大道与卫浴大道交叉口西40米',
    '龙池街道方新路1号附近',
    '方兴路10号',
    '方新路1号',
    '陶瓷一街与金盛大道交叉口北40米',
]
calc_multi_similarities(X, Y)

array([[0.5751205 , 0.8057978 , 0.5751205 , 0.5116297 , 0.4892777 ,
        0.6687851 , 1.        , 0.8061661 , 0.51308894],
       [0.6791094 , 0.7076333 , 0.6791094 , 0.5441158 , 0.5032858 ,
        0.7903277 , 0.806166  , 0.99999994, 0.54909706]], dtype=float32)
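Row i of the score matrix compares X[i] with every entry of Y, so the closest candidate for each query is the column with the largest value in that row. A minimal sketch of picking it out, where `best_matches` is a hypothetical helper name and the scores are copied from the output above (the helper iterates row by row, so it should also accept the NumPy array directly):

```python
def best_matches(queries, candidates, scores):
    """For each query, return (query, best candidate, best score)."""
    results = []
    for query, row in zip(queries, scores):
        row = list(row)
        # Index of the highest score in this row
        best_index = max(range(len(row)), key=row.__getitem__)
        results.append((query, candidates[best_index], row[best_index]))
    return results


X = ['方兴路10号', '方新路1号']
Y = [
    '龙池街道方新路1号(化工园地铁站1号口步行290米)',
    '方兴路10号金盛国际家居',
    '龙池街道方新路1号(化工园地铁站1号口步行290米)',
    '陶瓷二街金盛建材家具城',
    '综合大道与卫浴大道交叉口西40米',
    '龙池街道方新路1号附近',
    '方兴路10号',
    '方新路1号',
    '陶瓷一街与金盛大道交叉口北40米',
]
# Scores copied from the printed array above
scores = [
    [0.5751205, 0.8057978, 0.5751205, 0.5116297, 0.4892777,
     0.6687851, 1.0, 0.8061661, 0.51308894],
    [0.6791094, 0.7076333, 0.6791094, 0.5441158, 0.5032858,
     0.7903277, 0.806166, 0.99999994, 0.54909706],
]

for query, match, score in best_matches(X, Y, scores):
    print(query, '->', match, score)
# 方兴路10号 -> 方兴路10号 1.0
# 方新路1号 -> 方新路1号 0.99999994
```

Here each query also appears verbatim in Y, so the best match is the query itself. Note that the two different addresses '方兴路10号' and '方新路1号' still score about 0.81 against each other, so any acceptance threshold should be chosen with care.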

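References:
1. [MGeo application] Using Python and an AI model to split addresses in Excel into province, city, district, and street
2. [MGeo application] Using Python and an AI model to compare address similarity
3. One technique a day: how to extract province, city, and district information with Python?
4. Similarities: Similarity Calculation and Semantic Search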