Address-text relevance tasks have broad real-world applications: location-based services built on geographic search, rapid lookup and localization of addresses reported in emergencies, and alignment across different address information systems. The project therefore has wide applicability and real practical significance.

Competition page: https://tianchi.aliyun.com/competition/entrance/531901/information
Address text entered in daily life can take many different forms.
This evaluation task is an address-text relevance task: given a query address and several candidate address texts, the participating system must score how well the query matches each candidate.
This diversity of address writing styles is exactly what makes the relevance task challenging.
Input: a file containing a number of query-candidate address-text pairs.
Output: one line per query-candidate pair giving its match level: 完全匹配 (exact match), 部分匹配 (partial match), or 不匹配 (no match).
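For modeling purposes, the baseline below maps these three labels onto integer class ids. Sketched out (the dict names LABEL2ID/ID2LABEL are mine for illustration; the notebook inlines the mapping with if/elif branches, as you will see later):

# Label mapping used throughout the baseline below
LABEL2ID = {'不匹配': 0, '部分匹配': 1, '完全匹配': 2}
ID2LABEL = {v: k for k, v in LABEL2ID.items()}  # used when writing predictions back out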
!pip install paddlenlp
# !unzip 'data/data95669/“英特尔创新大师杯”深度学习挑战赛 赛道3:CCKS2021中文NLP地址相关性任务.zip'
# !mv '“英特尔创新大师杯”深度学习挑战赛 赛道3:CCKS2021中文NLP地址相关性任务' dataset
# !mv 'dataset/格式自查脚本.py' check.py
!head dataset/Xeon3NLP_round1_train_20210524.txt
{"text_id": "e225b9fd36b8914f42c188fc92e8918f", "query": "河南省巩义市新华路街道办事处桐和街6号钢苑新区3号楼一单元", "candidate": [{"text": "巩义市桐和街", "label": "不匹配"}, {"text": "桐和街依家小店", "label": "不匹配"}, {"text": "桐和街CHANG六LIULIU", "label": "不匹配"}, {"text": "桐和街佳乐钢琴", "label": "不匹配"}, {"text": "世博领秀城南门桐和街囍饭食堂", "label": "不匹配"}]}
{"text_id": "b2418ead7b48db4c09caa2934843c1b4", "query": "老垅坡高家组省建五公司", "candidate": [{"text": "高家巷省建五公司岳阳分公司", "label": "完全匹配"}, {"text": "建设北路346号省建5公司", "label": "不匹配"}, {"text": "老垅坡路西100米省建三公司(岳阳分公司)", "label": "不匹配"}, {"text": "寇庄西路101号省建五公司", "label": "不匹配"}, {"text": "卓刀泉南路21号省五建公司省建五公司", "label": "不匹配"}]}
{"text_id": "5fa94565eb53463fa94ece56e8356fdc", "query": "西关人和路gong商银行对过腾信医药一楼眼镜店", "candidate": [{"text": "河门口北街33号中国gong商银行(河门口支行)", "label": "不匹配"}, {"text": "河门口北街33号中国gong商银行ATM(河门口支行)", "label": "不匹配"}, {"text": "清香坪东街(食为天旁)中国gong商银行24小时自助银行", "label": "不匹配"}, {"text": "陶家渡东路209号中国gong商银行24小时自助银行(巴关河支行)", "label": "不匹配"}, {"text": "苏铁中路110号中国gong商银行24小时自助银行(清香坪支行)", "label": "不匹配"}]}
{"text_id": "10a2a7c833eea18f479a58f2d15a53b5", "query": "唐海县四农场场部王玉文", "candidate": [{"text": "场前路北50米曹妃甸区第四农场", "label": "部分匹配"}, {"text": "新区曹妃甸湿地曹妃湖东北侧曹妃甸慧钜文化创意产业园", "label": "不匹配"}, {"text": "建设大街255号四季华庭", "label": "不匹配"}, {"text": "曹妃甸区西环路", "label": "不匹配"}, {"text": "华兴路4附近曹妃歌厅(西门)", "label": "不匹配"}]}
{"text_id": "7b82872ddc84f94733f5dac408d98bce", "query": "真北路818号近铁城市广场北座二楼", "candidate": [{"text": "真北路818号近铁城市广场北座", "label": "部分匹配"}, {"text": "真北路818号近铁城市广场北座(西南2门)", "label": "部分匹配"}, {"text": "真北路818号近铁城市广场北座(西南1门)", "label": "部分匹配"}, {"text": "真北路818号近铁城市广场北座2层捞王火锅", "label": "部分匹配"}, {"text": "金沙江路1685号118广场F1近铁城市广场北座(西北门)", "label": "部分匹配"}]}
{"text_id": "24dbf73a46d68ef94bd69c8728adeed6", "query": "义亭工业区甘塘西路9号", "candidate": [{"text": "义亭镇甘塘西路9号秀颜化妆用具有限公司", "label": "部分匹配"}, {"text": "义亭镇甘塘西路9-1号义乌市恒凯玩具有限公司", "label": "部分匹配"}, {"text": "义乌市甘塘西路", "label": "不匹配"}, {"text": "黄金塘西路9号丹阳英福康电子科技有限公司", "label": "不匹配"}, {"text": "黄塘西路9号二楼卓悦国际舞蹈学院", "label": "不匹配"}]}
{"text_id": "a48b64238490d0447bd5b39e527b280f", "query": "溧水县永阳镇东山大队谢岗村31号", "candidate": [{"text": "溧水区谢岗", "label": "不匹配"}, {"text": "溧水区东山", "label": "部分匹配"}, {"text": "永阳镇中山大队", "label": "不匹配"}, {"text": "溧水区东山线", "label": "不匹配"}, {"text": "东山线附近东山林业队", "label": "不匹配"}]}
{"text_id": "adb0c58f99a60ae136daa5148fa188a7", "query": "纪家庙万兴家居D座二楼", "candidate": [{"text": "南三环西路玉泉营万兴家居D座2层德高防水", "label": "部分匹配"}, {"text": "玉泉营桥西万兴家居D座二层德国都芳漆", "label": "部分匹配"}, {"text": "花乡南三环玉泉营桥万兴家居D管4层羽翔红木", "label": "部分匹配"}, {"text": "花乡玉泉营桥西万兴家居D座3层69-71五洲装饰", "label": "部分匹配"}, {"text": "花乡乡南三环西路78号万兴国际家居广场纪家庙地区万兴家居广场", "label": "部分匹配"}]}
{"text_id": "78f05d3265d15acde1e98cda41cf4f12", "query": "江苏省南京市江宁区禄口街道欢墩山", "candidate": [{"text": "江宁区欢墩山", "label": "部分匹配"}]}
{"text_id": "9601ec765ee5a860992ec83b5743fe42", "query": "博美二期大门对面莲花新区7号地", "candidate": [{"text": "围场满族蒙古族自治县七号地", "label": "不匹配"}, {"text": "金梧桐宾馆东侧100米大屯7号地", "label": "不匹配"}, {"text": "中原路商隐路交汇处清华·大溪地七号院", "label": "不匹配"}, {"text": "滨海新区博美园7号楼", "label": "不匹配"}, {"text": "莲池区秀兰城市美地7号楼", "label": "不匹配"}]}# paddlenlp库引入import paddlenlp as ppnlpimport paddle.nn.functional as Ffrom paddlenlp.datasets import load_datasetfrom paddlenlp.data import Stack, Tuple, Padfrom paddlenlp.transformers import LinearDecayWithWarmupimport numpy as npimport osimport jsonimport timeimport paddleimport paddle.nn.functional as Ffrom paddlenlp.datasets import load_datasetimport paddlenlpfrom functools import partial
train_tmp = []
dev_tmp = []
num = 0
a = []
for line in open('dataset/Xeon3NLP_round1_train_20210524.txt', 'r'):
    t = json.loads(line)
    for j in t['candidate']:
        l = dict()
        l['query'] = str(t['query'])
        l['title'] = str(j['text'])
        a.append(len(l['title']))  # collect title lengths (handy for choosing max_seq_length)
        if j['label'] == '不匹配':
            l['label'] = 0
        elif j['label'] == '完全匹配':
            l['label'] = 2
        else:
            l['label'] = 1
        # split by query: the first 18000 queries go to train, the rest to dev
        if num < 18000:
            train_tmp.append(l)
        else:
            dev_tmp.append(l)
    num += 1
num  # cell output: 20000 -- the training file holds 20000 queries
def read(filename):
    # Wrap the in-memory lists so load_dataset can consume them
    for line in filename:
        yield {'query': line['query'], 'title': line['title'], 'label': line['label']}

train_ds = load_dataset(read, filename=train_tmp, lazy=False)
dev_ds = load_dataset(read, filename=dev_tmp, lazy=False)

# Peek at the first few examples
for idx, example in enumerate(train_ds):
    if idx <= 5:
        print(example)
# Concatenate the query and title of one raw example and convert the text into
# id sequences with the pretrained model's tokenizer; returns input_ids and
# token_type_ids (plus the label during training)
def convert_example(example, tokenizer, max_seq_length=80, is_test=False):
    query, title = example["query"], example["title"]
    encoded_inputs = tokenizer(
        text=query, text_pair=title, max_seq_len=max_seq_length)
    input_ids = encoded_inputs["input_ids"]
    token_type_ids = encoded_inputs["token_type_ids"]
    if not is_test:
        label = np.array([example["label"]], dtype="int64")
        return input_ids, token_type_ids, label
    else:
        # At prediction or evaluation time, no label field is returned
        return input_ids, token_type_ids
pretrained_model = paddlenlp.transformers.RobertaModel.from_pretrained('roberta-wwm-ext')
tokenizer = paddlenlp.transformers.RobertaTokenizer.from_pretrained('roberta-wwm-ext')

{'query': '河南省巩义市新华路街道办事处桐和街6号钢苑新区3号楼一单元', 'title': '巩义市桐和街', 'label': 0}
{'query': '河南省巩义市新华路街道办事处桐和街6号钢苑新区3号楼一单元', 'title': '桐和街依家小店', 'label': 0}
{'query': '河南省巩义市新华路街道办事处桐和街6号钢苑新区3号楼一单元', 'title': '桐和街CHANG六LIULIU', 'label': 0}
{'query': '河南省巩义市新华路街道办事处桐和街6号钢苑新区3号楼一单元', 'title': '桐和街佳乐钢琴', 'label': 0}
{'query': '河南省巩义市新华路街道办事处桐和街6号钢苑新区3号楼一单元', 'title': '世博领秀城南门桐和街囍饭食堂', 'label': 0}
{'query': '老垅坡高家组省建五公司', 'title': '高家巷省建五公司岳阳分公司', 'label': 2}
[2021-07-06 20:50:19,284] [    INFO] - Already cached /home/aistudio/.paddlenlp/models/roberta-wwm-ext/roberta_chn_base.pdparams
[2021-07-06 20:50:26,003] [ INFO] - Found /home/aistudio/.paddlenlp/models/roberta-wwm-ext/vocab.txt
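With the tokenizer loaded, a quick sanity check of convert_example is possible. This snippet is mine, not from the original notebook; the exact id values depend on the roberta-wwm-ext vocabulary:

# Hypothetical sanity check: encode the first training example
ids, type_ids, label = convert_example(train_ds[0], tokenizer, max_seq_length=80)
print(len(ids), len(type_ids), label)
# Expect two id sequences of equal length ([CLS] query [SEP] title [SEP]) and label array([0])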
import paddle.nn as nn

class PointwiseMatching(nn.Layer):
    # pretrained_model is initialized here with the RoBERTa-wwm-ext weights loaded above
    def __init__(self, pretrained_model, dropout=None):
        super().__init__()
        self.ptm = pretrained_model
        self.dropout = nn.Dropout(dropout if dropout is not None else 0.1)
        # three-way classification head: 不匹配 / 部分匹配 / 完全匹配
        self.classifier = nn.Linear(self.ptm.config["hidden_size"], 3)

    def forward(self,
                input_ids,
                token_type_ids=None,
                position_ids=None,
                attention_mask=None):
        # classify on the pooled [CLS] representation
        _, cls_embedding = self.ptm(input_ids, token_type_ids, position_ids,
                                    attention_mask)
        cls_embedding = self.dropout(cls_embedding)
        logits = self.classifier(cls_embedding)
        return logits

model = PointwiseMatching(pretrained_model)
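A short smoke test (again not in the original notebook) can confirm the output shape: one row per sample, three logits per row. The token ids below are placeholders, not real vocabulary entries:

# Dummy forward pass with placeholder ids (assumes the ids are within vocab range)
fake_ids = paddle.to_tensor([[1, 5, 6, 2], [1, 7, 8, 2]], dtype="int64")
fake_type_ids = paddle.zeros_like(fake_ids)
print(model(fake_ids, fake_type_ids).shape)  # [2, 3]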
# convert_example was already defined above; give it default arguments for reuse

# Sample-conversion function for the train and dev sets
trans_func = partial(
    convert_example,
    tokenizer=tokenizer,
    max_seq_length=80)  # 512

# Each training sample yields the 3 fields input_ids, token_type_ids and label,
# so the collate function combines one batching op per field
batchify_fn = lambda samples, fn=Tuple(
    Pad(axis=0, pad_val=tokenizer.pad_token_id),       # input_ids
    Pad(axis=0, pad_val=tokenizer.pad_token_type_id),  # token_type_ids
    Stack(dtype="int64")                               # label
): [data for data in fn(samples)]

# Define train_data_loader on top of train_ds.
# The DistributedBatchSampler shards the training data automatically
# and supports multi-card training.
batch_sampler = paddle.io.DistributedBatchSampler(train_ds, batch_size=400, shuffle=True)
train_data_loader = paddle.io.DataLoader(
    dataset=train_ds.map(trans_func),
    batch_sampler=batch_sampler,
    collate_fn=batchify_fn,
    return_list=True)

# Evaluation runs on a single card, so a plain paddle.io.BatchSampler suffices
batch_sampler = paddle.io.BatchSampler(dev_ds, batch_size=400, shuffle=False)
dev_data_loader = paddle.io.DataLoader(
    dataset=dev_ds.map(trans_func),
    batch_sampler=batch_sampler,
    collate_fn=batchify_fn,
    return_list=True)

batch_size = 16
# peak learning rate during training
learning_rate = 2e-5
# number of training epochs
epochs = 30
# proportion of steps used for learning-rate warmup
warmup_proportion = 0.1
# weight-decay coefficient, a regularization term that helps avoid overfitting
weight_decay = 1e-4
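To see what the collate function above actually does, here is a toy illustration (my example, not part of the original pipeline): Pad right-pads the variable-length id lists with the pad token, and Stack stacks the labels into one array.

# Toy batch of two samples with different lengths
samples = [([1, 2, 3], [0, 0, 0], [0]),   # 3 tokens, label 0
           ([1, 2],    [0, 0],    [2])]   # 2 tokens, label 2
ids, seg, labels = batchify_fn(samples)
print(ids)     # [[1 2 3] [1 2 0]] -- short row padded with pad_token_id (assumed 0 here)
print(labels)  # [[0] [2]]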
num_training_steps = len(train_data_loader) * epochs
lr_scheduler = LinearDecayWithWarmup(learning_rate, num_training_steps, warmup_proportion)
optimizer = paddle.optimizer.AdamW(
    learning_rate=lr_scheduler,
    parameters=model.parameters(),
    weight_decay=weight_decay,
    # apply weight decay to every parameter except bias and LayerNorm weights
    apply_decay_param_fun=lambda x: x in [
        p.name for n, p in model.named_parameters()
        if not any(nd in n for nd in ["bias", "norm"])
    ])
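For intuition, LinearDecayWithWarmup ramps the learning rate linearly from 0 to the peak value over the warmup steps, then decays it linearly back to 0 by the end of training (with warmup proportion 0, as in the cell below, it is a pure linear decay). A simplified sketch of the schedule, my re-implementation for illustration rather than PaddleNLP's actual code:

# Simplified sketch of what LinearDecayWithWarmup computes at each step
def sketch_lr(step, base_lr, total_steps, warmup_prop):
    warmup_steps = int(total_steps * warmup_prop)
    if step < warmup_steps:
        return base_lr * step / warmup_steps  # linear ramp-up
    # linear decay down to 0 at total_steps
    return base_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))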
criterion = paddle.nn.loss.CrossEntropyLoss()
metric = paddle.metric.Accuracy()

# This second cell overrides the settings above with the configuration
# actually used for training: 10 epochs, peak lr 5e-5 and no warmup
epochs = 10
num_training_steps = len(train_data_loader) * epochs

# learning-rate scheduler that adjusts the lr over the course of training
lr_scheduler = LinearDecayWithWarmup(5E-5, num_training_steps, 0.0)

# Generate the parameter names that receive weight decay;
# all bias and LayerNorm parameters are excluded.
decay_params = [
    p.name for n, p in model.named_parameters()
    if not any(nd in n for nd in ["bias", "norm"])
]

# define the optimizer
optimizer = paddle.optimizer.AdamW(
    learning_rate=lr_scheduler,
    parameters=model.parameters(),
    weight_decay=0.0,
    apply_decay_param_fun=lambda x: x in decay_params)

# cross-entropy loss
criterion = paddle.nn.loss.CrossEntropyLoss()
# accuracy is the evaluation metric
metric = paddle.metric.Accuracy()

# add log output via VisualDL
from visualdl import LogWriter
writer = LogWriter("./log")

# The model is evaluated on the dev set during training,
# so define the evaluation function first
@paddle.no_grad()
def evaluate(model, criterion, metric, data_loader, phase="dev"):
    model.eval()
    metric.reset()
    losses = []
    for batch in data_loader:
        input_ids, token_type_ids, labels = batch
        probs = model(input_ids=input_ids, token_type_ids=token_type_ids)
        loss = criterion(probs, labels)
        losses.append(loss.numpy())
        correct = metric.compute(probs, labels)
        metric.update(correct)
    accu = metric.accumulate()
    print("eval {} loss: {:.5}, accu: {:.5}".format(phase,
                                                    np.mean(losses), accu))
    # log eval metrics
    writer.add_scalar(tag="eval/loss", step=global_step, value=np.mean(losses))
    writer.add_scalar(tag="eval/acc", step=global_step, value=accu)
    model.train()
    metric.reset()
    return accu

# Now start training proper
save_dir = "checkpoint"
os.makedirs(save_dir, exist_ok=True)  # exist_ok added so reruns don't fail
global_step = 0
tic_train = time.time()
pre_accu = 0
for epoch in range(1, epochs + 1):
    for step, batch in enumerate(train_data_loader, start=1):
        input_ids, token_type_ids, labels = batch
        probs = model(input_ids=input_ids, token_type_ids=token_type_ids)
        loss = criterion(probs, labels)
        correct = metric.compute(probs, labels)
        metric.update(correct)
        acc = metric.accumulate()

        global_step += 1
        # print training metrics every 10 steps
        if global_step % 10 == 0:
            print(
                "global step %d, epoch: %d, batch: %d, loss: %.5f, accu: %.5f, speed: %.2f step/s"
                % (global_step, epoch, step, loss, acc,
                   10 / (time.time() - tic_train)))
            tic_train = time.time()
        loss.backward()
        optimizer.step()
        lr_scheduler.step()
        optimizer.clear_grad()

        # evaluate on the dev set every 100 steps
        if global_step % 100 == 0:
            accu = evaluate(model, criterion, metric, dev_data_loader, "dev")
            # log training metrics
            writer.add_scalar(tag="train/loss", step=global_step, value=loss)
            writer.add_scalar(tag="train/acc", step=global_step, value=acc)
            if accu > pre_accu:
                # save the parameters whenever dev accuracy improves
                save_param_path = os.path.join(save_dir, 'model_state.pdparams')
                paddle.save(model.state_dict(), save_param_path)
                # tokenizer.save_pretrained(save_dir)
                pre_accu = accu

global step 510, epoch: 3, batch: 76, loss: 0.31065, accu: 0.87175, speed: 0.34 step/s
global step 520, epoch: 3, batch: 86, loss: 0.36163, accu: 0.87113, speed: 0.64 step/s
global step 530, epoch: 3, batch: 96, loss: 0.27322, accu: 0.87517, speed: 0.64 step/s
global step 540, epoch: 3, batch: 106, loss: 0.38036, accu: 0.87538, speed: 0.65 step/s
global step 550, epoch: 3, batch: 116, loss: 0.30168, accu: 0.87545, speed: 0.66 step/s
global step 560, epoch: 3, batch: 126, loss: 0.34728, accu: 0.87667, speed: 0.66 step/s
global step 570, epoch: 3, batch: 136, loss: 0.28104, accu: 0.87754, speed: 0.65 step/s
global step 580, epoch: 3, batch: 146, loss: 0.30536, accu: 0.87803, speed: 0.68 step/s
global step 590, epoch: 3, batch: 156, loss: 0.35877, accu: 0.87739, speed: 0.65 step/s
global step 600, epoch: 3, batch: 166, loss: 0.39249, accu: 0.87828, speed: 0.65 step/s
eval dev loss: 0.47208, accu: 0.83651
# After training, persist the tokenizer and reload the best checkpoint
# save_dir = os.path.join("checkpoint", "model_%d" % global_step)
# os.makedirs(save_dir)
# save_param_path = os.path.join(save_dir, 'model_state111.pdparams')
# paddle.save(model.state_dict(), save_param_path)
tokenizer.save_pretrained(save_dir)
model_dict = paddle.load('checkpoint/model_state.pdparams')
model.set_dict(model_dict)

# The test data carries no labels, so this collate function only pads the two id fields
batchify_test_fn = lambda samples, fn=Tuple(
    Pad(axis=0, pad_val=tokenizer.pad_token_id),       # input_ids
    Pad(axis=0, pad_val=tokenizer.pad_token_type_id),  # token_type_ids
): [data for data in fn(samples)]

num = 0
fo = open("submit_addr_match_runid4.txt", "w")
for line in open('dataset/Xeon3NLP_round1_test_20210524.txt', 'r', encoding='utf8'):
    t = json.loads(line)
    for j in range(len(t['candidate'])):
        tmp = []
        l = dict()
        l['query'] = t['query']
        l['title'] = t['candidate'][j]['text']
        l['label'] = 0  # placeholder; the prediction overwrites the candidate label below
        input_ids, token_type_ids, label = convert_example(l, tokenizer, max_seq_length=80)
        tmp.append((input_ids, token_type_ids))
        input_ids, token_type_ids = batchify_test_fn(tmp)
        input_ids = paddle.to_tensor(input_ids)
        token_type_ids = paddle.to_tensor(token_type_ids)
        logits = model(input_ids, token_type_ids)
        idx = paddle.argmax(logits, axis=1).numpy()
        if idx[0] == 0:
            t['candidate'][j]['label'] = '不匹配'
        elif idx[0] == 1:
            t['candidate'][j]['label'] = '部分匹配'
        else:
            t['candidate'][j]['label'] = '完全匹配'
    # if num < 10:
    #     print(t)
    #     print(t['candidate'][j]['label'])
    if num % 500 == 0:
        print(num)
    fo.write(json.dumps(t, ensure_ascii=False))
    fo.write('\n')
    num += 1
print(40 * '*')
print(f"num: {num}")
fo.close()

0 500 1000 1500 2000 2500 3000 3500 4000 4500 5000 5500 6000 6500 7000 7500 8000 8500 9000 9500 10000 10500 11000 11500 12000 12500 13000 13500 14000 14500
****************************************
num: 15000
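The loop above scores candidates one at a time, so a query with five candidates costs five forward passes. A hedged sketch of a batched variant (the helper name predict_line is mine, not from the original) that scores all candidates of one query in a single pass:

# Hypothetical variant: score every candidate of one query in a single batch
label_names = ['不匹配', '部分匹配', '完全匹配']

def predict_line(t):
    # encode each candidate, then run one batched forward pass
    samples = [convert_example({'query': t['query'], 'title': c['text']},
                               tokenizer, max_seq_length=80, is_test=True)
               for c in t['candidate']]
    input_ids, token_type_ids = batchify_test_fn(samples)
    logits = model(paddle.to_tensor(input_ids), paddle.to_tensor(token_type_ids))
    for c, i in zip(t['candidate'], paddle.argmax(logits, axis=1).numpy()):
        c['label'] = label_names[i]
    return t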
import json
import linecache

def check(submit_path, test_path, max_num=15000):
    '''
    :param submit_path: path of the submitted file
    :param test_path: path of the original test data
    :param max_num: number of test examples
    :return:
    '''
    N = 0
    with open(submit_path, 'r', encoding='utf-8') as fs:
        for line in fs:
            try:
                line = line.strip()
                if line == '':
                    continue
                N += 1
                smt_jsl = json.loads(line)
                test_jsl = json.loads(linecache.getline(test_path, N))
                if set(smt_jsl.keys()) != {'text_id', 'query', 'candidate'}:
                    raise AssertionError(f'All keys of the submitted JSON must match the evaluation data! {line}')
                elif smt_jsl['text_id'] != test_jsl['text_id']:
                    raise AssertionError(f'text_id must match the evaluation data; do not reorder the lines! text_id: {smt_jsl["text_id"]}')
                elif smt_jsl['query'] != test_jsl['query']:
                    raise AssertionError(f'The query must match the evaluation data! text_id: {smt_jsl["text_id"]}, query: {smt_jsl["query"]}')
                elif len(smt_jsl['candidate']) != len(test_jsl['candidate']):
                    raise AssertionError(f'The number of candidates must match the evaluation data! text_id: {smt_jsl["text_id"]}')
                else:
                    for smt_cand, test_cand in zip(smt_jsl['candidate'], test_jsl['candidate']):
                        if smt_cand['text'] != test_cand['text']:
                            raise AssertionError(f'Candidate content and order must match the evaluation data! text_id: {smt_jsl["text_id"]}, candidate_item: {smt_cand["text"]}')
            except json.decoder.JSONDecodeError:
                raise AssertionError(f'Please check that the submission is valid JSON (e.g. keys and values use double quotes)! {line}')
            except KeyError:
                raise AssertionError(f'All keys of the submitted JSON must match the evaluation data! {line}')
    if N != max_num:
        raise AssertionError(f'The test set must be submitted in full ({max_num} lines); do not drop or add data!')
    print('Well Done !!')

check('submit_addr_match_runid4.txt', 'dataset/Xeon3NLP_round1_test_20210524.txt')

Well Done !!
After roughly 2 epochs of training, the submission scores about 0.76.
This concludes the walkthrough of a PaddleNLP baseline for the CCKS2021 Chinese NLP address-relevance competition.