This walkthrough covers a Chinese duplicate-question identification competition, building a recognizer on the ERNIE model. It reads roughly 5,000 training samples, encodes each question pair, and splits the data into training and validation sets. It then defines the model hyperparameters, trains in mini-batches while computing the loss and optimizing, and evaluates accuracy on the validation set. Finally, the trained model predicts on the test set and produces a submission file.

A question-answering system has three main parts: question understanding, information retrieval, and answer extraction. Question understanding is the first and one of the most critical stages, and it has broad applications such as duplicate-comment identification and similar-question identification.
Duplicate-question detection is a common text-mining task with direct uses in real QA communities: it enables aggregating answers across duplicate questions, recommending existing answers, automatic QA, and more. Because Chinese wording is diverse and flexible, this competition asks participants to build a duplicate-question identification algorithm.
https://challenge.xfyun.cn/topic/info?type=chinese-question-similarity
The task asks participants to score the similarity of two questions.
Example: 世界上什么东西最恐怖 (What is the scariest thing in the world?) vs. 世界上最恐怖的东西是什么? (What in the world is the scariest thing?) → 1. Explanation: the two questions ask the same thing, so they are duplicates and the label is 1.
The evaluation metric is accuracy, as computed by sklearn's accuracy_score (the toy call below returns 0.5, since 2 of the 4 predictions match):

from sklearn.metrics import accuracy_score
y_pred = [0, 2, 1, 3]
y_true = [0, 1, 2, 3]
accuracy_score(y_true, y_pred)
!pip install paddle-ernie > log.log
import numpy as np
import paddle as P

# Import ERNIE and run a quick encoding test
from ernie.tokenizing_ernie import ErnieTokenizer
from ernie.modeling_ernie import ErnieModel

# Try to get pretrained model from server, make sure you have network connection
model = ErnieModel.from_pretrained('ernie-1.0')
model.eval()
tokenizer = ErnieTokenizer.from_pretrained('ernie-1.0')
ids, _ = tokenizer.encode('hello world')
ids = P.to_tensor(np.expand_dims(ids, 0))  # insert extra `batch` dimension
pooled, encoded = model(ids)  # eager execution
print(pooled.numpy())

On first use the pretrained weights are downloaded from the server; the printed `pooled` output is a 768-dimensional sentence vector with values in [-1, 1] (truncated here):

[[-1.         -1.          0.9947966  -0.99986964 ... -0.9997338  -0.99937415]]
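Although the walkthrough stops at printing the vector, the pooled output already supports a rough, untrained similarity check between two sentences. Here is a minimal sketch (my addition, assuming the `model` and `tokenizer` loaded above) that scores a pair by cosine similarity of their pooled vectors:

def pooled_vec(text):
    # Encode a single sentence and return its pooled [CLS] vector as a numpy array
    ids, _ = tokenizer.encode(text)
    ids = P.to_tensor(np.expand_dims(ids, 0))  # add the batch dimension
    pooled, _ = model(ids)
    return pooled.numpy()[0]

a = pooled_vec('世界上什么东西最恐怖')
b = pooled_vec('世界上最恐怖的东西是什么?')
print(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))  # cosine similarity in [-1, 1]

Fine-tuning, which the rest of the walkthrough performs, is what turns this generic representation into a reliable duplicate detector.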
import sys
import numpy as np
import pandas as pd
from sklearn.metrics import f1_score
import paddle as P
from ernie.tokenizing_ernie import ErnieTokenizer
from ernie.modeling_ernie import ErnieModelForSequenceClassification
# Read the dataset: tab-separated question pairs with a 0/1 label
train_df = pd.read_csv('train.csv', sep='\t', names=['question1', 'question2', 'label'])
train_df = train_df.sample(frac=1.0)  # shuffle all rows
train_df.head()

      question1           question2                            label
2570  关于西游记的问题      关于《西游记》的一些问题                1
4635  星天牛吃什么呀?      天牛是什么                            0
1458  亲稍等我帮您查询下   亲请您稍等一下这边帮您核实一下可以吗   1
4750  备份怎么下载         云备份怎么下载                        1
2402  最近有什么战争?      最近有什么娱乐新闻                    0
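Before encoding, a quick look at class balance and text lengths is worthwhile (a sketch I am adding, assuming `train_df` as loaded above); it also motivates the MAX_SEQLEN value chosen below:

# Sketch: label balance and combined question lengths
print(train_df['label'].value_counts(normalize=True))
lengths = train_df['question1'].str.len() + train_df['question2'].str.len()
print(lengths.describe())  # if most pairs fit well under 72 characters, MAX_SEQLEN=72 loses little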
# Encode a sentence pair
tokenizer = ErnieTokenizer.from_pretrained('ernie-1.0')
tokenizer.encode('泰囧完整版下载', 'エウテルペ完整版下载')

The call returns the token ids of both sentences joined with special tokens, plus segment ids marking which sentence each token belongs to:

(array([    1,  1287,  4663,   328,   407,   511,    86,   763,     2,
        17963, 17963, 17963, 17963, 17963,   328,   407,   511,    86,
          763,     2]),
 array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]))

# Model hyperparameters
BATCH = 16
MAX_SEQLEN = 72
LR = 5e-5
EPOCH = 1

# Text classification model
ernie = ErnieModelForSequenceClassification.from_pretrained('ernie-1.0', num_labels=2)
optimizer = P.optimizer.Adam(LR, parameters=ernie.parameters())
tokenizer = ErnieTokenizer.from_pretrained('ernie-1.0')

Loading emits two expected warnings: the classification head is newly initialized, so its parameters are not in the pretrained checkpoint:

[WARNING] 2021-12-02 21:05:35,518 [modeling_ernie.py: 296]: param:classifier.weight not set in pretrained model, skip
[WARNING] 2021-12-02 21:05:35,519 [modeling_ernie.py: 296]: param:classifier.bias not set in pretrained model, skip

# Encode every sentence pair in the training set
def make_data(df):
    data = []
    for i, row in enumerate(df.iterrows()):
        # Encode the pair, truncate to MAX_SEQLEN, then zero-pad to a fixed length
        text_id, _ = tokenizer.encode(row[1].question1, row[1].question2)
        text_id = text_id[:MAX_SEQLEN]
        text_id = np.pad(text_id, [0, MAX_SEQLEN - len(text_id)], mode='constant')
        data.append((text_id, row[1].label))
    return data
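As a quick sanity check (my addition, reusing the objects defined above), every sample produced by make_data should be a fixed-length id vector paired with its 0/1 label:

# Sketch: one encoded sample is a (MAX_SEQLEN,) id array plus its label
sample_ids, sample_label = make_data(train_df.iloc[:1])[0]
print(sample_ids.shape, sample_label)  # expected: (72,) and 0 or 1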
train_data = make_data(train_df.iloc[:-1000])
val_data = make_data(train_df.iloc[-1000:])

# Assemble the i-th batch
def get_batch_data(data, i):
    d = data[i * BATCH: (i + 1) * BATCH]
    feature, label = zip(*d)
    feature = np.stack(feature)  # stack BATCH samples into a single numpy.array
    label = np.stack(list(label))
    feature = P.to_tensor(feature)  # convert the numpy.array into a paddle tensor
    label = P.to_tensor(label)
    return feature, label

# Model training and validation
for i in range(EPOCH):
    np.random.shuffle(train_data)  # reshuffle each epoch for best training results
    ernie.train()
    for j in range(len(train_data) // BATCH):
        feature, label = get_batch_data(train_data, j)
        loss, _ = ernie(feature, labels=label)
        loss.backward()
        optimizer.minimize(loss)
        ernie.clear_gradients()
        if j % 50 == 0:
            print('Train %d: loss %.5f' % (j, loss.numpy()))
        # Validation
        if j % 100 == 0:
            all_pred, all_label = [], []
            with P.no_grad():
                ernie.eval()
                # use a fresh loop variable here; the original reused `j`,
                # which clobbered the training-loop counter
                for k in range(len(val_data) // BATCH):
                    feature, label = get_batch_data(val_data, k)
                    loss, logits = ernie(feature, labels=label)
                    all_pred.extend(logits.argmax(-1).numpy())
                    all_label.extend(label.numpy())
                ernie.train()
            acc = (np.array(all_label) == np.array(all_pred)).astype(np.float32).mean()
            print('Val acc %.5f' % acc)

Training log:

Train 0: loss 0.70721
Val acc 0.54335
Train 50: loss 0.27199
Train 100: loss 0.35329
Val acc 0.88004
Train 150: loss 0.16315
Train 200: loss 0.08876
Val acc 0.89113
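The imports include f1_score, which the loop above never uses; as a small sketch (reusing the all_label and all_pred lists left over from the last validation pass), F1 can be reported alongside accuracy, which is more informative when the classes are imbalanced:

# Sketch: F1 on the most recent validation predictions
print('Val f1 %.5f' % f1_score(all_label, all_pred))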
test_df = pd.read_csv('test.csv', sep='\t', names=['question1', 'question2'])
test_df['label'] = 0  # placeholder labels so make_data and get_batch_data can be reused
test_data = make_data(test_df.iloc[:])

# Model prediction
all_pred, all_label = [], []
with P.no_grad():
    ernie.eval()
    # round up so the final partial batch is not dropped
    # (the original `len(test_data) // BATCH` could leave the submission short)
    for j in range((len(test_data) + BATCH - 1) // BATCH):
        feature, label = get_batch_data(test_data, j)
        loss, logits = ernie(feature, labels=label)
        all_pred.extend(logits.argmax(-1).numpy())
        all_label.extend(label.numpy())

pd.DataFrame({
    'label': all_pred,
}).to_csv('submit.csv', index=None)
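As a final check (my addition), the submission should contain exactly one prediction per test row; the rounded-up batch count above is what guarantees this:

# Sketch: confirm the submission lines up with the test set
sub = pd.read_csv('submit.csv')
assert len(sub) == len(test_df), (len(sub), len(test_df))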
That concludes the iFLYTEK Chinese Question Similarity Challenge walkthrough: ERNIE-NSP with an accuracy of 0.866.