This article walks through the high-value ad conversion prediction competition: the background, the data, the scoring criteria, and a solution approach. The task is to build a conversion-rate (CVR) estimation model from de-identified user and ad data. The solution processes each record into a one-dimensional vector, builds a DeepFM network, and includes the relevant code; submissions are ultimately judged on the auc and pcoc metrics.

With the growth of the internet, digital marketing has become one of the main channels through which businesses acquire potential customers and users find the products and services they want. In digital marketing, CVR estimation lets businesses target ads more precisely and reduces the amount of irrelevant information shown to users, improving the overall experience. This task provides a baseline model to help participants get started quickly on improving model performance.
PS: This is also my first attempt at the data processing and demo implementation for a recommender-system task. If you run into any problems, please leave a comment, or reach me on WeChat (same nickname).
The competition provides N days of sampled ad log data (already de-identified):
This task provides large-scale real data from Baidu's ad-serving scenario. Participants are expected to use machine learning to build a CVR estimation model that, given user and ad information, predicts the conversion probability: pcvr = p(cv=1 | user, ad).
Data format: each line is one ad click, with 5 fields separated by \t:
| Field | Description |
|---|---|
| log_id | sample id |
| t1 | conversion type 1 |
| t2 | conversion type 2 |
| t3 | conversion type 3 |
| features | user and ad information |
Sample layouts:
1. `logid \t 1 \t - \t - \t features`: conversion type 1, positive sample
2. `logid \t - \t 1 \t - \t features`: conversion type 2, positive sample
3. `logid \t - \t - \t 1 \t features`: conversion type 3, positive sample
4. `logid \t 0 \t - \t - \t features`: conversion type 1, negative sample
5. `logid \t - \t 0 \t - \t features`: conversion type 2, negative sample
6. `logid \t - \t - \t 0 \t features`: conversion type 3, negative sample
example:
```
3601136181 - - 0 704:1 227:2 4630318:3 4630319:4 5:5 6:6 319:7 8:8 75:9 1715101:10 63328:10 412248:10 263656:10 4630320:10 150154:10 412250:10 63336:10 465:11 21:13 22:14 29442:15 24:16 630457:17 4630321:17 776254:17 4630322:17 325042:17 4630323:17 238029:17 325041:17 1690121:17 4630324:17 31148:18 3462866:18 174535:18 174536:18 675808:18 675809:18 131969:18 36517:18 4259795:18 616316:18 2085787:19 30093:20 31406:20 31407:20 63351:20 46:21 4630325:22 4630326:22 4630327:22 4630328:22 4285695:22 4630329:22 4630330:22 4630331:22 4630332:22 4630333:22 2431996:22 4630334:22 1037304:22 4630335:22 1984706:22 4630336:22 2645081:22 816199:22 4630337:22 4630338:22 2085788:23 3161:24 3742:25 150:26 601:26 182:26 62:26 184:26 185:26 345:26 155:26 156:26 1258:26 158:26 70:26
```
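To make the layout concrete, here is a small parsing sketch (a hypothetical helper of my own, not part of the competition code) that splits one log line into its parts. Note that each feature token is `feasign:slot`, i.e. the feature id comes first and the field number (1~26) second:

```python
# Hypothetical helper illustrating the line layout described above.
def parse_line(line):
    log_id, t1, t2, t3, feature_str = line.rstrip("\n").split("\t")
    # Exactly one of t1/t2/t3 carries 0 or 1; "-" marks the unused types.
    types = {"t1": t1, "t2": t2, "t3": t3}
    conv_type = next(k for k, v in types.items() if v != "-")
    label = int(types[conv_type])
    # Feature tokens are "feasign:slot" pairs, id first, field number second.
    features = [tuple(map(int, tok.split(":"))) for tok in feature_str.split(" ")]
    return log_id, conv_type, label, features
```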
The feature field contains 26 distinct features, categorized as follows:
| Field ids | Description |
|---|---|
| 1~13 | basic user information |
| 14~16 | context information |
| 17~24 | ad information |
| 25~26 | multi-valued user information |
Note: the competition has a preliminary round and a final round. The preliminary test set is released with this project; the final-round dataset will be released when the final round begins.
All samples are sorted chronologically; user and ad information has been de-identified into ids.
Crossed (combined) features and multi-objective modeling can help improve model performance; an illustrative sketch follows below.
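For instance, the multi-objective hint could be realized with a shared bottom and one sigmoid head per conversion type. This is purely an illustrative sketch under my own assumptions (`MultiTaskHead` and `in_dim` are hypothetical names), not the baseline's architecture:

```python
# Illustrative multi-task head: shared bottom, three per-type sigmoid towers.
import paddle
import paddle.nn as nn

class MultiTaskHead(nn.Layer):
    def __init__(self, in_dim, num_tasks=3):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU())
        self.towers = nn.LayerList([nn.Linear(128, 1) for _ in range(num_tasks)])

    def forward(self, x):
        h = self.shared(x)
        # One pcvr per conversion type; each head would be trained only on
        # the samples whose label for that type is present.
        return [paddle.nn.functional.sigmoid(t(h)) for t in self.towers]
```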
The final ranking is by overall auc across samples of all conversion types.
Only submissions with 0.90 <= pcoc <= 1.10 are eligible for the leaderboard. The primary ranking metric is auc; ties on auc are broken by abs(pcoc - 1), ascending. Metric definitions:
auc (Area Under Curve): the area under the ROC curve; the closer to 1, the better.
pcoc (predicted CVR over CVR): predicted conversion rate / actual conversion rate; the closer to 1, the better.
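As a quick reference, here is a minimal sketch of how the two metrics can be computed offline; the helper names and the arrays `y_true` and `y_pred` are my own, not from any official evaluation toolkit:

```python
# Minimal offline metric sketch (my own helpers, not the official scorer).
import numpy as np
from sklearn.metrics import roc_auc_score

def auc(y_true, y_pred):
    # Area under the ROC curve; closer to 1 is better.
    return roc_auc_score(y_true, y_pred)

def pcoc(y_true, y_pred):
    # Mean predicted CVR over observed CVR; closer to 1 is better.
    return float(np.mean(y_pred) / np.mean(y_true))
```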
```
# Takes about 45-50 s; only needs to run once
!unzip /home/aistudio/data/data213105/train_test.zip -d ./
```
Each record is processed into a simple, crude one-dimensional vector.
The detailed logic is in:
queuedataset_reader.py
dataset.py
```python
# queuedataset_reader.py
import paddle.distributed.fleet as fleet


class Reader(fleet.MultiSlotDataGenerator):
    def __init__(self):
        padding = 0
        # No dense features: every slot is sparse.
        sparse_slots = "log_key click 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26"
        self.sparse_slots = sparse_slots.strip().split(" ")
        self.slots = self.sparse_slots
        self.slot2index = {}
        self.visit = {}
        for i in range(len(self.slots)):
            self.slot2index[self.slots[i]] = i
            self.visit[self.slots[i]] = False
        self.padding = padding

    def line_process(self, line):
        items = line.strip("\n").split("\t")
        log_key = int(items[0])
        conv1 = items[1]
        conv2 = items[2]
        conv3 = items[3]
        # Collapse the three conversion types into one binary label.
        if conv1 == "1" or conv2 == "1" or conv3 == "1":
            conv = 1
        else:
            conv = 0
        output = [(i, []) for i in self.slots]
        feasigns = items[4].split(" ")
        for i in feasigns:
            slot_feasign = i.split(":")
            slot = slot_feasign[1]
            if slot not in self.slots:
                continue
            if slot in self.sparse_slots:
                feasign = int(slot_feasign[0])
            else:
                feasign = float(slot_feasign[0])
            output[self.slot2index[slot]][1].append(feasign)
            self.visit[slot] = True
        output[0][1].append(log_key)
        self.visit['log_key'] = True
        output[1][1].append(conv)
        self.visit['click'] = True
        # Pad slots that did not appear in this line; reset the rest.
        for i in self.visit:
            slot = i
            if not self.visit[slot]:
                output[self.slot2index[i]][1].extend([self.padding])
            else:
                self.visit[slot] = False
        # Crudely flatten multi-valued slots to a single id by averaging.
        new_output = []
        for i in range(len(output)):
            if len(output[i][1]) > 1:
                new_output.append(int(sum(output[i][1]) / len(output[i][1])))
            else:
                new_output.append(output[i][1][0])
        return new_output

    def generate_sample(self, line):
        r"Dataset Generator"

        def reader():
            output_dict = self.line_process(line)
            yield output_dict

        return reader


if __name__ == '__main__':
    reader = Reader()
    with open('train_data_small/file_01.txt') as f:
        for line in f:
            print(reader.line_process(line))
            break
```

```python
# dataset.py
import collections
import random

import paddle
from tqdm import tqdm

import queuedataset_reader as reader


class MyDataset(paddle.io.Dataset):
    def __init__(self, file_paths, train_ratio=0.8):
        super(MyDataset, self).__init__()
        self.file_paths = file_paths
        self.reader = reader.Reader()
        self.total_samples = 0
        self.train_data = []
        self.val_data = []
        self.train_ratio = train_ratio
        self._load_data()

    def _load_data(self):
        for file_path in tqdm(self.file_paths, desc='load file and data', ncols=100):
            samples = self._load_file(file_path)
            self.total_samples += len(samples)
            self.train_data.extend(samples)
        self._split_dataset()

    def _load_file(self, file_path):
        samples = []
        with open(file_path) as f:
            for line in f:
                sample = self.reader.line_process(line)
                feature = sample[2:]  # the 26 feature slots
                label = sample[1]     # the binary conversion label
                samples.append((feature, label))
        return samples

    def _split_dataset(self):
        num_train = int(self.total_samples * self.train_ratio)
        random.shuffle(self.train_data)
        # Group samples by class so the train/val split is not skewed
        # by the heavy class imbalance.
        class_samples = collections.defaultdict(list)
        for feature, label in self.train_data:
            class_samples[label].append((feature, label))
        num_classes = len(class_samples)
        num_samples_per_class = num_train // num_classes
        train_samples = []
        val_samples = []
        for class_label, samples in class_samples.items():
            num_samples = len(samples)
            num_train_samples = min(num_samples_per_class, num_samples)
            train_samples.extend(random.sample(samples, num_train_samples))
            val_samples.extend(samples[num_train_samples:])
        random.shuffle(train_samples)
        random.shuffle(val_samples)
        # Keep only enough validation samples that train + val == num_train.
        self.train_data = train_samples
        self.val_data = val_samples[:num_train - len(train_samples)]

    def __getitem__(self, idx):
        if idx < len(self.train_data):
            sample = self.train_data[idx]
        else:
            sample = self.val_data[idx - len(self.train_data)]
        return sample[0], sample[1]

    def __len__(self):
        return self.total_samples


if __name__ == '__main__':
    file_paths = ['train_data/file_01.txt', 'train_data/file_02.txt']
    dataset = MyDataset(file_paths, train_ratio=0.8)
    print(len(dataset.train_data))
    print(len(dataset.val_data))
    train_loader = paddle.io.DataLoader(dataset.train_data, batch_size=32, shuffle=True, num_workers=0)
    val_loader = paddle.io.DataLoader(dataset.val_data, batch_size=32, shuffle=False, num_workers=0)
    for epoch in range(1):
        for batch_id, data in enumerate(train_loader()):
            x = data[0]
            y = data[1]
            # Stack 26 tensors of shape [32] into one tensor of shape [32, 26].
            for i in range(len(x)):
                x[i] = paddle.unsqueeze(x[i], axis=1)
            x = paddle.concat(x, axis=1)
            print(x.shape)
            # Reshape labels from [32] to [32, 1].
            y = paddle.unsqueeze(y, axis=1)
            print(y.shape)
            print(x)
            print(y)
            break
```
This is just a demo that runs end to end, so I won't dig deeper into the model here; getting it to run is the point.
Of course, I'm not evaluating effectiveness here; this is just a baseline write-up meant to get you through the whole project quickly.
Optimizing the model is up to you. After all, there's plenty of prize money!
The model is in the file:
deepfm.py
```python
# deepfm.py
import math

import paddle
import paddle.nn as nn


class FM(nn.Layer):
    def __init__(self, sparse_feature_number, sparse_feature_dim):
        super(FM, self).__init__()
        self.sparse_feature_number = sparse_feature_number
        self.sparse_feature_dim = sparse_feature_dim
        self.init_value_ = 0.1
        self.embedding = paddle.nn.Embedding(
            self.sparse_feature_number,
            self.sparse_feature_dim,
            sparse=True,
            weight_attr=paddle.ParamAttr(
                name="SparseFmFeatFactors",
                initializer=paddle.nn.initializer.TruncatedNormal(
                    mean=0.0,
                    std=self.init_value_ /
                    math.sqrt(float(self.sparse_feature_dim)))))
        self.bias = paddle.create_parameter(
            shape=[1],
            dtype='float32',
            default_initializer=paddle.nn.initializer.TruncatedNormal(
                mean=0.0,
                std=self.init_value_ /
                math.sqrt(float(self.sparse_feature_dim))))

    def forward(self, sparse_inputs):
        emb = self.embedding(sparse_inputs.astype('int64'))
        # Second-order FM term: 0.5 * ((sum of embeddings)^2 - sum of squared embeddings).
        square_of_sum = paddle.square(paddle.sum(emb, axis=1))
        sum_of_square = paddle.sum(paddle.square(emb), axis=1)
        fm = 0.5 * paddle.sum(square_of_sum - sum_of_square, axis=1, keepdim=True)
        fm = fm + self.bias
        return fm


class DNN(nn.Layer):
    def __init__(self, num_field, layer_sizes, sparse_feature_number, sparse_feature_dim):
        super(DNN, self).__init__()
        self.num_field = num_field
        self.layer_sizes = layer_sizes
        self.sparse_feature_dim = sparse_feature_dim
        self.sparse_feature_number = sparse_feature_number
        self.init_value_ = 0.1
        self.embedding = paddle.nn.Embedding(
            self.sparse_feature_number,
            self.sparse_feature_dim,
            sparse=True,
            weight_attr=paddle.ParamAttr(
                name="SparseDnnFeatFactors",
                initializer=paddle.nn.initializer.TruncatedNormal(
                    mean=0.0,
                    std=self.init_value_ /
                    math.sqrt(float(self.sparse_feature_dim)))))
        sizes = [self.num_field * self.sparse_feature_dim] + self.layer_sizes
        acts = ["relu" for _ in range(len(self.layer_sizes))] + [None]
        self._mlp_layers = []
        for i in range(len(layer_sizes)):
            linear = paddle.nn.Linear(
                in_features=sizes[i],
                out_features=sizes[i + 1],
                weight_attr=paddle.ParamAttr(
                    initializer=paddle.nn.initializer.Normal(
                        std=1.0 / math.sqrt(sizes[i]))))
            self.add_sublayer('linear_%d' % i, linear)
            self._mlp_layers.append(linear)
            if acts[i] == 'relu':
                act = paddle.nn.ReLU()
                self.add_sublayer('act_%d' % i, act)
                self._mlp_layers.append(act)

    def forward(self, sparse_inputs):
        emb = self.embedding(sparse_inputs.astype('int64'))
        # Flatten the [batch, num_field, dim] embeddings for the MLP.
        y_dnn = paddle.reshape(emb, shape=[-1, self.num_field * self.sparse_feature_dim])
        for n_layer in self._mlp_layers:
            y_dnn = n_layer(y_dnn)
        return y_dnn


class DeepFM(nn.Layer):
    def __init__(self, num_field, layer_sizes, sparse_feature_number=88000000, sparse_feature_dim=8):
        super(DeepFM, self).__init__()
        self.num_field = num_field
        self.layer_sizes = layer_sizes
        self.sparse_feature_number = sparse_feature_number
        self.sparse_feature_dim = sparse_feature_dim
        self.fm = FM(self.sparse_feature_number, self.sparse_feature_dim)
        self.dnn = DNN(self.num_field, self.layer_sizes, self.sparse_feature_number, self.sparse_feature_dim)
        self.fc = paddle.nn.Linear(
            in_features=1 + self.layer_sizes[-1],
            out_features=1,
            weight_attr=paddle.ParamAttr(
                initializer=paddle.nn.initializer.Normal(std=1.0 / math.sqrt(1 + self.layer_sizes[-1]))))
        self.add_sublayer('fc', self.fc)

    def forward(self, sparse_inputs):
        # Concatenate the 1-d FM output with the DNN output, then map to a probability.
        fm_out = self.fm(sparse_inputs)
        dnn_out = self.dnn(sparse_inputs)
        concat_out = paddle.concat([fm_out, dnn_out], axis=1)
        y = self.fc(concat_out)
        pred = paddle.nn.functional.sigmoid(y)
        return pred


if __name__ == '__main__':
    model = DeepFM(26, [512, 256, 128, 32], sparse_feature_dim=9)
    paddle.summary(model, (100, 26))
    # data = [[159, 259, 4630199, 4630200, 163, 164, 165, 650, 167, 79129, 3135, 0, 652361, 169, 714906, 24, 0, 1520746, 1000438, 17249, 298, 0, 1000441, 3251, 1686, 181]]
    # data = paddle.to_tensor(data)
    # print(model(data))
```
As for the training script, I trained with the MLP-style model above; feel free to improve or replace it.
(A Tesla A100 32G instance is recommended.) Otherwise, reading the data may run out of memory. This could be optimized, but since the platform offers a 32G option, there is no need to fuss over it; one possible optimization is sketched below.
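If you do want to avoid holding everything in memory, one option (my suggestion, not what this baseline does) is to stream lines with `paddle.io.IterableDataset`; `StreamingDataset` is a hypothetical name:

```python
# Hedged sketch: stream samples instead of materializing them in memory.
import paddle
import queuedataset_reader as reader

class StreamingDataset(paddle.io.IterableDataset):
    def __init__(self, file_paths):
        super().__init__()
        self.file_paths = file_paths
        self.reader = reader.Reader()

    def __iter__(self):
        # Read and parse one line at a time; nothing is kept in memory.
        for path in self.file_paths:
            with open(path) as f:
                for line in f:
                    sample = self.reader.line_process(line)
                    yield sample[2:], sample[1]  # features, label
```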
train.py
```
# A full training run takes a long time; adjust the epoch count to your situation!
!python train.py
```
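train.py itself is not reproduced in this article. For orientation, here is a minimal sketch of what a training loop over the pieces above might look like; this is my reconstruction under stated assumptions (batch size, learning rate, logging interval are all placeholders), not the author's actual script:

```python
# Minimal training-loop sketch (my reconstruction, not the real train.py).
import paddle
from dataset import MyDataset
from deepfm import DeepFM

dataset = MyDataset(['train_data/file_01.txt'], train_ratio=0.8)
loader = paddle.io.DataLoader(dataset.train_data, batch_size=512, shuffle=True)

model = DeepFM(26, [512, 256, 128, 32])
opt = paddle.optimizer.Adam(learning_rate=1e-3, parameters=model.parameters())
bce = paddle.nn.BCELoss()

for epoch in range(1):  # training takes long; tune the epoch count
    for batch_id, (x, y) in enumerate(loader()):
        # Stack the 26 slot tensors into a [batch, 26] input, as in the demo above.
        x = paddle.concat([paddle.unsqueeze(t, axis=1) for t in x], axis=1)
        y = paddle.unsqueeze(y, axis=1).astype('float32')
        pred = model(x)
        loss = bce(pred, y)
        loss.backward()
        opt.step()
        opt.clear_grad()
        if batch_id % 100 == 0:
            print(f"epoch {epoch} batch {batch_id} loss {float(loss):.4f}")
```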
That concludes the detailed walkthrough of the 2023 Baidu Commercial AI Technology Innovation Competition, Track 1 [unofficial baseline].