Training a Multi-Label Classification Model with MultiOutputClassifier

DDD

Published: 2025-08-13 18:08:15

Source: php中文网

Original article

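This article addresses the inconsistent-sample-count error that can occur when training a multi-label classifier with MultiOutputClassifier. By examining the error message and the code, it traces the problem to a mismatch between the number of samples in the training features and in the target variables. It then walks through the steps and code needed to train the model correctly and avoid similar errors.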

Problem analysis

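The error message ValueError: Found input variables with inconsistent numbers of samples: [132, 33] shows that the features and targets reaching LogisticRegression have different sample counts, so training cannot proceed. X_train has shape (33, 132), that is, 33 samples and 132 features. The two numbers in the message are the sample counts of X and y, in that order, so at some internal step X was seen as having 132 rows (the same as the number of feature columns) while y still had 33.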

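According to the problem description, the goal is to train a multi-label classification model that predicts a person's position and posture from coordinates. The dataset contains two labels, class and stand.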
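One way the mismatch described above can arise is sketched below. It assumes, as the later steps suggest, that a numeric DataFrame was passed through a Pipeline containing CountVectorizer. The data is synthetic and the column names (coord0, coord1, ...) are hypothetical; only the shapes (33 rows, 132 columns) follow the error message.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# Hypothetical stand-in for the training split: 33 rows, 132 numeric columns.
rng = np.random.default_rng(0)
columns = [f"coord{i}" for i in range(132)]
X_train = pd.DataFrame(rng.normal(size=(33, 132)), columns=columns)
y_train = pd.DataFrame({
    "class": rng.integers(0, 2, size=33),
    "stand": rng.integers(0, 2, size=33),
})

# Iterating over a DataFrame yields its column labels, so CountVectorizer
# treats each of the 132 column names as one "document".
counts = CountVectorizer(lowercase=False).fit_transform(X_train)
print("Rows produced by CountVectorizer:", counts.shape[0])

# The same pipeline shape as in the commented-out text variant of step 5.
model = Pipeline(steps=[
    ("cv", CountVectorizer(lowercase=False)),
    ("lr_multi", MultiOutputClassifier(LogisticRegression())),
])
try:
    model.fit(X_train, y_train)
    error_message = ""
except ValueError as exc:
    error_message = str(exc)
print(error_message)
```

If this is what happened in the original code, the fix is to drop the vectorizer for numeric data, which is what step 5 does.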

Solution

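Based on the provided answer and the error message, one possible cause lies in how the output of train_test_split is used. The code splits on y, but the error suggests that a variable named y3 may also exist, with a shape that does not match X. The most common causes are using the wrong variable in later code or a mistake in the data-processing steps.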

The corrected code follows, with an explanation of each step:

1. Import the required libraries:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

2. Load the data:

df = pd.read_csv('deadlift.csv')
print(df.head())

3. Separate the features and target variables:

X = df.drop(['class', 'stand'], axis=1)
y = df[['class', 'stand']]
print("Shape of X:", X.shape)
print("Shape of y:", y.shape)

4. Split into training and test sets:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1234)
print("Shape of X_train:", X_train.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_test:", y_test.shape)
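Printing shapes works, but a check that fails loudly can be easier to keep in a script. The following is a minimal sketch using synthetic data in place of deadlift.csv; the coordinate column names, the label values, and the helper assert_same_rows are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data standing in for deadlift.csv: 44 rows, 4 coordinates, 2 labels.
rng = np.random.default_rng(1234)
df = pd.DataFrame(rng.normal(size=(44, 4)), columns=["x1", "y1", "x2", "y2"])
df["class"] = rng.choice(["up", "down"], size=44)
df["stand"] = rng.choice(["narrow", "wide"], size=44)

X = df.drop(["class", "stand"], axis=1)
y = df[["class", "stand"]]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1234
)


def assert_same_rows(features, targets, name):
    """Raise early, with a readable message, if the row counts differ."""
    if len(features) != len(targets):
        raise ValueError(
            f"{name}: X has {len(features)} rows but y has {len(targets)}"
        )


assert_same_rows(X_train, y_train, "train")
assert_same_rows(X_test, y_test, "test")

# train_test_split shuffles X and y together, so their indexes stay aligned.
rows_aligned = X_train.index.equals(y_train.index)
print("Train rows:", len(X_train), "Test rows:", len(X_test), "Aligned:", rows_aligned)
```

The check compares X against y within each split; the training and test sets themselves are expected to differ in size.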

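5. Build and train the model:

The original code used CountVectorizer, which is normally meant for text data. If your data consists of numeric coordinates, CountVectorizer is not needed; train on the data directly.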

# If the data is text, keep CountVectorizer
# model1 = Pipeline(steps=[('cv', CountVectorizer(lowercase=False)), ('lr_multi', MultiOutputClassifier(LogisticRegression()))])

# If the data is numeric, use MultiOutputClassifier directly
model1 = MultiOutputClassifier(LogisticRegression())

model1.fit(X_train, y_train)

6. Predict and evaluate the model:

y_pred = model1.predict(X_test)
print(y_pred)

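Because the model predicts several labels at once, choose evaluation metrics to suit the task, for example Hamming loss, accuracy, precision, recall, or F1-score. Many scikit-learn metrics do not accept multiclass-multioutput targets directly, so when a label column holds strings or more than two classes, compute the metric separately for each label column.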
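Per-label evaluation can be sketched as follows. The label values ("a", "b", "x", "y") and the small arrays are made up for illustration; in practice y_true would be y_test converted to an array and y_hat the output of model1.predict.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true and predicted labels: one row per sample, one column per label.
label_names = ["class", "stand"]
y_true = np.array([["a", "x"], ["b", "y"], ["a", "y"], ["b", "x"]])
y_hat = np.array([["a", "x"], ["b", "x"], ["a", "y"], ["a", "x"]])

# Accuracy computed independently for each label column.
per_label_accuracy = {
    name: accuracy_score(y_true[:, i], y_hat[:, i])
    for i, name in enumerate(label_names)
}

# Fraction of samples for which every label is predicted correctly.
exact_match = float(np.mean(np.all(y_true == y_hat, axis=1)))

print(per_label_accuracy)
print("Exact match ratio:", exact_match)
```

Per-label accuracy shows which output is weaker, while the exact-match ratio is stricter because a sample counts only if both labels are right.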

Complete code example (assuming numeric data):

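import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load the data
df = pd.read_csv('deadlift.csv')

# Separate features and target variables
X = df.drop(['class', 'stand'], axis=1)
y = df[['class', 'stand']]

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1234)

# Build and train the model
model1 = MultiOutputClassifier(LogisticRegression())
model1.fit(X_train, y_train)

# Predict; y_pred is an array with one column per label
y_pred = model1.predict(X_test)

# Evaluate each label separately. For 2-D targets, hamming_loss accepts only
# binary indicator columns and rejects multiclass-multioutput targets.
for i, col in enumerate(y.columns):
    acc = accuracy_score(y_test[col], y_pred[:, i])
    print(f"Accuracy for {col}: {acc}")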

Notes

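Data types: Make sure the data passed to the model has the right type. Text data must be vectorized first with CountVectorizer or TfidfVectorizer. Numeric data can be passed to the model directly.
Data dimensions: Always check the shapes of X_train, X_test, y_train, and y_test. X_train and y_train must have the same number of samples, and so must X_test and y_test.
Variable names: Review the code for misspelled or mixed-up variable names, especially after train_test_split.
Evaluation metrics: Choose metrics that suit a model with several label outputs.
Data preprocessing: Coordinate data often benefits from standardization or normalization, which can improve training. Consider StandardScaler or MinMaxScaler.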
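The preprocessing note can be sketched by placing the scaler in a pipeline ahead of the classifier. This is one possible arrangement, not the original author's code: the data is synthetic, the names X_demo, y_demo, and scaled_model are hypothetical, and max_iter=1000 is an assumed setting chosen to give the solver more room to converge.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric data: 40 samples, 6 coordinates, 2 label columns.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=100.0, scale=25.0, size=(40, 6))
y_demo = np.column_stack([
    (X_demo[:, 0] > 100.0).astype(int),  # first label depends on coordinate 0
    (X_demo[:, 1] > 100.0).astype(int),  # second label depends on coordinate 1
])

# The scaler is fitted inside the pipeline, so it only ever sees the data
# passed to fit() and the same transformation is reused at predict time.
scaled_model = make_pipeline(
    StandardScaler(),
    MultiOutputClassifier(LogisticRegression(max_iter=1000)),
)
scaled_model.fit(X_demo, y_demo)
pred_demo = scaled_model.predict(X_demo)
print(pred_demo.shape)
```

Keeping the scaler in the pipeline means that, with a real train/test split, calling fit on X_train alone avoids leaking test-set statistics into the scaling step.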