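01-Data Preprocessing
This chapter covers data preprocessing techniques for machine learning: data cleaning, feature engineering, and dataset splitting.
Real-World Scenario
You are working with a customer survey dataset and find that it is very "dirty": some fields are missing (age and income left blank), some values are plainly implausible (an age of 200), and some categorical fields (gender, occupation, and so on) need to be converted to numbers. How do you turn this raw data into clean data that a machine learning model can use? That is the core task of data preprocessing.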
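To make the scenario concrete, here is a small made-up table showing the three kinds of problems it describes. The column names and every value are illustrative assumptions, not data from a real survey.

```python
import numpy as np
import pandas as pd

# Hypothetical survey rows; all names and values are invented for illustration.
survey = pd.DataFrame({
    "age": [34, np.nan, 200, 45],                # one missing, one implausible
    "income": [5200.0, 6100.0, np.nan, 4800.0],  # one missing
    "gender": ["F", "M", "F", "M"],              # categorical
    "occupation": ["teacher", "engineer", "nurse", "engineer"],  # categorical
})

# Count missing entries per column.
missing_per_column = survey.isna().sum()
print(missing_per_column)

# Non-numeric columns are the ones that will need encoding later.
categorical_columns = survey.select_dtypes(exclude="number").columns.tolist()
print(categorical_columns)
```

A quick inventory like this is usually enough to decide which of the cleaning steps in this chapter a dataset needs.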
Data Cleaning
Handling Missing Values
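Filling with a statistic does not have to mean the mean. As a hedged sketch on made-up numbers, the example here compares mean and median filling in plain pandas. The median is often the safer choice when a column contains extreme values, though which statistic fits best depends on the data.

```python
import numpy as np
import pandas as pd

# Made-up data: the last income is deliberately extreme.
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "income": [5000.0, 6200.0, np.nan, 100000.0],
})

# Fill each column with its own mean (pulled upward by the extreme income).
df_mean = df.fillna(df.mean())

# Fill each column with its own median (barely affected by the extreme income).
df_median = df.fillna(df.median())

print(df_mean.loc[2, "income"], df_median.loc[2, "income"])
```

In this toy table the mean-filled income lands far above any typical row, while the median-filled one equals 6200.0.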
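python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from numpy.typing import NDArray
# Illustrative numeric data; np.nan marks the missing entries
df: pd.DataFrame = pd.DataFrame({'age': [25, np.nan, 40], 'income': [5000, 6200, np.nan]})
# Option 1: drop every row that contains a missing value
df_dropped: pd.DataFrame = df.dropna()
# Option 2: fill missing values with the column mean (numeric columns only)
imputer: SimpleImputer = SimpleImputer(strategy='mean')
df_filled: NDArray = imputer.fit_transform(df)  # returns a NumPy array, not a DataFrame
Outlier Detection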
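Some outliers, such as the age of 200 in the opening scenario, can be caught with a plain domain rule before any statistics are involved. The sketch here assumes a plausible age range of 0 to 120; that range and the sample values are assumptions to adapt, not part of the original material.

```python
import pandas as pd

ages = pd.Series([34, 27, 200, 45, -3], name="age", dtype="float64")

# Assumed plausible range for a human age; adjust to the survey's population.
MIN_AGE, MAX_AGE = 0, 120

# True where the value violates the rule (between() is inclusive on both ends).
invalid = ~ages.between(MIN_AGE, MAX_AGE)

# Replace invalid entries with NaN so they can be treated as missing values.
ages_clean = ages.mask(invalid)
print(ages_clean.tolist())
```

Turning rule violations into NaN lets the missing-value techniques from the previous section handle them, instead of silently deleting rows.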
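python
import numpy as np
from numpy.typing import NDArray
data: NDArray = np.array([1, 2, 3, 100, 5, 6])
# Method 1: 3-sigma rule, flag points more than 3 standard deviations from the mean
mean: float = np.mean(data)
std: float = np.std(data)
# Caveat: here 100 inflates std itself (about 36), so this list comes out empty
outliers: list[float] = [x for x in data if abs(x - mean) > 3 * std]
# Method 2: IQR rule, flag points more than 1.5 * IQR beyond the quartiles
Q1: float
Q3: float
Q1, Q3 = np.percentile(data, [25, 75])
IQR: float = Q3 - Q1
outliers_iqr: list[float] = [x for x in data if x < Q1 - 1.5*IQR or x > Q3 + 1.5*IQR]  # flags 100
Feature Engineering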
Categorical Feature Encoding
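A tiny worked example may help show what label encoding and one-hot encoding each produce. The color values are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

colors = pd.Series(["red", "green", "blue", "green"], name="color")

# Label encoding: classes are sorted, so blue -> 0, green -> 1, red -> 2.
le = LabelEncoder()
labels = le.fit_transform(colors)
print(list(le.classes_), labels.tolist())

# One-hot encoding: one indicator column per category.
onehot = pd.get_dummies(colors, prefix="color")
print(onehot.columns.tolist())
```

The integer codes imply an order (blue < green < red) that the categories do not actually have, which is one reason one-hot encoding is often preferred for nominal features. scikit-learn also documents `LabelEncoder` as intended for target labels rather than input features.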
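python
from sklearn.preprocessing import LabelEncoder
import pandas as pd
# df is assumed to be an existing DataFrame with a categorical column '颜色' (color)
# Label encoding: map each category to an integer (the integers imply an order)
le: LabelEncoder = LabelEncoder()
df['颜色_标签'] = le.fit_transform(df['颜色'])
# One-hot encoding: one indicator column per category, no implied order
onehot: pd.DataFrame = pd.get_dummies(df['颜色'], prefix='颜色')
Feature Scaling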
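Standardization and min-max normalization both reduce to simple column-wise formulas: standardization computes (x - mean) / std, and min-max normalization computes (x - min) / (max - min). The NumPy sketch here applies both to a made-up matrix, to show what the scikit-learn scalers compute under the hood.

```python
import numpy as np

# Made-up feature matrix: two columns on very different scales.
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])

# Standardization: zero mean and unit (population, ddof=0) standard deviation per column.
X_standardized = (X - X.mean(axis=0)) / X.std(axis=0)

# Min-max normalization: map each column linearly onto [0, 1].
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

print(X_standardized.round(4))
print(X_minmax)
```

These hand-written formulas divide by zero on a constant column, which is one practical reason to prefer the library scalers on real data.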
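python
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from numpy.typing import NDArray
# X is assumed to be a numeric feature matrix of shape (n_samples, n_features)
# Standardization: rescale each feature to zero mean and unit variance
scaler: StandardScaler = StandardScaler()
X_scaled: NDArray = scaler.fit_transform(X)
# Normalization: rescale each feature to the range [0, 1]
minmax: MinMaxScaler = MinMaxScaler()
X_normalized: NDArray = minmax.fit_transform(X)
Feature Selection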
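Besides univariate scoring, this chapter's summary also lists RFE (recursive feature elimination), which ranks features by repeatedly fitting a model and discarding the weakest one. The example here is a minimal sketch on synthetic data. The dataset shape, the choice of `LogisticRegression` as the estimator, and the target of 3 features are all assumptions made for the demo.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data: 10 features, 3 of them informative (arbitrary demo choice).
X, y = make_classification(n_samples=200, n_features=10, n_informative=3,
                           n_redundant=0, random_state=0)

# RFE fits the estimator, drops the weakest feature, and repeats until 3 remain.
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=3)
X_selected = rfe.fit_transform(X, y)

print(X_selected.shape)  # (200, 3)
print(rfe.support_)      # boolean mask marking the kept features
```

Unlike `SelectKBest`, which scores each feature independently, RFE judges features in the context of a fitted model, at the cost of fitting that model many times.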
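python
from sklearn.feature_selection import SelectKBest, f_classif
from numpy.typing import NDArray
# X (features) and y (class labels) are assumed to be defined; X needs at least 5 columns
# Keep the 5 features with the highest ANOVA F-scores against y
selector: SelectKBest = SelectKBest(score_func=f_classif, k=5)
X_selected: NDArray = selector.fit_transform(X, y)
Dataset Splitting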
Training / Validation / Test Sets
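A two-stage call to `train_test_split`, first with `test_size=0.2` and then with `test_size=0.25` on the remainder, gives a 60/20/20 split. The sketch here checks that on made-up data. The `stratify` and `random_state` arguments are optional additions that keep class proportions equal across the subsets and make the split reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 samples, two balanced classes.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 50 + [1] * 50)

# Stage 1: hold out 20% as the test set.
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Stage 2: 25% of the remaining 80% becomes the validation set.
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, stratify=y_temp, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```

Stratifying matters most for small or imbalanced datasets, where a purely random split can leave a class underrepresented in the validation or test set.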
python
from sklearn.model_selection import train_test_split
from numpy.typing import NDArray
# X and y are assumed to be defined
X_temp: NDArray
X_test: NDArray
y_temp: NDArray
y_test: NDArray
# Step 1: hold out 20% of the data as the test set
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2)
X_train: NDArray
X_val: NDArray
y_train: NDArray
y_val: NDArray
# Step 2: take 25% of the remaining 80% as validation, giving a 60/20/20 split
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25)
Cross-Validation
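When preprocessing such as scaling is part of the workflow, one common pattern is to wrap it together with the model in a pipeline and cross-validate the pipeline as a whole. The sketch here assumes the Iris dataset and a `LogisticRegression` classifier purely for illustration; any estimator could stand in for it.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# With the scaler inside the pipeline, it is re-fitted on the training folds
# only, so no statistics from the held-out fold leak into preprocessing.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# 5-fold cross-validation returns one accuracy score per fold.
scores = cross_val_score(model, X, y, cv=5)
print(scores)
print(f"Mean score: {scores.mean():.4f}")
```

Scaling the full dataset before cross-validating would let each fold's test portion influence the scaler, which tends to make the scores slightly optimistic.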
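python
from sklearn.model_selection import cross_val_score
from numpy.typing import NDArray
# model is assumed to be an unfitted scikit-learn estimator; X and y are assumed to be defined
# 5-fold cross-validation: returns one score per fold
scores: NDArray = cross_val_score(model, X, y, cv=5)
print(f"Mean score: {scores.mean():.4f}")
Chapter Summary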
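┌─────────────────────────────────────────────────────────────┐
│  Data Preprocessing: Key Points                             │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Data cleaning:                                             │
│    ✓ Missing values: deletion, statistical imputation       │
│    ✓ Outliers: 3σ rule, IQR method                          │
│                                                             │
│  Feature engineering:                                       │
│    ✓ Categorical encoding: label encoding, one-hot encoding │
│    ✓ Feature scaling: standardization, normalization        │
│    ✓ Feature selection: SelectKBest, RFE                    │
│                                                             │
│  Dataset splitting:                                         │
│    ✓ train_test_split()                                     │
│    ✓ cross_val_score() for cross-validation                 │
│                                                             │
└─────────────────────────────────────────────────────────────┘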