Task4: Model Tuning

Fundamentals:

Linear Regression

$f(x)=w^{\prime} x+b$

# w is a column vector; the matrix X is made of column vectors (one sample per column): y = dot(w_t, X) + b
import numpy as np
w_t, b = np.array([1, 2, 3, 4, 5]), 1
X = np.array([[1, 1, 1, 1, 1], [1, 2, 5, 3, 4], [5, 5, 5, 5, 5]]).T
y_hat = np.dot(w_t, X) + b

Strategy: evaluate $w$ and $b$ through a loss function: $loss=(f(x)-y)^{2}$

Optimization:

  • Least squares (closed-form solution): $w^{T}=y X^{T}\left(X X^{T}\right)^{-1}$

    w_t = np.dot(np.dot(y, X.T), np.linalg.inv(np.dot(X, X.T)))
  • Gradient descent

    $\nabla_{w} loss=2\left(w^{T} X-y\right) X^{T}$

    $w^{T} \leftarrow w^{T}-\eta \cdot \nabla_{w} loss$

    # gradient descent: step against the gradient until it (almost) vanishes
    while True:
        grad = 2 * np.dot(np.dot(w_t, X) - y, X.T)
        w_t = w_t - 0.1 * grad
        if np.linalg.norm(grad, ord=2) < 1e-3:  # stop when the gradient norm is small
            break

Gradient Boosted Decision Trees (GBDT)

CART Regression Trees

GBDT is an ensemble model that can be viewed as a linear combination of many base models, where each base model is a CART regression tree.

A CART tree is a binary decision tree: at every node the split asks whether a feature condition holds, with branches for "yes" and "no".

  • Regression tree generation: a regression tree corresponds to a partition of the input space. Suppose the input space has been partitioned into $M$ cells $R_{1}, R_{2}, \dots, R_{M}$, each with a fixed output value $c_{m}$; $I$ below denotes the indicator function.

    Input: training set $D=\left\{\left(x_{1}, y_{1}\right),\left(x_{2}, y_{2}\right), \dots,\left(x_{n}, y_{n}\right)\right\}$

    Output: a regression tree $f(x)=\sum_{m=1}^{M} \hat{c}_{m} I\left(x \in R_{m}\right)$

    $\hat{c}_{m}=\underset{c_{m}}{\arg \min } \sum_{x_{i} \in R_{m}}\left(y_{i}-c_{m}\right)^{2}$

    It follows that the optimal output on $R_{m}$ is the mean of the $y_{i}$ over all samples $x_{i}$ in $R_{m}$: $\hat{c}_{m}=\operatorname{ave}\left(y_{i} \mid x_{i} \in R_{m}\right)$.

    How is the input space partitioned? By choosing the best feature and split point. For a continuous feature $x^{(j)}$ with candidate split point $s$, let $R_{1}(j, s)=\left\{x \mid x^{(j)} \leq s\right\}$ and $R_{2}(j, s)=\left\{x \mid x^{(j)}>s\right\}$; the optimal variable $j$ and split point $s$ solve

    $\min _{j, s}\left[\min _{c_{1}} \sum_{x_{i} \in R_{1}(j, s)}\left(y_{i}-c_{1}\right)^{2}+\min _{c_{2}} \sum_{x_{i} \in R_{2}(j, s)}\left(y_{i}-c_{2}\right)^{2}\right]$

    (a brute-force search sketch follows this list).

  • Regression tree pruning

    Start from the full tree $T_{0}$. For any internal node $t$, compare the loss of keeping $t$ as a single leaf against the loss of keeping the subtree rooted at $t$; if the single leaf does better, prune the subtree away. Repeat this process on the pruned tree.

    loss: $C_{\alpha}(T)=C(T)+\alpha|T|$

    $C(T)$ is the error on the training data, $|T|$ is the number of leaf nodes, and $\alpha$ trades accuracy against model simplicity (overfitting vs. underfitting).
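
To make the split search concrete, here is a minimal brute-force sketch (an illustration under the assumptions that X is an (n, d) NumPy array and y a length-n target vector, not a library implementation):

import numpy as np

def best_split(X, y):
    # Try every feature j and every observed value s as a threshold, scoring each
    # candidate by the total squared error around the two child means.
    n, d = X.shape
    best_j, best_s, best_loss = None, None, np.inf
    for j in range(d):
        for s in np.unique(X[:, j]):
            left, right = y[X[:, j] <= s], y[X[:, j] > s]
            if len(left) == 0 or len(right) == 0:
                continue
            # the optimal c1, c2 are the child means, so the loss is the
            # sum of squared deviations from those means
            loss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if loss < best_loss:
                best_j, best_s, best_loss = j, s, loss
    return best_j, best_s, best_loss

A full tree is grown by applying best_split recursively to each resulting region until a stopping rule fires, after which pruning proceeds backwards as described above.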

GBDT

The model is defined as $f_{T}(x)=\sum_{t=1}^{T} h_{t}(x)$, where $f_{t}(x)$ is the model after round $t$ and $h_{t}(x)$ is the $t$-th decision tree.

Forward stagewise algorithm: $f_{t}(x)=f_{t-1}(x)+h_{t}(x)$

Loss: $L\left(y, f_{t}(x)\right)=L\left(y, f_{t-1}(x)+h_{t}(x)\right)$

The negative gradient of the loss for sample $i$ in round $t$ is:

$r_{t i}=-\left[\frac{\partial L\left(y_{i}, f\left(x_{i}\right)\right)}{\partial f\left(x_{i}\right)}\right]_{f(x)=f_{t-1}(x)}$

For the squared loss this is simply the residual: $r_{t i}=y_{i}-f_{t-1}\left(x_{i}\right)$.

Using the pairs $\left(x_{i}, r_{t i}\right)$ $(i=1,2, \dots, m)$, we fit a CART regression tree, obtaining the $t$-th regression tree with leaf regions $R_{t j}$, $j=1,2, \dots, J$, where $J$ is the number of leaves.

For the samples in each leaf, we find the output value $c_{t j}$ that minimizes the loss function, i.e. the best-fitting output for that leaf (note that $y_{i}$ here is the true value, not the residual):

$c_{t j}=\underset{c}{\arg \min } \sum_{x_{i} \in R_{t j}} L\left(y_{i}, f_{t-1}\left(x_{i}\right)+c\right)$

This yields the decision tree fitted in this round: $h_{t}(x)=\sum_{j=1}^{J} c_{t j} I\left(x \in R_{t j}\right)$

And therefore the strong learner of this round: $f_{t}(x)=f_{t-1}(x)+\sum_{j=1}^{J} c_{t j} I\left(x \in R_{t j}\right)$

Iterate in this way until the loss function converges.
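
As a sanity check of this procedure, below is a minimal GBDT sketch for the squared loss, using sklearn's DecisionTreeRegressor as the CART base learner; the constant initialization f0 and the shrinkage rate lr are standard ingredients not spelled out above.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbdt_fit(X, y, n_rounds=100, lr=0.1, max_depth=3):
    f0 = y.mean()                           # initial constant model
    pred = np.full(len(y), f0)
    trees = []
    for t in range(n_rounds):
        r = y - pred                        # negative gradient of squared loss = residual
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, r)
        pred = pred + lr * tree.predict(X)  # forward stagewise update
        trees.append(tree)
    return f0, trees

def gbdt_predict(X, f0, trees, lr=0.1):
    return f0 + lr * sum(tree.predict(X) for tree in trees)

For the squared loss the tree's own leaf means coincide with the optimal c_tj, so no separate per-leaf line search is needed.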

XGBoost

Objective function: $\mathcal{L}^{(t)}=\sum_{i=1}^{n} l\left(y_{i}, \hat{y}_{i}^{(t-1)}+f_{t}\left(\mathbf{x}_{i}\right)\right)+\Omega\left(f_{t}\right)$

  • The regularization term $\Omega\left(f_{t}\right)$ in round $t$ covers only the $t$-th tree; concretely, $\Omega\left(f_{t}\right)=\gamma T+\frac{1}{2} \lambda \sum_{j=1}^{T} w_{j}^{2}$, a penalty on the number of leaves $T$ plus the squared L2 norm of the tree's leaf-weight vector $w$.

$\mathcal{L}^{(t)} \simeq \sum_{i=1}^{n}\left[l\left(y_{i}, \hat{y}^{(t-1)}\right)+g_{i} f_{t}\left(\mathbf{x}_{i}\right)+\frac{1}{2} h_{i} f_{t}^{2}\left(\mathbf{x}_{i}\right)\right]+\Omega\left(f_{t}\right)$

  • Second-order Taylor expansion around $\hat{y}^{(t-1)}$, assuming the residual (the correction $f_{t}\left(\mathbf{x}_{i}\right)$) is close to zero; here $g_{i}=\partial_{\hat{y}^{(t-1)}} l\left(y_{i}, \hat{y}^{(t-1)}\right)$ and $h_{i}=\partial_{\hat{y}^{(t-1)}}^{2} l\left(y_{i}, \hat{y}^{(t-1)}\right)$ are the first and second derivatives of the loss.

  • For reference, the Maclaurin series: $f(x)=f(0)+\frac{f^{\prime}(0)}{1 !} x+\frac{f^{\prime \prime}(0)}{2 !} x^{2}+\frac{f^{\prime \prime \prime}(0)}{3 !} x^{3}+\dots+\frac{f^{(n)}(0)}{n !} x^{n}+R_{n}(x)$
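
To spell out the step from the expansion above to the simplified objective below: expanding the loss to second order around $\hat{y}^{(t-1)}$ gives

$l\left(y_{i}, \hat{y}^{(t-1)}+f_{t}\left(\mathbf{x}_{i}\right)\right) \approx l\left(y_{i}, \hat{y}^{(t-1)}\right)+g_{i} f_{t}\left(\mathbf{x}_{i}\right)+\frac{1}{2} h_{i} f_{t}^{2}\left(\mathbf{x}_{i}\right)$

Since $l\left(y_{i}, \hat{y}^{(t-1)}\right)$ is a constant in round $t$, it can be dropped without changing the minimizer; writing out $\Omega\left(f_{t}\right)=\gamma T+\frac{1}{2} \lambda \sum_{j=1}^{T} w_{j}^{2}$ then yields: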

$\begin{aligned} \tilde{\mathcal{L}}^{(t)} &=\sum_{i=1}^{n}\left[g_{i} f_{t}\left(\mathbf{x}_{i}\right)+\frac{1}{2} h_{i} f_{t}^{2}\left(\mathbf{x}_{i}\right)\right]+\gamma T+\frac{1}{2} \lambda \sum_{j=1}^{T} w_{j}^{2} \\ &=\sum_{j=1}^{T}\left[\left(\sum_{i \in I_{j}} g_{i}\right) w_{j}+\frac{1}{2}\left(\sum_{i \in I_{j}} h_{i}+\lambda\right) w_{j}^{2}\right]+\gamma T \end{aligned}$

  • Assigning each sample $i$ to the leaf node $j$ it falls into (with $I_{j}$ the index set of samples in leaf $j$), the objective can be rewritten as above.

Closed-form solution for the leaf weights: $w_{j}^{*}=-\frac{\sum_{i \in I_{j}} g_{i}}{\sum_{i \in I_{j}} h_{i}+\lambda}$

Substituting back gives the structure score: $\tilde{\mathcal{L}}^{(t)}(q)=-\frac{1}{2} \sum_{j=1}^{T} \frac{\left(\sum_{i \in I_{j}} g_{i}\right)^{2}}{\sum_{i \in I_{j}} h_{i}+\lambda}+\gamma T$
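
To see how these formulas drive tree construction, here is a small sketch (helper names are my own; the notation follows the equations above) computing the optimal leaf weight, the per-leaf structure score, and the gain of a candidate split from per-sample gradients g_i and Hessians h_i.

import numpy as np

def leaf_weight(g, h, lam):
    # w_j* = -G_j / (H_j + lambda), with G_j, H_j summed over the leaf's samples
    return -g.sum() / (h.sum() + lam)

def leaf_score(g, h, lam, gamma):
    # one leaf's contribution to the structure score: -1/2 * G^2 / (H + lambda) + gamma
    return -0.5 * g.sum() ** 2 / (h.sum() + lam) + gamma

def split_gain(g_left, h_left, g_right, h_right, lam, gamma):
    # objective reduction from splitting a node into left/right children
    def term(g, h):
        return g.sum() ** 2 / (h.sum() + lam)
    parent = term(np.concatenate([g_left, g_right]), np.concatenate([h_left, h_right]))
    return 0.5 * (term(g_left, h_left) + term(g_right, h_right) - parent) - gamma

At each node XGBoost greedily takes the split with the largest gain, with gamma acting as the minimum gain required for a split to be kept.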

LightGBM

Hands-on Practice

Linear Regression

# Simple model fit
from sklearn.linear_model import LinearRegression
model = LinearRegression(normalize=True)
model = model.fit(train_X, train_y)


# Inspect the intercept and weights (coef) of the trained linear regression model
'intercept:'+ str(model.intercept_)
sorted(dict(zip(continuous_feature_names, model.coef_)).items(), key=lambda x:x[1], reverse=True)


# Scatter plot of feature v_9 against the label
from matplotlib import pyplot as plt
subsample_index = np.random.randint(low=0, high=len(train_y), size=50)

plt.scatter(train_X['v_9'][subsample_index], train_y[subsample_index], color='black')
plt.scatter(train_X['v_9'][subsample_index], model.predict(train_X.loc[subsample_index]), color='blue')
plt.xlabel('v_9')
plt.ylabel('price')
plt.legend(['True Price','Predicted Price'], loc='upper right')
print('The predicted price is obviously different from the true price')
plt.show()

Examine the label distribution.

import seaborn as sns
print('It is clear that the price shows a typical exponential distribution')
plt.figure(figsize=(15,5))
plt.subplot(1,2,1)
sns.distplot(train_y)
plt.subplot(1,2,2)
sns.distplot(train_y[train_y < np.quantile(train_y, 0.9)])

log(x+1) transform (the +1 keeps zero values finite)

train_y_ln = np.log(train_y + 1)

import seaborn as sns
print('The transformed price looks close to a normal distribution')
plt.figure(figsize=(15,5))
plt.subplot(1,2,1)
sns.distplot(train_y_ln)
plt.subplot(1,2,2)
sns.distplot(train_y_ln[train_y_ln < np.quantile(train_y_ln, 0.9)])

Refit and validate

model = model.fit(train_X, train_y_ln)

print('intercept:'+ str(model.intercept_))
sorted(dict(zip(continuous_feature_names, model.coef_)).items(), key=lambda x:x[1], reverse=True)

Five-Fold Cross-Validation

Rather than using the entire dataset for training, we hold out a portion (which does not participate in training) to test the parameters learned from the training set, giving a relatively objective estimate of how well they generalize to data outside the training set. This idea is called cross-validation.

from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_absolute_error, make_scorer
import pandas as pd

def log_transfer(func):
    # Wrap a metric so that it is computed on log-scale targets
    def wrapper(y, yhat):
        result = func(np.log(y), np.nan_to_num(np.log(yhat)))
        return result
    return wrapper

# Five-fold CV of the linear model on the untransformed label
scores = cross_val_score(model, X=train_X, y=train_y, verbose=1, cv=5, scoring=make_scorer(log_transfer(mean_absolute_error)))
print('AVG:', np.mean(scores))

# Five-fold CV of the linear model on the log-transformed label
scores = cross_val_score(model, X=train_X, y=train_y_ln, verbose=1, cv=5, scoring=make_scorer(mean_absolute_error))
print('AVG:', np.mean(scores))


scores = pd.DataFrame(scores.reshape(1,-1))
scores.columns = ['cv' + str(x) for x in range(1, 6)]
scores.index = ['MAE']
scores

Plotting the learning curve and validation curve

from sklearn.model_selection import learning_curve, validation_curve
? learning_curve

def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None, n_jobs=1, train_size=np.linspace(.1, 1.0, 5)):
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel('Training examples')
    plt.ylabel('Score')
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_size,
        scoring=make_scorer(mean_absolute_error))
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()  # draw the background grid
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1,
                     color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color='r',
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color='g',
             label="Cross-validation score")
    plt.legend(loc="best")
    return plt

plot_learning_curve(LinearRegression(), 'Linear_model', train_X[:1000], train_y_ln[:1000], ylim=(0.0, 0.5), cv=5, n_jobs=1)

Multi-Model Comparison

train = sample_feature[continuous_feature_names + ['price']].dropna()

train_X = train[continuous_feature_names]
train_y = train['price']
train_y_ln = np.log(train_y + 1)

The most common embedded feature-selection methods are L1 and L2 regularization. Adding them to the linear regression model yields Lasso regression (L1) and ridge regression (L2), respectively.
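
For reference, ridge regression solves $\min _{w}\|y-X w\|_{2}^{2}+\lambda\|w\|_{2}^{2}$, while Lasso solves $\min _{w}\|y-X w\|_{2}^{2}+\lambda\|w\|_{1}$; sklearn's Ridge and Lasso below expose $\lambda$ as the alpha parameter.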

from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso

models = [LinearRegression(),
          Ridge(),
          Lasso()]

result = dict()
for model in models:
    model_name = str(model).split('(')[0]
    scores = cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv=5, scoring=make_scorer(mean_absolute_error))
    result[model_name] = scores
    print(model_name + ' is finished')

result = pd.DataFrame(result)
result.index = ['cv' + str(x) for x in range(1, 6)]
result
model = LinearRegression().fit(train_X, train_y_ln)
print('intercept:'+ str(model.intercept_))
sns.barplot(abs(model.coef_), continuous_feature_names)

L2 regularization tends to shrink the weights during fitting, producing a model in which all parameters are relatively small.

model = Ridge().fit(train_X, train_y_ln)
print('intercept:'+ str(model.intercept_))
sns.barplot(abs(model.coef_), continuous_feature_names)

L1 regularization encourages a sparse weight vector, which in turn can be used for feature selection.

model = Lasso().fit(train_X, train_y_ln)
print('intercept:'+ str(model.intercept_))
sns.barplot(abs(model.coef_), continuous_feature_names)

Some Common Nonlinear Models

from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.neural_network import MLPRegressor
from xgboost.sklearn import XGBRegressor
from lightgbm.sklearn import LGBMRegressor

models = [LinearRegression(),
          DecisionTreeRegressor(),
          RandomForestRegressor(),
          GradientBoostingRegressor(),
          MLPRegressor(solver='lbfgs', max_iter=100),
          XGBRegressor(n_estimators=100, objective='reg:squarederror'),
          LGBMRegressor(n_estimators=100)]

result = dict()
for model in models:
    model_name = str(model).split('(')[0]
    scores = cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv=5, scoring=make_scorer(mean_absolute_error))
    result[model_name] = scores
    print(model_name + ' is finished')

result = pd.DataFrame(result)
result.index = ['cv' + str(x) for x in range(1, 6)]
result

Model Tuning

## Candidate parameter sets for LGBM:

objective = ['regression', 'regression_l1', 'mape', 'huber', 'fair']

num_leaves = [3, 5, 10, 15, 20, 40, 55]
max_depth = [3, 5, 10, 15, 20, 40, 55]
bagging_fraction = []
feature_fraction = []
drop_rate = []
  • Greedy tuning

    best_obj = dict()
    for obj in objective:
        model = LGBMRegressor(objective=obj)
        score = np.mean(cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv=5, scoring=make_scorer(mean_absolute_error)))
        best_obj[obj] = score

    # Tune num_leaves with the best objective fixed
    best_leaves = dict()
    for leaves in num_leaves:
        model = LGBMRegressor(objective=min(best_obj.items(), key=lambda x: x[1])[0], num_leaves=leaves)
        score = np.mean(cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv=5, scoring=make_scorer(mean_absolute_error)))
        best_leaves[leaves] = score

    # Tune max_depth with the best objective and num_leaves fixed
    best_depth = dict()
    for depth in max_depth:
        model = LGBMRegressor(objective=min(best_obj.items(), key=lambda x: x[1])[0],
                              num_leaves=min(best_leaves.items(), key=lambda x: x[1])[0],
                              max_depth=depth)
        score = np.mean(cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv=5, scoring=make_scorer(mean_absolute_error)))
        best_depth[depth] = score

    # Visualize the improvement after each greedy step
    sns.lineplot(x=['0_initial', '1_turning_obj', '2_turning_leaves', '3_turning_depth'], y=[0.143, min(best_obj.values()), min(best_leaves.values()), min(best_depth.values())])
  • Grid search tuning

    from sklearn.model_selection import GridSearchCV

    parameters = {'objective': objective, 'num_leaves': num_leaves, 'max_depth': max_depth}
    model = LGBMRegressor()
    clf = GridSearchCV(model, parameters, cv=5)
    clf = clf.fit(train_X, train_y_ln)

    clf.best_params_
    # Refit with the best parameters found by the grid search
    model = LGBMRegressor(objective='regression',
                          num_leaves=55,
                          max_depth=15)

    np.mean(cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv=5, scoring=make_scorer(mean_absolute_error)))
  • Bayesian optimization tuning

    from bayes_opt import BayesianOptimization

    def rf_cv(num_leaves, max_depth, subsample, min_child_samples):
        val = cross_val_score(
            LGBMRegressor(objective='regression_l1',
                          num_leaves=int(num_leaves),
                          max_depth=int(max_depth),
                          subsample=subsample,
                          min_child_samples=int(min_child_samples)),
            X=train_X, y=train_y_ln, verbose=0, cv=5, scoring=make_scorer(mean_absolute_error)
        ).mean()
        return 1 - val  # BayesianOptimization maximizes, so return 1 - MAE

    rf_bo = BayesianOptimization(
        rf_cv,
        {
            'num_leaves': (2, 100),
            'max_depth': (2, 100),
            'subsample': (0.1, 1),
            'min_child_samples': (2, 100)
        }
    )

    rf_bo.maximize()
    1 - rf_bo.max['target']  # best cross-validated MAE found

Several basic techniques were also applied to improve prediction accuracy:

plt.figure(figsize=(13,5))
sns.lineplot(x=['0_origin','1_log_transfer','2_L1_&_L2','3_change_model','4_parameter_turning'], y=[1.36 ,0.19, 0.19, 0.14, 0.13])
