Competition URL

Model

FTRL + FM + LR + GBDT + RNN + Ridge reached the top 4% (rank 78 of 2091). Keep pushing.

Defining the RMSLE evaluation function

The RMSLE evaluation function is as follows:
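Written out in its standard form, RMSLE can be stated as below. The notation is an assumption made here for illustration: n is the number of samples, y_i the true price, and \hat{y}_i the predicted price, matching the arguments y and y_pred of the rmsle function in this section.

```latex
\mathrm{RMSLE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \bigl( \log(\hat{y}_i + 1) - \log(y_i + 1) \bigr)^2 }
```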

Note: for the same absolute error, this metric penalizes under-prediction more heavily than over-prediction.
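The asymmetry can be checked with a small sketch. The prices below are made-up illustration values, and squared_log_error is a hypothetical helper that computes a single term of the RMSLE sum.

```python
import math

def squared_log_error(y_true, y_pred):
    """One term of the RMSLE sum: squared difference of log(x + 1) values."""
    return (math.log(y_pred + 1) - math.log(y_true + 1)) ** 2

true_price = 100.0
under = squared_log_error(true_price, 50.0)   # prediction is 50 too low
over = squared_log_error(true_price, 150.0)   # prediction is 50 too high

print(f"under-prediction term: {under:.4f}")
print(f"over-prediction term:  {over:.4f}")
```

Both predictions miss by the same absolute amount, but the under-prediction contributes the larger term, because the logarithm is steeper at smaller values.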

import math

def rmsle(y, y_pred):
    # Root mean squared logarithmic error between true values y and predictions y_pred.
    assert len(y) == len(y_pred)
    to_sum = [(math.log(p + 1) - math.log(t + 1)) ** 2.0 for t, p in zip(y, y_pred)]
    return (sum(to_sum) * (1.0 / len(y))) ** 0.5

Loading the data

import pandas as pd

train = pd.read_table("../input/train.tsv")
test = pd.read_table("../input/test.tsv")

Handling categorical features

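# Assumption: all_df is train and test stacked; its definition is not shown in these notes.
all_df = pd.concat([train, test], ignore_index=True)
# Split "level1/level2/level3/..." into one column per level and keep the first three.
category_split_result = all_df.category_name.str.split("/", expand=True).astype(str)
all_df['cat1'] = category_split_result[0]
all_df['cat2'] = category_split_result[1]
all_df['cat3'] = category_split_result[2]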
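A plain-Python sketch of the same three-level split makes the edge cases explicit. The helper split_category, the "missing" placeholder, and the example category strings are assumptions for illustration. Unlike the pandas snippet, this version keeps any levels deeper than the third joined in the last part instead of dropping them.

```python
def split_category(name, levels=3, fill="missing"):
    """Split 'a/b/c/...' into exactly `levels` parts.

    Missing or non-string values yield all-`fill`; anything deeper than
    `levels` stays joined in the last part.
    """
    parts = name.split("/", levels - 1) if isinstance(name, str) else []
    return parts + [fill] * (levels - len(parts))

# Made-up examples: a regular category, a deeper one, a shallow one, a missing one.
examples = [
    "Men/Tops/T-shirts",
    "Electronics/Computers & Tablets/iPad/Tablet/eBook Readers",
    "Handmade",
    None,
]
for name in examples:
    print(split_category(name))
```

On a DataFrame, such a helper could be applied row by row to category_name, at the cost of being slower than the vectorized str.split.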

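import numpy as np
from sklearn.preprocessing import LabelEncoder

# Assumes missing category_name / brand_name values were already filled with a
# string placeholder; LabelEncoder may fail on a mix of strings and NaN.
le = LabelEncoder()

# Fit on train and test together so every label in either set gets an integer id.
le.fit(np.hstack([train.category_name, test.category_name]))
train.category_name = le.transform(train.category_name)
test.category_name = le.transform(test.category_name)

le.fit(np.hstack([train.brand_name, test.brand_name]))
train.brand_name = le.transform(train.brand_name)
test.brand_name = le.transform(test.brand_name)
del le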
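The reason for fitting on train and test together can be sketched without scikit-learn. The function fit_label_map below is a hypothetical stand-in, not LabelEncoder itself; it only mirrors the idea of assigning integer ids to the sorted unique values. The brand lists are made up.

```python
def fit_label_map(*columns):
    """Map each distinct value across all given columns to an integer id."""
    values = sorted({value for column in columns for value in column})
    return {value: idx for idx, value in enumerate(values)}

train_brands = ["Nike", "missing", "Apple"]
test_brands = ["Apple", "Sony"]  # "Sony" never appears in train

# Fitting on both sets guarantees that every test label has an id.
mapping = fit_label_map(train_brands, test_brands)
train_ids = [mapping[b] for b in train_brands]
test_ids = [mapping[b] for b in test_brands]
print(train_ids, test_ids)

# Fitting on train alone would leave "Sony" without an id.
train_only = fit_label_map(train_brands)
print("Sony" in train_only)
```

With a mapping built from train alone, transforming the test set would fail on any brand or category that only occurs there.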

Converting text to sequences

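from keras.preprocessing.text import Tokenizer
# The vocabulary is built from the training text only (descriptions and names);
# with the default Tokenizer settings, words seen only in test are dropped later.
raw_text = np.hstack([train.item_description.str.lower(), train.name.str.lower()])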

print("   Fitting tokenizer...")
tok_raw = Tokenizer()
tok_raw.fit_on_texts(raw_text)
print("   Transforming text to seq...")

train["seq_item_description"] = tok_raw.texts_to_sequences(train.item_description.str.lower())
test["seq_item_description"] = tok_raw.texts_to_sequences(test.item_description.str.lower())
train["seq_name"] = tok_raw.texts_to_sequences(train.name.str.lower())
test["seq_name"] = tok_raw.texts_to_sequences(test.name.str.lower())
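What fitting and transforming do here can be illustrated with a simplified pure-Python stand-in. This is a sketch, not the Keras implementation: it only lowercases and splits on whitespace, whereas the Keras Tokenizer also filters punctuation. The function names and sample texts are assumptions for illustration.

```python
from collections import Counter

def fit_word_index(texts):
    """Assign ids 1, 2, 3, ... to words by descending frequency (0 is left free for padding)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: rank + 1 for rank, (word, _) in enumerate(counts.most_common())}

def texts_to_sequences(texts, word_index):
    """Replace each known word with its id; unknown words are skipped."""
    return [
        [word_index[w] for w in text.lower().split() if w in word_index]
        for text in texts
    ]

train_texts = ["new nike shoes", "nike shirt"]
word_index = fit_word_index(train_texts)

# "adidas" was never seen during fitting, so it disappears from the sequence.
seqs = texts_to_sequences(["nike shoes", "adidas shirt"], word_index)
print(word_index)
print(seqs)
```

The more frequent a word is in the fitted text, the smaller its id, which is why the vocabulary source (train only versus train plus test) affects which test words survive.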

Splitting the dataset

from sklearn.model_selection import train_test_split
dtrain, dvalid = train_test_split(train, random_state=123, train_size=0.99)

References