Competition URL
Model
An ensemble of FTRL + FM + LR + GBDT + RNN + Ridge reached the top 4% (78/2091). Keep pushing.
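Defining the RMSLE evaluation function
The RMSLE (root mean squared logarithmic error) metric is defined as follows, where $y_i$ is the true value, $\hat{y}_i$ the prediction, and $n$ the number of samples:
$$\mathrm{RMSLE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\log(\hat{y}_i + 1) - \log(y_i + 1)\right)^2}$$
Note: for the same absolute error, this metric penalizes under-prediction more heavily than over-prediction.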
import math

def rmsle(y, y_pred):
    # Root mean squared logarithmic error between true values y and predictions y_pred
    assert len(y) == len(y_pred)
    to_sum = [(math.log(p + 1) - math.log(t + 1)) ** 2.0 for t, p in zip(y, y_pred)]
    return (sum(to_sum) * (1.0 / len(y))) ** 0.5
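To make the asymmetry of the metric concrete, the sketch below compares an under-prediction and an over-prediction that miss the true value by the same absolute amount. The vectorized helper `rmsle_np` and the sample prices are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def rmsle_np(y, y_pred):
    """Vectorized RMSLE (hypothetical helper, same formula as the loop version)."""
    y = np.asarray(y, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    assert y.shape == y_pred.shape
    # log1p(x) computes log(x + 1)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y)) ** 2)))

# Hypothetical true price of 100, missed by 50 in each direction.
under = rmsle_np([100.0], [50.0])   # prediction too low
over = rmsle_np([100.0], [150.0])   # prediction too high
print("under-prediction error:", under)
print("over-prediction error: ", over)
```

The under-prediction yields the larger error, because the logarithm compresses large values more than small ones.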
Loading the data
import pandas as pd
train = pd.read_table("../input/train.tsv")
test = pd.read_table("../input/test.tsv")
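Processing categorical features
# Assumption: all_df is the row-wise concatenation of train and test,
# e.g. all_df = pd.concat([train, test]); its definition is not shown in these notes.
# Split "level1/level2/level3" category strings into one column per level.
category_split_result = all_df.category_name.str.split("/", expand=True).astype(str)
all_df['cat1'] = category_split_result[0]
all_df['cat2'] = category_split_result[1]
all_df['cat3'] = category_split_result[2]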
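As a small illustration of the category split, the sketch below runs the same `str.split("/", expand=True)` call on a toy frame. The frame `demo` and its rows are made up for this example. How a missing category is rendered after `astype(str)` can depend on the pandas version, so it is worth inspecting the printed output.

```python
import pandas as pd

# Hypothetical rows; the last one has no category at all.
demo = pd.DataFrame({
    "category_name": [
        "Men/Tops/T-shirts",
        "Electronics/Computers & Tablets/Components & Parts",
        None,
    ]
})

# expand=True returns one column per "/"-separated level (labelled 0, 1, 2, ...).
parts = demo["category_name"].str.split("/", expand=True).astype(str)
demo["cat1"] = parts[0]
demo["cat2"] = parts[1]
demo["cat3"] = parts[2]
print(demo[["cat1", "cat2", "cat3"]])
```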
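import numpy as np
from sklearn.preprocessing import LabelEncoder

# Fit on train and test together so that every label appearing in test is known to the encoder.
# Note: missing values in these columns should be filled beforehand; NaN mixed with strings typically makes LabelEncoder raise an error.
le = LabelEncoder()
le.fit(np.hstack([train.category_name, test.category_name]))
train.category_name = le.transform(train.category_name)
test.category_name = le.transform(test.category_name)
le.fit(np.hstack([train.brand_name, test.brand_name]))
train.brand_name = le.transform(train.brand_name)
test.brand_name = le.transform(test.brand_name)
del le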
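The following is a minimal sketch of the label-encoding step on toy data, including one possible way to handle missing values first. The frames `train_demo` and `test_demo`, the column `brand_id`, and the `"missing"` placeholder are assumptions made for illustration; the original notes do not show how missing values were treated.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical miniature train/test frames.
train_demo = pd.DataFrame({"brand_name": ["Nike", None, "Apple"]})
test_demo = pd.DataFrame({"brand_name": ["Apple", "Sony", None]})

# Assumption: missing brands are replaced by the literal string "missing".
for frame in (train_demo, test_demo):
    frame["brand_name"] = frame["brand_name"].fillna("missing")

# Fit on the union so that labels seen only in test ("Sony") are still known.
le_demo = LabelEncoder()
le_demo.fit(np.hstack([train_demo["brand_name"], test_demo["brand_name"]]))
train_demo["brand_id"] = le_demo.transform(train_demo["brand_name"])
test_demo["brand_id"] = le_demo.transform(test_demo["brand_name"])

# classes_ holds the sorted unique labels; a label's position is its integer code.
print(list(le_demo.classes_))
print(train_demo)
print(test_demo)
```

Fitting on train alone would make `transform` fail on `"Sony"`, which is why both frames are stacked before fitting.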
Converting text to sequences
from keras.preprocessing.text import Tokenizer
raw_text = np.hstack([train.item_description.str.lower(), train.name.str.lower()])
print(" Fitting tokenizer...")
tok_raw = Tokenizer()
tok_raw.fit_on_texts(raw_text)
print(" Transforming text to seq...")
train["seq_item_description"] = tok_raw.texts_to_sequences(train.item_description.str.lower())
test["seq_item_description"] = tok_raw.texts_to_sequences(test.item_description.str.lower())
train["seq_name"] = tok_raw.texts_to_sequences(train.name.str.lower())
test["seq_name"] = tok_raw.texts_to_sequences(test.name.str.lower())
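To show what the text-to-sequence step produces, here is a simplified pure-Python stand-in for the tokenizer. It is not the Keras implementation: the functions `fit_word_index` and `texts_to_sequences`, the regular expression, and the alphabetical tie-breaking are assumptions chosen to keep the sketch short. It only mirrors the general idea of lowercasing the text, indexing words by descending frequency starting from 1, and dropping words that were never seen during fitting.

```python
import re
from collections import Counter

WORD_RE = re.compile(r"[a-z0-9']+")  # simplified notion of a "word"

def fit_word_index(texts):
    """Map each word to an integer; the most frequent word gets index 1."""
    counts = Counter(w for text in texts for w in WORD_RE.findall(text.lower()))
    # Sort by descending count, breaking ties alphabetically for determinism.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {word: i for i, (word, _) in enumerate(ordered, start=1)}

def texts_to_sequences(texts, word_index):
    """Replace known words by their index; unknown words are dropped."""
    return [
        [word_index[w] for w in WORD_RE.findall(text.lower()) if w in word_index]
        for text in texts
    ]

# Hypothetical item names.
corpus = ["New with tags", "new phone case", "Brand new"]
word_index = fit_word_index(corpus)
sequences = texts_to_sequences(["new case unseen"], word_index)
print(word_index)
print(sequences)
```

Because the vocabulary is built from the fitting texts only, any word that first appears at transform time (here `"unseen"`) contributes nothing to the sequence.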
Splitting the dataset
from sklearn.model_selection import train_test_split
dtrain, dvalid = train_test_split(train, random_state=123, train_size=0.99)
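As a quick check of what a 99/1 split looks like, the sketch below applies the same `train_test_split` call to a toy frame. The frame `split_demo` and its `price` column are made up for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical frame with 1000 rows.
split_demo = pd.DataFrame({"price": range(1000)})

# Same arguments as above: 99% for training, the remainder for validation.
dtrain_demo, dvalid_demo = train_test_split(split_demo, random_state=123, train_size=0.99)

print(len(dtrain_demo), len(dvalid_demo))
# The two parts are disjoint and together cover every row.
print(set(dtrain_demo.index).isdisjoint(dvalid_demo.index))
```

Fixing `random_state` makes the shuffle reproducible, so the same rows land in the validation set on every run.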