# coding: utf8
import pandas as pd
import jieba
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
def text2words(text):
    # Segment the text with jieba, lowercase each token, and wrap it in begin/end markers.
    return "#BEGIN " + " ".join(word.lower() for word in jieba.cut(text)) + " #END"
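# Read the tab-separated training file (no header row); column 0 holds the raw text.
train = pd.read_csv("input/train.csv", header=None, sep="\t")
train[0] = train[0].apply(text2words)
# Build the word-to-id vocabulary; num_words=None keeps every word.
# Note: with Tokenizer's default filters and lower=True, "#BEGIN"/"#END" are indexed as "begin"/"end".
tokenizer = Tokenizer(num_words=None)
tokenizer.fit_on_texts(train[0].tolist())
sequences = tokenizer.texts_to_sequences(train[0])
# Pad or truncate every id list to exactly 15 entries (by default at the front).
train_features = pad_sequences(sequences, maxlen=15)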
001 Algorithms in Practice | Converting Text to a Word ID List
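To make the output of the script above easier to picture, here is a minimal pure-Python sketch of the same two steps: building a word-to-id vocabulary and padding each id list to a fixed length. It is an illustration, not the Keras implementation. The helper names `fit_word_index` and `texts_to_padded_ids` are hypothetical, and the sketch assumes the text is already space-separated, that ids start at 1 with more frequent words first, and that padding and truncation both happen at the front.

```python
from collections import Counter

def fit_word_index(texts):
    """Map each word to an integer id, most frequent word first, starting at 1."""
    counts = Counter(word for text in texts for word in text.split())
    # Sort by descending count; sorted() is stable, so ties keep first-seen order.
    ranked = sorted(counts.items(), key=lambda item: -item[1])
    return {word: idx for idx, (word, _) in enumerate(ranked, start=1)}

def texts_to_padded_ids(texts, word_index, maxlen):
    """Convert texts to id lists, keep the last maxlen ids, and left-pad with 0."""
    rows = []
    for text in texts:
        # Words missing from the vocabulary are skipped (an assumption of this sketch).
        ids = [word_index[w] for w in text.split() if w in word_index]
        ids = ids[-maxlen:]  # drop ids from the front when the list is too long
        rows.append([0] * (maxlen - len(ids)) + ids)
    return rows

# Toy, already-segmented input standing in for the jieba output.
texts = ["begin a b end", "begin b end"]
word_index = fit_word_index(texts)
features = texts_to_padded_ids(texts, word_index, maxlen=5)
print(word_index)
print(features)
```

Id 0 is reserved for padding here, which is why the vocabulary starts at 1; a downstream embedding layer would then need one more row than the vocabulary size.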