# coding: utf8
import pandas as pd
import jieba
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
def text2words(text):
    # Segment the text with jieba, lowercase each word, and wrap it in begin/end markers.
    return "#BEGIN " + " ".join([word.lower() for word in jieba.cut(text)]) + " #END"
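# Load the tab-separated training file (no header row); column 0 is the text column used below.
train = pd.read_csv("input/train.csv", header=None, sep="\t")
train[0] = train[0].apply(text2words)
# Build the vocabulary. With the default filters and lower=True, Keras strips "#"
# and lowercases, so the markers are indexed as "begin" and "end".
tokenizer = Tokenizer(num_words=None)
tokenizer.fit_on_texts(train[0].tolist())
# Map each text to its list of word ids.
sequences = tokenizer.texts_to_sequences(train[0])
# Pad or truncate every sequence to 15 ids (by default both happen at the front).
train_features = pad_sequences(sequences, maxlen=15)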
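To make the transformation concrete, here is a minimal dependency-free sketch of what the pipeline above does to already-segmented text: rank words by frequency, assign ids starting at 1, then truncate and zero-pad at the front. The helper names `build_word_index` and `to_padded_ids` and the sample sentences are hypothetical, introduced only for illustration. This is not the Keras implementation, only an approximation of its default behavior.

```python
from collections import Counter


def build_word_index(texts):
    """Assign ids starting at 1, most frequent word first (0 is reserved for padding)."""
    counts = Counter(word for text in texts for word in text.split())
    # sorted() is stable, so words with equal counts keep their first-seen order.
    ranked = sorted(counts, key=lambda w: -counts[w])
    return {word: i for i, word in enumerate(ranked, start=1)}


def to_padded_ids(text, word_index, maxlen):
    """Map words to ids, keep the last `maxlen` ids, and left-pad with zeros."""
    ids = [word_index[w] for w in text.split() if w in word_index]
    ids = ids[-maxlen:]                       # truncate at the front
    return [0] * (maxlen - len(ids)) + ids    # pad at the front


# Hypothetical sample input: texts that are already segmented and wrapped in markers.
texts = ["begin i love beijing end", "begin i love you end"]
word_index = build_word_index(texts)
features = [to_padded_ids(t, word_index, maxlen=6) for t in texts]
print(features)  # [[0, 1, 2, 3, 5, 4], [0, 1, 2, 3, 6, 4]]
```

Every row has the same length, which is what lets the result be stacked into a single 2-D array for model input.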
001 Algorithms in Practice | Converting Text to a Word ID List