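Converting strings into one-hot arrays:
This is really two steps:
step 1: label encode, which maps each string to an integer
step 2: one-hot encode, which maps each integer to a one-hot array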
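To make the two steps concrete, here is a minimal pure-Python sketch. The helper names `label_encode` and `one_hot` and the sample `colors` list are made up for illustration; they are not from any library.

```python
def label_encode(values):
    """Step 1: map each distinct string to an integer (classes in sorted order)."""
    classes = sorted(set(values))
    index = {c: i for i, c in enumerate(classes)}
    return [index[v] for v in values], classes


def one_hot(codes, num_classes):
    """Step 2: turn each integer code into a 0/1 row of length num_classes."""
    return [[1 if j == code else 0 for j in range(num_classes)] for code in codes]


colors = ["red", "green", "blue", "green"]  # hypothetical sample data
codes, classes = label_encode(colors)       # classes is ['blue', 'green', 'red']
encoded = one_hot(codes, len(classes))      # one row per input string
```

The library routes listed next do the same thing, with extra handling for unseen categories and sparse output.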
Approaches:
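* TensorFlow's built-in tf.one_hot can be used, but it only accepts numeric input, so the strings still have to be converted to integers first
* sklearn's OneHotEncoder can be used: first apply LabelEncoder(), then convert X to a 2-D array (recent sklearn versions also accept a dataframe)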
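from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
le = LabelEncoder()
dfle = df.copy()  # copy, so the original df is not modified in place
dfle.column1 = le.fit_transform(dfle.column1)
X = dfle[['column1','column2']].values # two-dimensional array of features
y = dfle.column3
# OneHotEncoder(categorical_features=[0]) was removed in scikit-learn 0.22.
# ColumnTransformer does the same job: only column 0 is one-hot encoded,
# the remaining columns of X pass through unchanged.
ct = ColumnTransformer([('ohe', OneHotEncoder(), [0])], remainder='passthrough', sparse_threshold=0)
X = ct.fit_transform(X)  # sparse_threshold=0 gives a dense array; keep the result
# drop a column to prevent dummy variable trap
X = X[:,1:]
model.fit(X,y)
model.predict([...])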
* pandas' pd.get_dummies(dataframe.column) returns the one-hot columns directly
GET DUMMY
vendor_dummy = pd.get_dummies(train_df.vendor) # get onehot with pandas
name_dummy = pd.get_dummies(train_df.name)
# gonna drop one column from dummy to prevent dummy variable trap
vendor_dummy = vendor_dummy.drop([vendor_dummy.columns[-1]], axis="columns")
name_dummy = name_dummy.drop([name_dummy.columns[-1]], axis="columns")
merged = pd.concat([train_df.iloc[:,2:8], vendor_dummy, name_dummy], axis="columns")
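As a possible shortcut for the manual drop, `pd.get_dummies` has a `drop_first` parameter. A small sketch follows; the `train_df` contents are invented sample data. Note that `drop_first=True` removes the first category rather than the last one, which avoids the dummy variable trap equally well.

```python
import pandas as pd

# Hypothetical stand-in for train_df, with one categorical and one numeric column
train_df = pd.DataFrame({
    "vendor": ["a", "b", "c", "a"],
    "price": [1.0, 2.0, 3.0, 4.0],
})

# One-hot encode and drop the first category ("a") in a single call
vendor_dummy = pd.get_dummies(train_df["vendor"], drop_first=True)

# Join the numeric column and the dummy columns side by side
merged = pd.concat([train_df[["price"]], vendor_dummy], axis="columns")
```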
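pandas: dataframe.iloc[:, 0:7] selects columns starting at 0, but the slice is half-open, [0,7), so column 7 is NOT included! (With the slice before the comma, as in iloc[0:7,], it selects rows, not columns.)
pandas: concatenate with pd.concat([df.column, df.column], axis="columns"). Important: pass axis to choose columns or rows; the default, axis=0, stacks along rows.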
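A quick check of both points, using a small made-up dataframe (the column names `a` to `h` are arbitrary):

```python
import pandas as pd

# 3 rows x 8 columns of dummy numbers
df = pd.DataFrame(
    [[i * 10 + j for j in range(8)] for i in range(3)],
    columns=list("abcdefgh"),
)

cols = df.iloc[:, 0:7]  # columns at positions 0..6; position 7 ("h") is excluded
rows = df.iloc[0:2]     # slice before the comma selects rows 0..1

side_by_side = pd.concat([df["a"], df["b"]], axis="columns")  # 3 rows, 2 columns
stacked = pd.concat([df["a"], df["b"]])  # default axis=0: one Series of length 6
```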
Scoring in sklearn: model.score(X,y) (mean accuracy for classifiers, R^2 for regressors)
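Putting the sklearn route and model.score together, a runnable sketch might look like the following. It assumes a scikit-learn version where OneHotEncoder accepts string columns and has the `drop` parameter (0.21 or later, as far as I know). The toy data and the choice of LinearRegression are assumptions for illustration only.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: column3 = 10 * column2 + an offset that depends on column1
df = pd.DataFrame({
    "column1": ["a", "b", "c", "a", "b", "c"],
    "column2": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "column3": [10.0, 21.0, 32.0, 40.0, 51.0, 62.0],
})

# One-hot encode only column1; drop="first" removes one dummy column
# (dummy variable trap), and column2 passes through unchanged.
ct = ColumnTransformer(
    [("ohe", OneHotEncoder(drop="first"), ["column1"])],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array
)
X = ct.fit_transform(df[["column1", "column2"]])
y = df["column3"]

model = LinearRegression().fit(X, y)
r2 = model.score(X, y)  # R^2 on the training data
```

Here the strings go straight into OneHotEncoder, so the separate LabelEncoder step is not needed for the features.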