Converting strings into one-hot arrays:

This is really two steps:
step 1: label encode, which maps each string to an integer
step 2: one-hot encode, which maps each integer to a one-hot array
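
The two steps can be sketched with plain NumPy, without any encoder class. The sample labels are made up for illustration; `np.unique(..., return_inverse=True)` stands in for the label encoder, and indexing an identity matrix does the one-hot step.

```python
import numpy as np

# Made-up sample data, for illustration only
labels = np.array(["cat", "dog", "bird", "dog"])

# step 1: label encode (np.unique returns the classes in sorted order,
# plus the integer code of each original element)
classes, codes = np.unique(labels, return_inverse=True)

# step 2: one-hot encode by picking rows of an identity matrix
onehot = np.eye(len(classes), dtype=int)[codes]

print(classes)  # the sorted class names
print(codes)    # one integer per input string
print(onehot)   # one row per input string, one column per class
```

Each row of `onehot` has a single 1, in the column of that row's class.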

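How to do it:
* Use TensorFlow's built-in one-hot encoding. tf.one_hot only accepts numeric input, so the strings still have to be converted to integers first.
* Use sklearn's OneHotEncoder, after running LabelEncoder() first. Here X is converted to a 2-D array rather than kept as a DataFrame.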

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

dfle = df.copy()  # copy, so the original df is not overwritten
dfle.column1 = le.fit_transform(dfle.column1)

X = dfle[['column1','column2']].values  # X must be a 2-D array, not a DataFrame
y = dfle.column3

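from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
# OneHotEncoder(categorical_features=[0]) was removed in scikit-learn 0.22;
# a ColumnTransformer limited to column 0 keeps the other columns of X from being one-hot encoded
ct = ColumnTransformer([('onehot', OneHotEncoder(), [0])], remainder='passthrough', sparse_threshold=0)
X = ct.fit_transform(X)  # dense array: one-hot columns first, then the passthrough columns

# drop the first one-hot column to avoid the dummy variable trap
X = X[:,1:]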

model.fit(X,y)

model.predict([...])

* Use pandas: pd.get_dummies(dataframe.column) returns the one-hot columns directly

GET DUMMY

vendor_dummy = pd.get_dummies(train_df.vendor) # get onehot with pandas
name_dummy = pd.get_dummies(train_df.name)

# gonna drop one column from dummy to prevent dummy variable trap
vendor_dummy = vendor_dummy.drop([vendor_dummy.columns[-1]], axis="columns")
name_dummy = name_dummy.drop([name_dummy.columns[-1]], axis="columns")

merged = pd.concat([train_df.iloc[:,2:8], vendor_dummy, name_dummy], axis="columns")
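
As a shorter variant, `pd.get_dummies` has a `drop_first` parameter that drops one dummy column for you. A minimal sketch on a made-up frame (the `train_df` contents and the `price` column here are invented for illustration); note that it drops the first category, whereas the manual code above drops the last one.

```python
import pandas as pd

# Made-up stand-in for train_df, for illustration only
train_df = pd.DataFrame({
    "vendor": ["a", "b", "c", "a"],
    "price": [1.0, 2.0, 3.0, 4.0],
})

# drop_first=True removes the dummy column of the first category ("a")
vendor_dummy = pd.get_dummies(train_df.vendor, drop_first=True)

# join the numeric column and the dummy columns side by side
merged = pd.concat([train_df[["price"]], vendor_dummy], axis="columns")
print(merged)
```

Either choice avoids the dummy variable trap; which category is dropped only changes the baseline that the remaining columns are compared against.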


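pandas dataframe.iloc[:, 0:7] looks like it selects column 0 through column 7, but the slice is actually half-open, [0, 7), so column 7 is not included!!! (The column slice goes in the second position; dataframe.iloc[0:7,] slices rows.)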
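
A quick check of the half-open behaviour, on a made-up one-row frame with columns named `c0` to `c9` (names invented for illustration). In `iloc[rows, columns]` the column slice is the second position.

```python
import pandas as pd

# Made-up frame: one row, ten columns c0..c9
df = pd.DataFrame([list(range(10))], columns=[f"c{i}" for i in range(10)])

cols = df.iloc[:, 0:7]   # column slice: c0..c6, c7 is excluded
rows = df.iloc[0:7, ]    # row slice: only one row exists, all columns kept

print(list(cols.columns))
print(rows.shape)
```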

To join in pandas, use pd.concat([df.column, df.column], axis="columns"). Important: pass axis to choose columns or rows; the default, axis=0, stacks along rows.
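
The effect of `axis` is easy to see on two small made-up Series (the names `a` and `b` are invented for illustration):

```python
import pandas as pd

a = pd.Series([1, 2], name="a")
b = pd.Series([3, 4], name="b")

# axis="columns": place the Series side by side as two columns
side_by_side = pd.concat([a, b], axis="columns")

# default axis=0: stack one after the other into a single longer Series
stacked = pd.concat([a, b])

print(side_by_side.shape)
print(stacked.shape)
```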

Scoring in sklearn: model.score(X, y) (R² for regressors, mean accuracy for classifiers)
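
A minimal sketch of `score`, assuming the model is a `LinearRegression` (the note does not say which model is used) and using made-up data that lies exactly on a line:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: y = 2x + 1, no noise
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 2 * X.ravel() + 1

model = LinearRegression().fit(X, y)

# For a regressor, score returns the coefficient of determination R^2
r2 = model.score(X, y)
print(r2)
```

Scoring on the training data, as here, only shows how well the model fits what it has already seen; a held-out set gives a more honest number.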
