Resources:
Follow wiki for Coral Service
|   | A | B | C | D |
|---|---|---|---|---|
| 0 | 1 | 2 | 3 | 4 |
| 1 | 5 | 6 | NaN | 8 |
| 2 | 0 | 11 | 12 | NaN |
df.isnull().sum()
Use df.values to access the underlying NumPy array.
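As a minimal sketch, the table above can be rebuilt from CSV text to see what the per-column count and the NumPy view look like. The names `csv_data`, `missing_per_column` and `values` are only illustrative.

```python
import io

import pandas as pd

# Same values as the table above; empty fields are parsed as NaN.
csv_data = """A,B,C,D
1,2,3,4
5,6,,8
0,11,12,
"""
df = pd.read_csv(io.StringIO(csv_data))

# Number of missing values in each column.
missing_per_column = df.isnull().sum()
print(missing_per_column)

# The underlying NumPy array (what scikit-learn estimators work with).
values = df.values
print(type(values), values.shape)
```

Columns C and D each report one missing value, and `values` is a 3 x 4 array.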
df.dropna()
Drop rows or columns that contain NaN values. By default (axis=0), rows are dropped.
# only drop rows where all columns are NaN
df.dropna(how='all')
# drop rows that have fewer than 4 non-NaN values
df.dropna(thresh=4)
# only drop rows where NaN appears in specific columns (here: 'C')
df.dropna(subset=['C'])
# drop columns that contain at least one NaN
df.dropna(axis=1)
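To make the difference between these calls concrete, here is a small sketch that applies each variant to the example table and records what survives. The result names (`rows_any`, `cols_any`, and so on) are made up for this illustration.

```python
import numpy as np
import pandas as pd

# The example table from above.
df = pd.DataFrame({
    "A": [1.0, 5.0, 0.0],
    "B": [2.0, 6.0, 11.0],
    "C": [3.0, np.nan, 12.0],
    "D": [4.0, 8.0, np.nan],
})

rows_any = df.dropna()                  # drop rows with any NaN
cols_any = df.dropna(axis=1)            # drop columns with any NaN
rows_all = df.dropna(how="all")         # drop rows that are entirely NaN
rows_thresh = df.dropna(thresh=4)       # keep rows with at least 4 non-NaN values
rows_subset_c = df.dropna(subset=["C"]) # only look at column C

print(list(rows_any.index))       # surviving row labels
print(list(cols_any.columns))     # surviving column labels
print(len(rows_all))
print(list(rows_thresh.index))
print(list(rows_subset_c.index))
```

On this table, only row 0 survives `dropna()`, only columns A and B survive `dropna(axis=1)`, and `how='all'` drops nothing because no row is entirely NaN.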
Dropping rows vs. dropping columns:
Dropping rows may lead to overfitting, since it throws away valuable training samples, while dropping columns may lead to underfitting, since it reduces the number of features.
Simply dropping NaN values may discard too much data, so we can instead estimate the missing values from the other training samples.
The simplest approach is mean imputation: replace each missing value with the mean of its feature column. scikit-learn's Imputer class (sklearn.preprocessing.Imputer) did this; it was removed in scikit-learn 0.22, and sklearn.impute.SimpleImputer is its replacement.
import numpy as np
from sklearn.impute import SimpleImputer
# replace each NaN with the mean of its column
imr = SimpleImputer(missing_values=np.nan, strategy='mean')
imr = imr.fit(df.values)
imputed_data = imr.transform(df.values)
imputed_data
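Mean imputation on the example table can be cross-checked by hand: column C has the observed values 3 and 12, so its NaN becomes (3 + 12) / 2 = 7.5, and column D has 4 and 8, so its NaN becomes 6. The sketch below reproduces this with plain pandas (`fillna` with the column means); this is a pandas equivalent for illustration, not the scikit-learn call itself.

```python
import numpy as np
import pandas as pd

# The example table from above.
df = pd.DataFrame({
    "A": [1.0, 5.0, 0.0],
    "B": [2.0, 6.0, 11.0],
    "C": [3.0, np.nan, 12.0],
    "D": [4.0, 8.0, np.nan],
})

# df.mean() skips NaN by default, so each mean uses only observed values.
column_means = df.mean()

# Fill each NaN with the mean of its own column.
filled = df.fillna(column_means)
print(filled)
```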
Options:
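- axis: 0 computes the statistic per column, 1 per row (legacy Imputer only; SimpleImputer always works per column)
- mean
- most_frequent: mostly used for categorical feature values
- median
- mice: multivariate imputation via chained equations. It assumes the data are missing at random (MAR), meaning the probability that a value is missing depends only on other observed values, so the missing value can be estimated by prediction. It is a parametric method that imputes each variable with missing values using its own regression (or other) model.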
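The options above can be compared on small made-up inputs. The sketch below uses SimpleImputer for mean, median and most_frequent, and scikit-learn's IterativeImputer as a stand-in for MICE. Treat that last part as an assumption: IterativeImputer is inspired by MICE but is still marked experimental and returns a single imputation rather than several. All arrays here are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required to import IterativeImputer)
from sklearn.impute import IterativeImputer, SimpleImputer

# Hypothetical numeric feature with one outlier and one missing value.
numeric = np.array([[1.0], [2.0], [100.0], [np.nan]])
mean_filled = SimpleImputer(strategy="mean").fit_transform(numeric)
median_filled = SimpleImputer(strategy="median").fit_transform(numeric)

# Hypothetical categorical feature: most_frequent also works on strings.
colors = pd.DataFrame({"color": ["red", "blue", np.nan, "red"]}, dtype=object)
mode_filled = SimpleImputer(strategy="most_frequent").fit_transform(colors)

# Hypothetical data where the second column is twice the first, so the
# missing entry can be predicted from the other feature.
linear = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, np.nan], [5.0, 10.0]])
mice_filled = IterativeImputer(max_iter=10, random_state=0).fit_transform(linear)

print(mean_filled[3, 0], median_filled[3, 0])  # the mean is pulled up by the outlier
print(mode_filled[2, 0])
print(mice_filled[3, 1])
```

The median is less sensitive to the outlier than the mean, and the regression-based imputer uses the relationship between the two columns instead of a single column statistic.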
The imputer is a transformer, a class used to transform data. Transformers have two essential methods: fit and transform.
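A short sketch of how the two methods divide the work, assuming SimpleImputer and made-up arrays: fit learns the column statistics from the training data, and transform applies those same statistics to any array with the same columns.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical training data and new data with the same two columns.
train = np.array([[1.0, 2.0], [3.0, np.nan], [5.0, 6.0]])
new_data = np.array([[np.nan, np.nan], [7.0, np.nan]])

imputer = SimpleImputer(strategy="mean")

# fit: learn one mean per column from the training data only.
imputer.fit(train)
learned_means = imputer.statistics_

# transform: fill NaN in new data with the means learned during fit.
filled_new = imputer.transform(new_data)

print(learned_means)
print(filled_new)
```

Fitting on the training data and reusing the fitted imputer on new data keeps the fill values consistent between the two sets.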
