We have labels assigned by a human operator for our data, so we can train models to predict those labels directly instead of only using them to check a model afterwards.
Examples of supervised learning include linear regression, support vector machines, neural networks (including deep learning), and more.
One of the simplest supervised learning models for a classification problem like ours is the k-nearest-neighbors classifier (kNN).
kNN labels a sample by majority vote among the labels of its k nearest neighbors in the training data. Let’s try it out!
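To make the idea concrete, here is a minimal from-scratch sketch of the voting rule. It is not how scikit-learn implements kNN; the function name `knn_predict` and the RGB values and labels are made up for illustration.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point.
    distances = [math.dist(point, query) for point in train_points]
    # Indices of the k smallest distances.
    nearest = sorted(range(len(distances)), key=distances.__getitem__)[:k]
    # Majority vote among the labels of those neighbors.
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Made-up RGB values and labels, for illustration only.
train_points = [
    (250, 250, 240),
    (245, 240, 230),
    (200, 170, 60),
    (190, 160, 50),
    (210, 180, 70),
]
train_labels = [0, 0, 2, 2, 2]

label_pale = knn_predict(train_points, train_labels, (248, 246, 236), k=3)
label_dark = knn_predict(train_points, train_labels, (195, 165, 55), k=3)
```

The pale query sits next to the two pale training points, so they outvote the single darker neighbor; the dark query is surrounded by dark points only.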
import pandas as pd

df = pd.read_csv("pellets-visual-classes-rgb.csv", index_col="image").dropna()
df["yellowing index"] = df["yellowing index"].astype(int)
We already know that the size is mostly random, so let’s drop it here.
feature_columns = ["r", "g", "b"]
X = df[feature_columns].values
y = df["yellowing"].values
from sklearn import neighbors

knn = neighbors.KNeighborsClassifier()
knn.fit(X, y)
prediction = knn.predict(X)
prediction
from sklearn import metrics

metrics.accuracy_score(y, prediction)
Quite the improvement from our k-means attempt. Aren’t we forgetting anything though? Yes, we should always standardize the data!
from sklearn import preprocessing

scaler = preprocessing.StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)
knn = neighbors.KNeighborsClassifier()
knn.fit(X_scaled, y)
prediction_scaled = knn.predict(X_scaled)
metrics.accuracy_score(y, prediction_scaled)
The lower score suggests that we were overfitting before standardizing. Still, ~78% is much better than our k-means.
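Why does scaling matter for kNN at all? The classifier relies on distances, and a feature measured in large units can dominate them. The sketch below uses a made-up two-column matrix (not the pellet data) and applies the same per-column rule as `StandardScaler`: subtract the mean, then divide by the standard deviation. The helper `nearest_to_first` is hypothetical.

```python
import numpy as np

# Made-up features: column 0 lives in [0, 1], column 1 in the thousands.
X_demo = np.array([
    [0.10, 1000.0],
    [0.90, 1020.0],
    [0.12, 1200.0],
    [0.50, 3000.0],
    [0.80, 5000.0],
])

# Standardize each column: subtract its mean, divide by its standard deviation.
X_demo_scaled = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)


def nearest_to_first(X):
    """Index of the row closest (Euclidean) to row 0, excluding row 0 itself."""
    distances = np.linalg.norm(X[1:] - X[0], axis=1)
    return int(np.argmin(distances)) + 1


raw_neighbor = nearest_to_first(X_demo)            # column 1 dominates the distance
scaled_neighbor = nearest_to_first(X_demo_scaled)  # both columns count
```

In raw units the nearest neighbor of the first row is row 1, even though the two rows sit at opposite ends of column 0. After scaling it is row 2, which is close in both columns.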
import seaborn as sns

redux = df[["r", "g", "b", "yellowing"]]
redux = redux.assign(knn=prediction_scaled)
sns.pairplot(redux, hue="knn", vars=feature_columns)
How can we stop forgetting to standardize the data? Well, scikit-learn is awesome and has our back. We can create data processing pipelines and keep all the steps of our model in a single object. Pipelines are very robust and may contain custom steps if your data requires them.
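As a rough sketch of what a custom step can look like, `preprocessing.FunctionTransformer` wraps a plain function so it can sit inside a pipeline. The function `rgb_to_chromaticity` and the tiny arrays below are hypothetical and only there for illustration; they are not part of the original analysis.

```python
import numpy as np
from sklearn import neighbors, pipeline, preprocessing


def rgb_to_chromaticity(X):
    """Hypothetical custom step: divide each channel by the row's r + g + b sum."""
    X = np.asarray(X, dtype=float)
    return X / X.sum(axis=1, keepdims=True)


custom_classifier = pipeline.make_pipeline(
    preprocessing.FunctionTransformer(rgb_to_chromaticity),
    preprocessing.StandardScaler(),
    neighbors.KNeighborsClassifier(n_neighbors=1),
)

# Made-up RGB rows and labels, standing in for the real X and y.
X_demo = np.array([
    [250, 250, 240],
    [245, 240, 230],
    [200, 170, 60],
    [190, 160, 50],
])
y_demo = np.array([0, 0, 2, 2])

custom_classifier.fit(X_demo, y_demo)
step_names = list(custom_classifier.named_steps)
demo_prediction = custom_classifier.predict(X_demo)
```

Calling `fit` or `predict` on the pipeline runs every step in order, so the custom transformation and the scaling can no longer be forgotten.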
from sklearn import pipeline

classifier = pipeline.make_pipeline(
    preprocessing.StandardScaler(),
    neighbors.KNeighborsClassifier(),
)
classifier.fit(X, y)
prediction_pipeline = classifier.predict(X)
metrics.accuracy_score(y, prediction_pipeline)
from sklearn import model_selection

split = model_selection.train_test_split(X, y)
X_train, X_test, y_train, y_test = split
X_train.shape, X_test.shape
scores = model_selection.cross_val_score(classifier, X, y)
scores
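The train/test split above is created but never used for scoring, so here is a minimal sketch of how a held-out evaluation could look. It runs on synthetic data from `datasets.make_blobs` as a stand-in for `X` and `y`; the `_demo` names and `random_state=0` are assumptions made only to keep the example self-contained and repeatable.

```python
from sklearn import datasets, metrics, model_selection, neighbors, pipeline, preprocessing

# Synthetic stand-in for X and y: 300 samples, 3 features, 3 classes.
X_demo, y_demo = datasets.make_blobs(
    n_samples=300, n_features=3, centers=3, random_state=0
)

# By default 25% of the rows are held out for testing.
X_train_demo, X_test_demo, y_train_demo, y_test_demo = model_selection.train_test_split(
    X_demo, y_demo, random_state=0
)

demo_classifier = pipeline.make_pipeline(
    preprocessing.StandardScaler(),
    neighbors.KNeighborsClassifier(),
)

# Fit on the training rows only, so neither the scaler nor kNN sees the test rows.
demo_classifier.fit(X_train_demo, y_train_demo)

train_accuracy = metrics.accuracy_score(y_train_demo, demo_classifier.predict(X_train_demo))
test_accuracy = metrics.accuracy_score(y_test_demo, demo_classifier.predict(X_test_demo))
```

Comparing `train_accuracy` with `test_accuracy` gives a rough idea of how optimistic the score on the training data is.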
Our accuracy drops once we evaluate with a train/test split. Why did that happen? The first guess is that our model may be “data hungry.” We just don’t have enough samples in each class to predict them.
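A quick way to check how many samples each class has is to count the labels. The sketch below uses a made-up pandas Series in place of `df["yellowing"]`, so the class names and counts are illustrative only.

```python
import pandas as pd

# Made-up labels standing in for df["yellowing"]; the real counts will differ.
y_demo = pd.Series(
    ["low", "low", "low", "low", "low", "low", "moderate", "moderate", "high"]
)

counts = y_demo.value_counts()                    # samples per class
fractions = y_demo.value_counts(normalize=True)   # share of each class
```

If one class holds most of the samples, plain accuracy can look good even when the rare classes are predicted poorly.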
The data is unbalanced! What can we do next?
Try to balance the current data;
Collect more data and see if the dataset balances itself out;
Choose a technique that is more robust to unbalanced data, like Decision Trees (DT).
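The last option could be sketched as follows. This is only one possible setup, not a result from the pellet data: the unbalanced dataset is synthetic, and `stratify`, `class_weight="balanced"` and the balanced accuracy metric are choices made here for illustration.

```python
import numpy as np
from sklearn import metrics, model_selection, tree

# Synthetic, deliberately unbalanced stand-in for X and y: 90 / 20 / 10 samples.
rng = np.random.default_rng(0)
y_demo = np.repeat([0, 1, 2], [90, 20, 10])
X_demo = rng.normal(loc=y_demo[:, None] * 2.0, scale=1.0, size=(120, 3))

# stratify keeps the class proportions similar in the train and test parts.
X_train_demo, X_test_demo, y_train_demo, y_test_demo = model_selection.train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0
)

# class_weight="balanced" weights each class inversely to its frequency.
dt = tree.DecisionTreeClassifier(class_weight="balanced", random_state=0)
dt.fit(X_train_demo, y_train_demo)

# Balanced accuracy averages recall over classes, so rare classes count as much as common ones.
balanced_score = metrics.balanced_accuracy_score(y_test_demo, dt.predict(X_test_demo))
test_counts = np.bincount(y_test_demo)
```

The same `stratify` argument also works with the kNN pipeline, so the split itself can be made fairer before switching models.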
PS: Check this awesome paper on DTs.