A quick introduction to machine learning

These notebooks are a brief, hands-on introduction to machine learning. We will review some of the nomenclature, principles, and applications from Valentina’s presentation.

What is Machine Learning (ML)?

Caveat: I’m not a Statistician, Mathematician, or ML expert. I only play one online. You can find my work in plays like “How to get by with little to no data” or “Oh gosh, the PI wants some buzz-words in the report.”

What is ML (a personal point of view):

Focus on practical problems
Learn from the data and/or make predictions with it
Middle ground between statistics and optimization techniques
We have fast computers now, right? Let them do the work! (Must see JVP talk on this.)

Oversimplified take: Fit a model to data and use it to make predictions. (This is how scikit-learn designed its API, BTW.)
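To make the "fit, then predict" idea concrete, here is a minimal sketch using scikit-learn's LinearRegression. The data are synthetic and made up for illustration (a noisy straight line), and the variable names are my own; nothing here comes from the pellets dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (made up for illustration): y = 2 * x + 1 plus a little noise.
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(50, 1))  # features are 2D: (n_samples, n_features)
y = 2.0 * X[:, 0] + 1.0 + rng.normal(scale=0.1, size=50)

model = LinearRegression()
model.fit(X, y)  # estimate the model parameters from the data

X_new = np.array([[2.0], [5.0]])
y_pred = model.predict(X_new)  # use the fitted model on new inputs

print(model.coef_, model.intercept_)  # should be close to 2 and 1
```

Most scikit-learn estimators follow this same pattern: create the object, call fit, then call predict (or transform).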

Vocabulary

model: Mathematical equations used to approximate the data.
parameters: Variables that define the model and control its behavior.
labels/classes: Quantity/category that we want to predict.
features: Observations (information) used as predictors of labels/classes.
training: Use features and known labels/classes to fit the model, i.e., estimate its parameters (full circle, right? But why stop now?).
hyper-parameters: Variables that influence the training and the model but are not estimated during training.
unsupervised learning: Extract information and structure from the data without “training”. We will see clustering and Principal Component Analysis (PCA).
supervised learning: Fit a model using data to “train” it for making predictions. Examples: regression, classification, spam detection, recommendation systems. We’ll see k-nearest neighbors (KNN), a classification technique.
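The vocabulary above maps almost one-to-one onto code. The sketch below uses a tiny made-up dataset (not the pellets data) and scikit-learn's KNeighborsClassifier to show where features, labels, hyper-parameters, and training show up. The toy values and variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Made-up toy data: two features per sample, two well separated classes.
features = np.array(
    [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [5.0, 5.1], [5.2, 4.9], [4.8, 5.0]]
)
labels = np.array([0, 0, 0, 1, 1, 1])

# Hyper-parameter: chosen by us before training, not estimated from the data.
knn = KNeighborsClassifier(n_neighbors=3)

# Training: fit the model using the features and the known labels.
knn.fit(features, labels)

# Prediction: assign a class to new, unlabeled observations.
predicted = knn.predict(np.array([[0.1, 0.1], [5.0, 5.0]]))
print(predicted)
```

Changing n_neighbors changes how the model behaves, but its value is never learned from the data; that is what makes it a hyper-parameter.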

Unsupervised: PCA

The dataset we will use consists of Red, Green, Blue (RGB) composites from photos of plastic pellets; these are our features. We also have some extra information on the pellet size, shape, etc.

The labels are the yellowing index. The goal is to predict the yellowing from the pellet image, broken down into its RGB information.

import pandas as pd

df = pd.read_csv("pellets-visual-classes-rgb.csv", index_col="image").dropna()
df["yellowing index"] = df["yellowing index"].astype(int)

df

import matplotlib.pyplot as plt

fig, axes = plt.subplots(figsize=(11, 11), nrows=2, ncols=2)

axes = axes.ravel()

df["erosion"].value_counts().plot.barh(ax=axes[0], title="erosion")
df["color"].value_counts().plot.barh(ax=axes[1], title="color")
df["description"].value_counts().plot.barh(ax=axes[2], title="description")
df["yellowing"].value_counts().plot.barh(ax=axes[3], title="yellowing")

axes[1].yaxis.set_label_position("right")
axes[1].yaxis.tick_right()

axes[3].yaxis.set_label_position("right")
axes[3].yaxis.tick_right()

We will be using only the R, G, B data for now.

RGB = df[["r", "g", "b"]]

import numpy as np
import seaborn

corr = RGB.corr()

# Generate a mask for the upper triangle
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True

fig, ax = plt.subplots()

# Draw the heatmap with the mask and correct aspect ratio
vmax = np.abs(corr.values[~mask]).max()
seaborn.heatmap(
    corr,
    mask=mask,
    cmap=plt.cm.PuOr,
    vmin=-vmax,
    vmax=vmax,
    square=True,
    linecolor="lightgray",
    linewidths=1,
    ax=ax,
)

for k in range(len(corr)):
    ax.text(
        k + 0.5,
        len(corr) - (k + 0.5),
        corr.columns[k],
        ha="center",
        va="center",
        rotation=45,
    )
    for j in range(k + 1, len(corr)):
        s = "{:.3f}".format(corr.values[k, j])
        ax.text(j + 0.5, len(corr) - (k + 0.5), s, ha="center", va="center")
ax.axis("off")

The first step in most ML techniques is to standardize the data. We do not want high-variance features to bias our model.

def z_score(x):
    return (x - x.mean()) / x.std()

zs = RGB.apply(z_score).T

zs.std(axis=1)  # Should be 1

zs.mean(axis=1)  # Should be zero
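As a cross-check on the z-score above, here is a small sketch comparing the pandas version with scikit-learn's StandardScaler. The RGB-like values are random numbers made up for illustration, not the pellets data. The one thing to watch for is that pandas' std() defaults to the sample standard deviation (ddof=1), while StandardScaler uses the population one (ddof=0), so the two results differ by a constant factor.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up RGB-like values standing in for the pellets data.
rng = np.random.default_rng(0)
rgb = pd.DataFrame(
    rng.integers(0, 256, size=(20, 3)), columns=["r", "g", "b"]
).astype(float)

# pandas .std() defaults to the sample standard deviation (ddof=1) ...
z_pandas = (rgb - rgb.mean()) / rgb.std()

# ... while StandardScaler uses the population standard deviation (ddof=0).
z_sklearn = pd.DataFrame(StandardScaler().fit_transform(rgb), columns=rgb.columns)

# The two results differ only by the constant factor sqrt(n / (n - 1)).
n = len(rgb)
same = np.allclose(z_sklearn.to_numpy(), z_pandas.to_numpy() * np.sqrt(n / (n - 1)))
print(same)
```

For a dataset with many samples the difference is negligible, so either approach should work for PCA.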

from sklearn.decomposition import PCA

pca = PCA(n_components=None)
pca.fit(zs)

The pca object, or fitted model, was designed before pandas existed and is based on numpy arrays. We can do better nowadays and add meaningful labels to it.

loadings = pd.DataFrame(pca.components_.T)
loadings.index = ["PC %s" % pc for pc in loadings.index + 1]
loadings.columns = ["TS %s" % pc for pc in loadings.columns + 1]
loadings

PCs = np.dot(loadings.values.T, RGB)

line = {"linewidth": 1, "linestyle": "--", "color": "k"}
marker = {
    "linestyle": "none",
    "marker": "o",
    "markersize": 7,
    "color": "blue",
    "alpha": 0.5,
}


fig, ax = plt.subplots(figsize=(7, 2.75))
ax.plot(PCs[0], PCs[1], label="Scores", **marker)

ax.set_xlabel("PC1")
ax.set_ylabel("PC2")

text = [ax.text(x, y, t) for x, y, t in zip(PCs[0], PCs[1] + 0.5, RGB.columns)]

perc = pca.explained_variance_ratio_ * 100

perc = pd.DataFrame(
    perc,
    columns=["Percentage explained ratio"],
    index=["PC %s" % pc for pc in np.arange(len(perc)) + 1],
)
ax = perc.plot(kind="bar")
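The explained variance ratio is also handy for deciding how many components to keep. The sketch below uses made-up data (three correlated channels driven by one hidden factor) rather than the pellets data, and the 95% threshold is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Made-up data: three correlated channels driven mostly by one hidden factor.
rng = np.random.default_rng(1)
hidden = rng.normal(size=200)
X = np.column_stack(
    [
        hidden + 0.1 * rng.normal(size=200),
        hidden + 0.1 * rng.normal(size=200),
        hidden + 0.5 * rng.normal(size=200),
    ]
)

pca_demo = PCA(n_components=None).fit(X)
ratio = pca_demo.explained_variance_ratio_  # sorted largest to smallest, sums to 1
cumulative = np.cumsum(ratio)

# Smallest number of components explaining at least 95% of the variance
# (the 95% threshold is an arbitrary choice for this sketch).
n_keep = int(np.searchsorted(cumulative, 0.95) + 1)
print(ratio, n_keep)
```

When the first one or two components carry most of the variance, a 2D scatter plot of the scores is usually a fair summary of the data.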

The non-projected loadings plot can help us see if the data has some sort of aggregation that we can leverage.

common = {"linestyle": "none", "markersize": 7, "alpha": 0.5}

markers = {
    0: {"color": "black", "marker": "o", "label": "no yellowing"},
    1: {"color": "red", "marker": "^", "label": "low"},
    2: {"color": "blue", "marker": "*", "label": "moderate"},
    3: {"color": "khaki", "marker": "s", "label": "high"},
    4: {"color": "darkgoldenrod", "marker": "d", "label": "very high"},
}

fig, ax = plt.subplots(figsize=(7, 7))
for x, y, idx in zip(loadings.iloc[:, 0], loadings.iloc[:, 1], df["yellowing index"]):
    ax.plot(x, y, **common, **markers.get(idx))

ax.set_xlabel("non-projected PC1")
ax.set_ylabel("non-projected PC2")
ax.axis([-1, 1, -1, 1])
ax.axis([-0.25, 0.25, -0.4, 0.4])

# Trick to remove duplicate labels from the for-loop.
handles, labels = ax.get_legend_handles_labels()
by_label = dict(zip(labels, handles))
ax.legend(by_label.values(), by_label.keys())

Summary

PCA is probably the most robust and easiest to perform unsupervised ML technique (it has been a common technique in ocean sciences since before the ML hype);
We learned that a single RGB value does not have enough predictive power to be used alone; we’ll need at least a combination of Red and Green;
The loadings plot shows that the moderate and low yellowing classes overlap, which can be troublesome when using this model for predictions.