# A quick introduction to machine learning

These notebooks are a brief, hands-on introduction to machine learning. We will revisit some of the nomenclature, principles, and applications from Valentina’s presentation.

## What is Machine Learning (ML)?

**Caveat:** I’m not a statistician, mathematician, or ML expert. I only play one online. You can find my work in plays like “How to get by with little to no data” or “Oh gosh, the PI wants some buzz-words in the report.”

What is ML (a personal point of view):

- Focus on practical problems.
- Learn from the data and/or make predictions with it.
- A middle ground between statistics and optimization techniques.
- We have fast computers now, right? Let them do the work! (There is a must-see JVP talk on this.)

**Oversimplified take:** Fit a model to data and use it to make predictions. (This is how scikit-learn designed its API, BTW.)
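
That fit-then-predict pattern is easy to see in scikit-learn. Here is a minimal sketch with made-up toy data (the numbers are purely illustrative):

```
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: y ≈ 2x + 1 plus some noise (made up for illustration).
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(50, 1))
y = 2 * X.ravel() + 1 + rng.normal(scale=0.5, size=50)

model = LinearRegression()  # choose a model
model.fit(X, y)             # fit it to the data
model.predict([[5.0]])      # use it to make predictions (≈ 11)
```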

## Vocabulary

- **model:** Mathematical equations used to approximate the data.
- **parameters:** Variables that define the model and control its behavior.
- **labels/classes:** The quantity/category that we want to predict.
- **features:** Observations (information) used as predictors of the labels/classes.
- **training:** Use **features** and known **labels/classes** to fit the **model** and estimate its **parameters** (full circle, right? But why stop now?).
- **hyper-parameters:** Variables that influence the training and the model but are not estimated during training.
- **unsupervised learning:** Extract information and structure from the data without “training”. We will see clustering and Principal Component Analysis (PCA).
- **supervised learning:** Fit a model using data to “train” it for making predictions. Examples: regression, classification, spam detection, recommendation systems. We’ll see KNN, a classification technique (sketched below).
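
As a tiny, hypothetical preview of that supervised vocabulary in action (the features and labels below are made up): `n_neighbors` is a hyper-parameter, `fit` is the training step, and `predict` uses the trained model on a new observation.

```
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical features (two measurements) and their known class labels.
features = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
labels = [0, 0, 0, 1, 1, 1]

knn = KNeighborsClassifier(n_neighbors=3)  # hyper-parameter: 3 neighbors
knn.fit(features, labels)                  # training
knn.predict([[4, 4]])                      # predicts class 1 for this point
```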

## Unsupervised: PCA

The dataset we will use consists of Red, Green, and Blue (**features**) composites from photos of plastic pellets. We also have some extra information on the pellet size, shape, etc.

The **labels** are the yellowing index. The goal is to predict the yellowing based on the pellet images, broken down into their RGB information.

```
import pandas as pd

# Load the data, using the image file name as the index and dropping rows
# with missing values.
df = pd.read_csv("pellets-visual-classes-rgb.csv", index_col="image").dropna()
df["yellowing index"] = df["yellowing index"].astype(int)
df
```

```
import matplotlib.pyplot as plt

# Quick look at the categorical metadata that accompanies the RGB values.
fig, axes = plt.subplots(figsize=(11, 11), nrows=2, ncols=2)
axes = axes.ravel()

df["erosion"].value_counts().plot.barh(ax=axes[0], title="erosion")
df["color"].value_counts().plot.barh(ax=axes[1], title="color")
df["description"].value_counts().plot.barh(ax=axes[2], title="description")
df["yellowing"].value_counts().plot.barh(ax=axes[3], title="yellowing")

# Put the right-hand panels' ticks and labels on the right to avoid overlap.
axes[1].yaxis.set_label_position("right")
axes[1].yaxis.tick_right()
axes[3].yaxis.set_label_position("right")
axes[3].yaxis.tick_right()
```

We will be using only the R, G, B data for now.

```
RGB = df[["r", "g", "b"]]
```

```
import numpy as np
import seaborn

corr = RGB.corr()

# Generate a mask for the upper triangle
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True

fig, ax = plt.subplots()

# Draw the heatmap with the mask and correct aspect ratio
vmax = np.abs(corr.values[~mask]).max()
seaborn.heatmap(
    corr,
    mask=mask,
    cmap=plt.cm.PuOr,
    vmin=-vmax,
    vmax=vmax,
    square=True,
    linecolor="lightgray",
    linewidths=1,
    ax=ax,
)
for k in range(len(corr)):
    ax.text(
        k + 0.5,
        len(corr) - (k + 0.5),
        corr.columns[k],
        ha="center",
        va="center",
        rotation=45,
    )
    for j in range(k + 1, len(corr)):
        s = "{:.3f}".format(corr.values[k, j])
        ax.text(j + 0.5, len(corr) - (k + 0.5), s, ha="center", va="center")
ax.axis("off")
```

The first step in most ML techniques is to standardize the data: we do not want features with high variance to dominate the model just because of their scale.
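
In symbols, this is the classic z-score: for each feature $x$,

$$ z = \frac{x - \bar{x}}{s}, $$

where $\bar{x}$ is the feature’s mean and $s$ its standard deviation, so every feature ends up with zero mean and unit standard deviation.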

```
def z_score(x):
    return (x - x.mean()) / x.std()


zs = RGB.apply(z_score).T
zs.std(axis=1)  # Should be 1
```

```
zs.mean(axis=1)  # Should be zero
```

```
from sklearn.decomposition import PCA

# n_components=None keeps all the principal components.
pca = PCA(n_components=None)
pca.fit(zs)
```

The pca object, or fitted model, comes from an API designed before pandas existed, so it is based on plain NumPy arrays. We can do better nowadays and add meaningful labels to it.

```
loadings = pd.DataFrame(pca.components_.T)
loadings.index = ["PC %s" % pc for pc in loadings.index + 1]
loadings.columns = ["TS %s" % pc for pc in loadings.columns + 1]
loadings
```

```
# Project the RGB data onto the principal components.
PCs = np.dot(loadings.values.T, RGB)
```

```
line = {"linewidth": 1, "linestyle": "--", "color": "k"}
marker = {
    "linestyle": "none",
    "marker": "o",
    "markersize": 7,
    "color": "blue",
    "alpha": 0.5,
}
fig, ax = plt.subplots(figsize=(7, 2.75))
ax.plot(PCs[0], PCs[1], label="Scores", **marker)
ax.set_xlabel("PC1")
ax.set_ylabel("PC2")
text = [ax.text(x, y, t) for x, y, t in zip(PCs[0], PCs[1] + 0.5, RGB.columns)]
```

```
perc = pca.explained_variance_ratio_ * 100
perc = pd.DataFrame(
    perc,
    columns=["Percentage explained ratio"],
    index=["PC %s" % pc for pc in np.arange(len(perc)) + 1],
)
ax = perc.plot(kind="bar")
```
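
A quick follow-up check, using the `perc` DataFrame we just built, is the cumulative sum of those percentages, which shows how much of the total variance the first few components retain together:

```
perc.cumsum()
```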

The non-projected loadings plot can help us see whether the data show some sort of grouping that we can leverage.

```
common = {"linestyle": "none", "markersize": 7, "alpha": 0.5}
markers = {
    0: {"color": "black", "marker": "o", "label": "no yellowing"},
    1: {"color": "red", "marker": "^", "label": "low"},
    2: {"color": "blue", "marker": "*", "label": "moderate"},
    3: {"color": "khaki", "marker": "s", "label": "high"},
    4: {"color": "darkgoldenrod", "marker": "d", "label": "very high"},
}

fig, ax = plt.subplots(figsize=(7, 7))
for x, y, idx in zip(loadings.iloc[:, 0], loadings.iloc[:, 1], df["yellowing index"]):
    ax.plot(x, y, **common, **markers.get(idx))

ax.set_xlabel("non-projected PC1")
ax.set_ylabel("non-projected PC2")
ax.axis([-0.25, 0.25, -0.4, 0.4])  # Zoom in on the loadings cluster.

# Trick to remove duplicate labels from the for-loop.
handles, labels = ax.get_legend_handles_labels()
by_label = dict(zip(labels, handles))
ax.legend(by_label.values(), by_label.keys())
```

## Summary

- PCA is probably the most robust, and easiest to perform, unsupervised ML technique (it has been a common technique in ocean sciences since before the ML hype);
- We learned that a single RGB value does not have enough predictive power to be used alone; we’ll need at least a combination of Red and Green;
- The loadings plot shows that the moderate and the low yellowing classes have some overlap, which can be troublesome when using this model for predictions.