{ "cells": [ { "cell_type": "markdown", "id": "ea0786c2", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# A quick introduction to machine learning\n", "\n", "\n", "- These notebooks are a brief, hands-on introduction to machine learning.\n", "- We will review some of the nomenclature, principles, and applications from [Valentina's presentation](https://github.com/oceanhackweek/ohw-tutorials/tree/OHW22/01-Tue/01-machine-learning-intro)." ] }, { "cell_type": "markdown", "id": "2a4c4301", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## ML will solve all of our problems, right?\n", "\n", "![](cow.jpg)" ] }, { "cell_type": "markdown", "id": "987fda81", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## What is Machine Learning (ML)?\n", "\n", "**Caveat:** I'm not a statistician, mathematician, or ML expert. I only play one online. You can find my work in movies like \"How to get by with little to no data\", \"Oh gosh, the PI wants some buzz-words in the report\", and \"Fuzzy logic no longer does it, we need ML → AI → DL\".\n", "\n", "What is ML (a personal point of view):\n", "\n", "* Focus on practical problems\n", "* Learn from the data and/or make predictions with it\n", "* Middle ground between statistics and optimization techniques\n", "* We have fast computers now, right? Let them do the work! ([Must-see JVP talk on this](https://www.youtube.com/watch?app=desktop&v=Iq9DzN6mvYA).)" ] }, { "cell_type": "markdown", "id": "00cc5ce5", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "**Oversimplified take:** Fit a model to data and use it to make predictions. (This is how scikit-learn designed its API, BTW.)" 
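, "\n", "\n", "
As a minimal sketch of that fit-then-predict pattern (the synthetic data, the choice of `LinearRegression`, and every variable name here are assumptions made for illustration, not part of the tutorial's dataset):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption for illustration): y = 2x + 1 plus small noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 1))  # one column of inputs, shape (50, 1)
y = 2.0 * X[:, 0] + 1.0 + rng.normal(0, 0.1, size=50)  # known targets

model = LinearRegression()  # choose a model
model.fit(X, y)  # fit the model to the data

# Values estimated by the fit (close to slope 2 and intercept 1).
slope, intercept = model.coef_[0], model.intercept_

# Use the fitted model to make predictions for inputs it has not seen.
y_new = model.predict(np.array([[3.0], [7.0]]))
```

Most scikit-learn estimators follow this same pattern, so swapping `LinearRegression` for another model usually leaves the `fit` and `predict` calls unchanged.
"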
] }, { "cell_type": "markdown", "id": "b2cf395e", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Vocabulary\n", "\n", "\n", "- **parameters:** Variables that define the model and control its behavior.\n", "\n", "- **model:** Set of mathematical equations used to approximate the data.\n", "\n", "- **labels/classes:** Quantity/category that we want to predict.\n", "\n", "- **features:** Observations (information) used as predictors of labels/classes.\n", "\n", "- **training:** Use **features** and known **labels/classes** to fit the **model** and estimate its **parameters** (full circle, right? But why stop now?).\n", "\n", "Please check out [this awesome lecture](https://docs.google.com/presentation/d/1Fa9SuyK9DIpd-MkJJjGqjCbAa-sHtr3qufC9MhmewDQ/edit) on ML for climate science." ] }, { "cell_type": "markdown", "id": "ed4313fa", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "- **hyper-parameters:** Variables that influence the **training** and the **model** but are not estimated during training.\n", "- **unsupervised learning:** Extract information and structure from the data without **training** with known **labels**. We will see clustering and Principal Component Analysis (PCA).\n", "\n", "- **supervised learning:** Fit a model using data to \"train\" it for making predictions. Examples: regression, classification, spam detection, recommendation systems. We'll see KNN (k-nearest neighbors), a classification method, in this tutorial." ] }, { "cell_type": "markdown", "id": "97fe276c", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Unsupervised: PCA\n", "\n", "The dataset we will use consists of Red, Green, Blue composites (**features**) from photos of plastic pellets. We also have some extra information on the pellet size, shape, etc.\n", "\n", "The **labels** are the yellowing index. The goal is to predict the yellowing based on the pellet's image, broken down into its RGB info." 
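, "\n", "\n", "
To tie the vocabulary to PCA before touching the real data, here is a minimal sketch on a synthetic stand-in for the pellet colors. The random `rgb` array, the choice of two components, and all variable names are assumptions for illustration only; they are not the tutorial's dataset or its settings.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the pellet colors (an assumption, not the real data):
# each row is one pellet, the columns are its r, g, b values.
rng = np.random.default_rng(42)
brightness = rng.uniform(100, 220, size=100)
rgb = np.column_stack([
    brightness + rng.normal(0, 5, size=100),        # r
    brightness + rng.normal(0, 5, size=100),        # g
    0.8 * brightness + rng.normal(0, 5, size=100),  # b
])

# n_components is a hyper-parameter: we choose it, it is not estimated.
pca = PCA(n_components=2)

# Unsupervised: fit_transform only sees the features, never any labels.
scores = pca.fit_transform(rgb)

# Quantities estimated from the data: the principal axes and the
# fraction of the total variance explained by each of them.
axes = pca.components_
explained = pca.explained_variance_ratio_
```

Because the three synthetic channels mostly move together with brightness, the first component should carry most of the variance. Whether the real pellet colors behave the same way is something to check on the data itself.
"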
] }, { "cell_type": "code", "execution_count": 1, "id": "6f449af4", "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 2, "id": "048ff44d", "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "
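<table>\n",
"<thead>\n",
"<tr><th>image</th><th>r</th><th>g</th><th>b</th><th>size (mm)</th><th>color</th><th>description</th><th>erosion</th><th>erosion index</th><th>yellowing</th><th>yellowing index</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>cl1_p11_moca2_deixa5_a0001</th><td>152</td><td>150</td><td>143</td><td>4.021</td><td>transparent</td><td>sphere</td><td>high erosion</td><td>3</td><td>low</td><td>1</td></tr>\n",
"<tr><th>cl1_p12_lagoinha_deixa1_g0006</th><td>221</td><td>218</td><td>219</td><td>4.244</td><td>white</td><td>light erosion</td><td>low erosion</td><td>1</td><td>low</td><td>1</td></tr>\n",
"<tr><th>cl1_p12_lagoinha_deixa1_g0007</th><td>140</td><td>137</td><td>129</td><td>3.946</td><td>white</td><td>not erosion</td><td>low erosion</td><td>1</td><td>low</td><td>1</td></tr>\n",
"<tr><th>cl1_p12_lagoinha_deixa1_g0008</th><td>188</td><td>178</td><td>146</td><td>3.948</td><td>white</td><td>moderate erosion</td><td>high erosion</td><td>3</td><td>moderate</td><td>2</td></tr>\n",
"<tr><th>cl1_p12_lagoinha_deixa2_h0004</th><td>207</td><td>200</td><td>189</td><td>6.043</td><td>white</td><td>light erosion</td><td>low erosion</td><td>1</td><td>moderate</td><td>2</td></tr>\n",
"<tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
"<tr><th>cl1_p6_moca2_deixa3_a0006</th><td>186</td><td>193</td><td>155</td><td>4.546</td><td>transparent</td><td>cylinder</td><td>moderate erosion</td><td>2</td><td>low</td><td>1</td></tr>\n",
"<tr><th>cl1_p8_moca2_deixa5_b0001</th><td>169</td><td>168</td><td>106</td><td>3.082</td><td>transparent</td><td>sphere</td><td>low erosion</td><td>1</td><td>low</td><td>1</td></tr>\n",
"<tr><th>cl1_p8_moca2_deixa5_b0003</th><td>191</td><td>189</td><td>152</td><td>3.932</td><td>white</td><td>sphere</td><td>low erosion</td><td>1</td><td>low</td><td>1</td></tr>\n",
"<tr><th>cl1_p8_moca2_deixa5_b0004</th><td>181</td><td>156</td><td>70</td><td>3.230</td><td>white</td><td>sphere</td><td>moderate erosion</td><td>3</td><td>moderate</td><td>2</td></tr>\n",
"<tr><th>cl1_p9_moca2_deixa5_b0001</th><td>193</td><td>192</td><td>198</td><td>3.763</td><td>transparent</td><td>sphere</td><td>high erosion</td><td>3</td><td>low</td><td>1</td></tr>\n",
"</tbody>\n",
"</table>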
\n", "

127 rows × 10 columns

\n", "