{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"execution": {},
"id": "view-in-github"
},
"source": [
"
"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"# Tutorial 1: The problem of changing data distributions\n",
"\n",
"**Week 2, Day 4: Macro-Learning**\n",
"\n",
"**By Neuromatch Academy**\n",
"\n",
"__Content creators:__ Hlib Solodzhuk, Ximeng Mao, Grace Lindsay\n",
"\n",
"__Content reviewers:__ Aakash Agrawal, Alish Dipani, Hossein Rezaei, Yousef Ghanbari, Mostafa Abdollahi, Hlib Solodzhuk, Ximeng Mao, Samuele Bolotta, Grace Lindsay\n",
"\n",
"__Production editors:__ Konstantine Tsafatinos, Ella Batty, Spiros Chavlis, Samuele Bolotta, Hlib Solodzhuk\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"___\n",
"\n",
"\n",
"# Tutorial Objectives\n",
"\n",
"*Estimated timing of tutorial: 30 minutes*\n",
"\n",
"In this tutorial, we will explore the problems that arise from *distribution shifts*. Distribution shifts occur when the testing data distribution deviates from the training data distribution; that is, when a model is evaluated on data that somehow differs from what it was trained on.\n",
"\n",
"There are many ways that testing data can differ from training data. Two broad categories of distribution shifts are: **covariate shift** and **concept shift**.\n",
"\n",
"In covariate shift, the distribution of input features changes. For example, consider a dog/cat classification task where the model was trained to differentiate between real photos of pets while the testing dataset consists entirely of cartoon characters.\n",
"\n",
"Concept shift, as its name suggests, involves a conceptual change in the relationship between features and the desired output. For example, a recommendation system may learn a user's preferences, but then those preferences change.\n",
"\n",
"We will explore both types of shifts using a simple function that represents the relationship between the day of the year and the price of fruits in the local market!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"remove-input"
]
},
"outputs": [],
"source": [
"# @markdown\n",
"from IPython.display import IFrame\n",
"from ipywidgets import widgets\n",
"out = widgets.Output()\n",
"with out:\n",
" print(f\"If you want to download the slides: https://osf.io/download/t36w8/\")\n",
" display(IFrame(src=f\"https://mfr.ca-1.osf.io/render?url=https://osf.io/t36w8/?direct%26mode=render%26action=download%26mode=render\", width=730, height=410))\n",
"display(out)"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"---\n",
"# Setup\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Install and import feedback gadget\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"# @title Install and import feedback gadget\n",
"\n",
"!pip install vibecheck datatops --quiet\n",
"\n",
"from vibecheck import DatatopsContentReviewContainer\n",
"def content_review(notebook_section: str):\n",
" return DatatopsContentReviewContainer(\n",
" \"\", # No text prompt\n",
" notebook_section,\n",
" {\n",
" \"url\": \"https://pmyvdlilci.execute-api.us-east-1.amazonaws.com/klab\",\n",
" \"name\": \"neuromatch_neuroai\",\n",
" \"user_key\": \"wb2cxze8\",\n",
" },\n",
" ).render()\n",
"\n",
"\n",
"feedback_prefix = \"W2D4_T1\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Imports\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"# @title Imports\n",
"\n",
"#working with data\n",
"import numpy as np\n",
"\n",
"#plotting\n",
"import matplotlib.pyplot as plt\n",
"import logging\n",
"\n",
"#interactive display\n",
"import ipywidgets as widgets\n",
"\n",
"#modeling\n",
"from sklearn.neural_network import MLPRegressor\n",
"from sklearn.model_selection import train_test_split"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Figure settings\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"# @title Figure settings\n",
"\n",
"logging.getLogger('matplotlib.font_manager').disabled = True\n",
"\n",
"%matplotlib inline\n",
"%config InlineBackend.figure_format = 'retina' # perfrom high definition rendering for images and plots\n",
"plt.style.use(\"https://raw.githubusercontent.com/NeuromatchAcademy/course-content/main/nma.mplstyle\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Plotting functions\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"# @title Plotting functions\n",
"\n",
"def predict(days, prices, summer_days_mean, summer_days_std, start_month, end_month):\n",
" \"\"\"\n",
" Predicts the prices for a given date range and plots the true and predicted prices.\n",
"\n",
" Inputs:\n",
" - days (np.ndarray): input days to predict the prices on.\n",
" - prices (np.ndarray): true price values.\n",
" - summer_days_mean (float): mean value of summer days.\n",
" - summer_days_std (float): standard deviation of summer days.\n",
" - start_month (str): The starting month for the prediction range.\n",
" - end_month (str): The ending month for the prediction range.\n",
"\n",
" Raises:\n",
" - ValueError: If the specified start_month is after the end_month.\n",
" \"\"\"\n",
"\n",
" #check for feasibility of prompt:)\n",
" if months.index(start_month) > months.index(end_month):\n",
" raise ValueError(\"Please enter valid month interval.\")\n",
"\n",
" #find days and prices for the selected interval\n",
" selected_days = np.expand_dims(days[int(np.sum(months_durations[:months.index(start_month)])) : int(np.sum(months_durations[:months.index(end_month)+1]))], 1)\n",
" selected_prices = prices[int(np.sum(months_durations[:months.index(start_month)])) : int(np.sum(months_durations[:months.index(end_month)+1]))]\n",
"\n",
" #normalize selected days\n",
" selected_days_norm = (selected_days - summer_days_mean) / summer_days_std\n",
"\n",
" #evaluate MLP on normalized selected data\n",
" print(f\"R-squared value is: {model.score(selected_days_norm, selected_prices):.02f}.\")\n",
"\n",
" #predict for selected dates\n",
" selected_prices_predictions = model.predict(selected_days_norm)\n",
"\n",
" #plot true and predicted data\n",
" with plt.xkcd():\n",
" plt.plot(selected_days, selected_prices, label = \"True Data\")\n",
" plt.scatter(selected_days, selected_prices_predictions, label = f\"From {start_month} to {end_month} Predictions\", marker='o', color='r', zorder=2)\n",
" plt.xlabel('Week')\n",
" plt.ylabel('Price')\n",
" plt.axvspan(days[151], days[242], facecolor='gray', alpha=0.1, label = \"Training period\") #add grey background for summer days to see training data\n",
" plt.xlim(np.min(selected_days), np.max(selected_days))\n",
" plt.legend()\n",
" plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Helper functions\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"# @title Helper functions\n",
"\n",
"months = [\"January\", \"February\", \"March\", \"April\", \"May\", \"June\", \"July\", \"August\", \"September\", \"October\", \"November\", \"December\"]\n",
"\n",
"months_durations = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]\n",
"\n",
"days = np.arange(-26, 26 + 1/7, 1/7)\n",
"prices = .005 * days**2 + .1 * np.sin(np.pi * days) + 1\n",
"\n",
"start_month = \"January\"\n",
"end_month = \"December\"\n",
"\n",
"summer_days = np.expand_dims(days[151:243], 1)\n",
"summer_prices = prices[151:243]\n",
"summer_days_train, summer_days_test, summer_prices_train, summer_prices_test = train_test_split(summer_days, summer_prices, random_state = 42)\n",
"summer_days_mean, summer_days_std = np.mean(summer_days), np.std(summer_days)\n",
"summer_days_train_norm = (summer_days_train - summer_days_mean) / summer_days_std\n",
"summer_days_test_norm = (summer_days_test - summer_days_mean) / summer_days_std\n",
"model = MLPRegressor(hidden_layer_sizes=(100, 100), max_iter=10000, random_state = 42, solver = \"lbfgs\")\n",
"model.fit(summer_days_train_norm, summer_prices_train)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Video 1: Distribution shifts\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"remove-input"
]
},
"outputs": [],
"source": [
"# @title Video 1: Distribution shifts\n",
"\n",
"from ipywidgets import widgets\n",
"from IPython.display import YouTubeVideo\n",
"from IPython.display import IFrame\n",
"from IPython.display import display\n",
"\n",
"class PlayVideo(IFrame):\n",
" def __init__(self, id, source, page=1, width=400, height=300, **kwargs):\n",
" self.id = id\n",
" if source == 'Bilibili':\n",
" src = f'https://player.bilibili.com/player.html?bvid={id}&page={page}'\n",
" elif source == 'Osf':\n",
" src = f'https://mfr.ca-1.osf.io/render?url=https://osf.io/download/{id}/?direct%26mode=render'\n",
" super(PlayVideo, self).__init__(src, width, height, **kwargs)\n",
"\n",
"def display_videos(video_ids, W=400, H=300, fs=1):\n",
" tab_contents = []\n",
" for i, video_id in enumerate(video_ids):\n",
" out = widgets.Output()\n",
" with out:\n",
" if video_ids[i][0] == 'Youtube':\n",
" video = YouTubeVideo(id=video_ids[i][1], width=W,\n",
" height=H, fs=fs, rel=0)\n",
" print(f'Video available at https://youtube.com/watch?v={video.id}')\n",
" else:\n",
" video = PlayVideo(id=video_ids[i][1], source=video_ids[i][0], width=W,\n",
" height=H, fs=fs, autoplay=False)\n",
" if video_ids[i][0] == 'Bilibili':\n",
" print(f'Video available at https://www.bilibili.com/video/{video.id}')\n",
" elif video_ids[i][0] == 'Osf':\n",
" print(f'Video available at https://osf.io/{video.id}')\n",
" display(video)\n",
" tab_contents.append(out)\n",
" return tab_contents\n",
"\n",
"video_ids = [('Youtube', 'lsPtgn-mEps'), ('Bilibili', 'BV16w4m1v7wP')]\n",
"tab_contents = display_videos(video_ids, W=730, H=410)\n",
"tabs = widgets.Tab()\n",
"tabs.children = tab_contents\n",
"for i in range(len(tab_contents)):\n",
" tabs.set_title(i, video_ids[i][0])\n",
"display(tabs)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Submit your feedback\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"# @title Submit your feedback\n",
"content_review(f\"{feedback_prefix}_distribution_shifts_video\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"---\n",
"\n",
"# Section 1: Covariate shift\n",
"\n",
"In this section, we are going to discuss the case of covariate shift in distribution - when the distribution of features (usually denoted by $\\mathbf{x}$) differs in the training and testing data."
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"## Coding Exercise 1: Fitting pricing data to MLP\n",
"\n",
"In this exercise, we are going to train a Multi-Layer Perceptron (a fully-connected neural network with non-linear ReLU activation functions) to predict prices for fruits.\n",
"\n",
"As mentioned in the video, we will model the price of fruits with the following function:\n",
"\n",
"$$f(x) = A x^{2} + B sin(\\pi x + \\phi) + C$$\n",
"\n",
"This equation suggests quadratic annual behavior (with summer months being at the bottom of the parabola) as well as bi-weekly seasonality introduced by the $sin(\\pi x)$ term (with top values being the days where there is supply of fresh fruits to the market). Variables $A, B, \\phi \\:\\: \\text{and} \\:\\: C$ allow us to tune the day-price relation in different scenarios (for example, we will observe the role of $\\phi$ in the second section of the tutorial). For this particular case, let us set $A = 0.005$, $B = 0.1$, $\\phi = 0$ and $C = 1$.\n",
"\n",
"At first, let's take a look at the data - we will plot it.\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"cellView": "form",
"colab_type": "text",
"execution": {},
"pycharm": {
"name": "#%%\n"
}
},
"source": [
"```python\n",
"#define variables\n",
"A = .005\n",
"B = 0.1\n",
"phi = 0\n",
"C = 1\n",
"\n",
"#define days (observe that those are not 1, ..., 365 but proxy ones to make model function neat)\n",
"days = np.arange(-26, 26 + 1/7, 1/7) #defined as fractions of a week\n",
"\n",
"###################################################################\n",
"## Fill out the following then remove\n",
"raise NotImplementedError(\"Student exercise: need to complete days-prices relation formula\")\n",
"###################################################################\n",
"prices = ... * ...**2 + ... * np.sin(np.pi * ... + ...) + ...\n",
"\n",
"#plot relation between days and prices\n",
"with plt.xkcd():\n",
" plt.plot(days, prices)\n",
" plt.xlabel('Week')\n",
" plt.ylabel('Price')\n",
" plt.show()\n",
"\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"execution": {}
},
"outputs": [],
"source": [
"# to_remove solution\n",
"\n",
"#define variables\n",
"A = .005\n",
"B = 0.1\n",
"phi = 0\n",
"C = 1\n",
"\n",
"#define days (observe that those are not 1, ..., 365 but proxy ones to make model function neat)\n",
"days = np.arange(-26, 26 + 1/7, 1/7) #defined as fractions of a week\n",
"\n",
"prices = A * days**2 + B * np.sin(np.pi * days + phi) + C\n",
"\n",
"#plot relation between days and prices\n",
"with plt.xkcd():\n",
" plt.plot(days, prices)\n",
" plt.xlabel('Week')\n",
" plt.ylabel('Price')\n",
" plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"Let's ensure that we indeed have 365 days and 26 local maxima (which equals the number of weeks divided by two, as we receive new supplies bi-weekly)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"execution": {}
},
"outputs": [],
"source": [
"print(f\"The number of days is {days.shape[0]}.\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"execution": {}
},
"outputs": [],
"source": [
"print(f\"The number of peaks is {np.sum((np.diff(prices)[:-1] > 0) & (np.diff(prices)[1:] < 0))}.\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"Notice that x-axis values start from `-26` and up to `26`, where `-26` represents the start of the year season (January), `0` represents the middle of the year (June-July, having the lowest price of the fruits), and `26` ends the year with December. As we have 52 weeks in the year, the value explicitly describes the week before (if it comes with a minus sign) or after the week in the middle of the year."
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"**Now, we are ready to train the model to predict the price of the fruits.** First, we will assume we have only been to the market during summer and so only have that data to train on. For this, we need to take data only from the summer months and feed it into the MLP.\n",
"\n",
"As usual, the data will be separated into two distinct classes: train and test. We will measure its performance using the R-squared metric. We will also normalize the data to provide better learning stability for the model.\n",
"\n",
"The MLP consists of two hidden layers, with 100 neurons in each. We use `LBFGS` as the solver for this particular scenario as it performs better compared to `Adam` or `SGD` when the number of data points is limited."
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"execution": {}
},
"source": [
"```python\n",
"#take only summer data\n",
"summer_days = np.expand_dims(days[151:243], 1)\n",
"summer_prices = prices[151:243]\n",
"\n",
"#divide data into train and test sets\n",
"summer_days_train, summer_days_test, summer_prices_train, summer_prices_test = train_test_split(summer_days, summer_prices, random_state = 42)\n",
"\n",
"###################################################################\n",
"## Fill out the following then remove\n",
"raise NotImplementedError(\"Student exercise: need to normalized days and to fit model with it\")\n",
"###################################################################\n",
"\n",
"#apply normalization for days\n",
"summer_days_mean, summer_days_std = np.mean(...), np.std(...)\n",
"summer_days_train_norm = (summer_days_train - ...) / ...\n",
"summer_days_test_norm = (summer_days_test - ...) / ...\n",
"\n",
"#define MLP\n",
"model = MLPRegressor(hidden_layer_sizes=(100, 100), max_iter=10000, random_state = 42, solver = \"lbfgs\") # LBFGS is better to use when there is small amount of data\n",
"\n",
"#train MLP\n",
"model.fit(..., ...)\n",
"\n",
"#evaluate MLP on test data\n",
"print(f\"R-squared value is: {model.score(summer_days_test_norm, summer_prices_test):.02f}.\")\n",
"\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"execution": {}
},
"outputs": [],
"source": [
"# to_remove solution\n",
"\n",
"#take only summer data\n",
"summer_days = np.expand_dims(days[151:243], 1)\n",
"summer_prices = prices[151:243]\n",
"\n",
"#divide data into train and test sets\n",
"summer_days_train, summer_days_test, summer_prices_train, summer_prices_test = train_test_split(summer_days, summer_prices, random_state = 42)\n",
"\n",
"#apply normalization for days\n",
"summer_days_mean, summer_days_std = np.mean(summer_days), np.std(summer_days)\n",
"summer_days_train_norm = (summer_days_train - summer_days_mean) / summer_days_std\n",
"summer_days_test_norm = (summer_days_test - summer_days_mean) / summer_days_std\n",
"\n",
"#define MLP\n",
"model = MLPRegressor(hidden_layer_sizes=(100, 100), max_iter=10000, random_state = 42, solver = \"lbfgs\") # LBFGS is better to use when there is small amount of data\n",
"\n",
"#train MLP\n",
"model.fit(summer_days_train_norm, summer_prices_train)\n",
"\n",
"#evaluate MLP on test data\n",
"print(f\"R-squared value is: {model.score(summer_days_test_norm, summer_prices_test):.02f}.\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"Let's explore the predictions of the model visually."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Make sure you execute this cell to observe the plot!\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"# @title Make sure you execute this cell to observe the plot!\n",
"\n",
"#predict for test set\n",
"summer_prices_test_predictions = model.predict(summer_days_test_norm)\n",
"\n",
"with plt.xkcd():\n",
" plt.plot(summer_days, summer_prices, label = \"True Data\")\n",
" plt.scatter(summer_days_test, summer_prices_test_predictions, label = \"Test Predictions\", marker='o', color='g', zorder=2)\n",
" plt.xlabel('Week')\n",
" plt.ylabel('Price')\n",
" plt.legend()\n",
" plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"Now that we've learned about prices during the summer, can we predict what prices will be in autumn? Let's see how well our summer-trained model performs."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"execution": {}
},
"outputs": [],
"source": [
"#take only autumn data\n",
"autumn_days = np.expand_dims(days[243:334], 1)\n",
"autumn_prices = prices[243:334]\n",
"\n",
"#apply normalization (pay attention to the fact that we use summer metrics as the model was trained on them!)\n",
"autumn_days_norm = (autumn_days - summer_days_mean) / summer_days_std\n",
"\n",
"#evaluate MLP on normalized autumn data\n",
"print(f\"R-squared value is: {model.score(autumn_days_norm, autumn_prices):.02f}.\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"The R-squared value dropped significantly; let's observe what is going on visually."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Make sure you execute this cell to observe the plot!\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"# @title Make sure you execute this cell to observe the plot!\n",
"\n",
"#predict for test set\n",
"autumn_prices_predictions = model.predict(autumn_days_norm)\n",
"\n",
"with plt.xkcd():\n",
" plt.plot(autumn_days, autumn_prices, label = \"True Data\")\n",
" plt.scatter(autumn_days, autumn_prices_predictions, label = \"Autumn Predictions\", marker='o', color='r', zorder=2)\n",
" plt.xlabel('Week')\n",
" plt.ylabel('Price')\n",
" plt.legend()\n",
" plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"### Coding Exercise 1 Discussion\n",
"\n",
"1. How would you qualitatively evaluate the model's performance on autumn data? Does it capture the annual trend? Does it capture the weekly trend?\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"execution": {}
},
"outputs": [],
"source": [
"#to_remove explanation\n",
"\n",
"\"\"\"\n",
"Discussion: How would you qualitatively evaluate the model's performance on autumn data?\n",
"Does it capture the annual trend? Does it capture the weekly trend?\n",
"\n",
"Model predictions are completely invariant to the weekly seasonality of the data, though\n",
"they somewhat capture the increasing trend. Thus, it definitely can't be used to make\n",
"quality predictions on a daily basis.\n",
"\"\"\";"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Submit your feedback\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"# @title Submit your feedback\n",
"content_review(f\"{feedback_prefix}_fitting_pricing_data_to_mlp\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"## Interactive Demo 1: Covariate shift impact on predictability power\n",
"\n",
"In this interactive demo, you will explore how well the model generalizes to other months using two dropdown menus for selecting the start and end months for testing data.\n",
"\n",
"!N.B.: Note that for summer months, some training data will also be included in the evaluation of the prediction."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" Make sure you execute this cell to enable the widget!\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"# @markdown Make sure you execute this cell to enable the widget!\n",
"\n",
"@widgets.interact\n",
"def interactive_predict(start_month = widgets.Dropdown(description = \"Start Month\", options = months, value = \"June\"), end_month = widgets.Dropdown(description = \"End Month\", options = months, value = \"October\")):\n",
" predict(days, prices, summer_days_mean, summer_days_std, start_month, end_month)"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"### Interactive Demo 1 Discussion\n",
"\n",
"1. Does the amount of covariate shift impact the model's performance? What happens at the borders of the training period—does the model still capture the dynamics right before and after it?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"execution": {}
},
"outputs": [],
"source": [
"#to_remove explanation\n",
"\n",
"\"\"\"\n",
"Discussion: Does the amount of covariate shift impact the model's performance?\n",
"What happens at the borders of the training period—does the model still capture the\n",
"dynamics right before and after it?\n",
"\n",
"Indeed, the bigger the covariate shift (the more distinct the days are), the worse\n",
"the performance we observe. In both border cases, the model performs poorly; what is\n",
"more - even on the fraction of training data near these regions, we can observe that\n",
"the model is going to lose the desired dynamics.\n",
"\"\"\";"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Submit your feedback\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"# @title Submit your feedback\n",
"content_review(f\"{feedback_prefix}_covariate_shift_impact_on_predictability_power\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"---\n",
"\n",
"# Section 2: Concept shift\n",
"\n",
"Estimated time to reach this point from the start of the tutorial: 15 minutes\n",
"\n",
"In this section, we are going to explore another case of distribution shift, which is different in nature from covariate shift: concept shift."
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"## Exercise 2: Change in the day of supply\n",
"\n",
"We observed how transitioning from summer to autumn introduces covariate shifts, but what would lead to a concept shift in our fruit price relationship? One possibility is a change in the day of the week when fresh fruits are delivered. Let's revisit the modeling equation:\n",
"\n",
"$$f(x) = A x^{2} + B sin(\\pi x + \\phi) + C$$\n",
"\n",
"Which variable, do you think, needs to be changed so that we now receive fresh fruits 2 days later than before?\n",
"\n",
"Answer
\n",
"
\n",
"Yes, indeed, it involves a sinusoidal phase shift — we only need to change the $\\phi$ value.\n",
" \n",
"\n",
"Let's take a look at how well the model generalizes to this unexpected change as well."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"execution": {}
},
"outputs": [],
"source": [
"shifted_phi = - 2 * np.pi / 7\n",
"shifted_prices = A * days**2 + B * np.sin(np.pi*days + shifted_phi) + C\n",
"\n",
"#plot relation between days, prices & shifted prices for June\n",
"with plt.xkcd():\n",
" plt.plot(days[151:181], prices[151:181], label = \"Original Data\")\n",
" plt.plot(days[151:181], shifted_prices[151:181], label = \"Shifted Data\")\n",
" plt.xlabel('Week')\n",
" plt.ylabel('Price')\n",
" plt.legend()\n",
" plt.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"execution": {}
},
"outputs": [],
"source": [
"#take only summer shifted data\n",
"summer_days = np.expand_dims(days[151:243], 1)\n",
"summer_prices_shifted = shifted_prices[151:243]\n",
"\n",
"#apply normalization (pay attention to the fact that we use summer metrics as the model was trained on them!)\n",
"summer_days_norm = (summer_days - summer_days_mean) / summer_days_std\n",
"\n",
"#evaluate MLP on normalized original & shifted data\n",
"print(f\"R-squared value for original prices is: {model.score(summer_days_norm, summer_prices):.02f}.\")\n",
"print(f\"R-squared value for shifted prices is: {model.score(summer_days_norm, summer_prices_shifted):.02f}.\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Make sure you execute this cell to observe the plot!\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"# @title Make sure you execute this cell to observe the plot!\n",
"\n",
"#predict for summer days\n",
"summer_prices_predictions = model.predict(summer_days_norm)\n",
"\n",
"with plt.xkcd():\n",
" plt.plot(summer_days, summer_prices_shifted, label = \"Shifted Data\")\n",
" plt.scatter(summer_days, summer_prices_predictions, label = \"Summer Predictions (Based on Original Model)\", marker='o', color='r', zorder=2)\n",
" plt.xlabel('Week')\n",
" plt.ylabel('Price')\n",
" plt.legend()\n",
" plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"Indeed, the model's predictions are capturing the original phase, not the phase-shifted function. Well, it's somewhat expected: we fixed values of features (in this case, week number), trained at first on one set of output values (prices), and then changed the outputs to measure the model's performance. It's obvious that the model will perform badly. Still, it's important to notice the effect of concept shift and this translation between conceptual effect and its impact on modeling. "
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"### Exercise 2 Discussion\n",
"\n",
"1. Why do you think the R-squared value is still higher for this particular example of concept shift compared to the covariate shift?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"execution": {}
},
"outputs": [],
"source": [
"#to_remove explanation\n",
"\n",
"\"\"\"\n",
"Discussion: Why do you think the R-squared value is still higher for this particular\n",
"example of concept shift compared to the covariate shift?\n",
"\n",
"In this example, concept shift preserves the annual and weekly trends (we can see that\n",
"predictions oscillate the same way as the shifted function). Thus, the R-squared value is relatively high.\n",
"\"\"\";"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Submit your feedback\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"# @title Submit your feedback\n",
"content_review(f\"{feedback_prefix}_change_in_the_day_of_supply\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"---\n",
"\n",
"## Think!: Examples of concept shift\n",
"\n",
"What other examples of concept shift could you come up with in the context of the \"fruit prices\" task? What variables would these shifts impact?\n",
"\n",
"Take 2 minutes to think, then discuss as a group."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Submit your feedback\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"# @title Submit your feedback\n",
"content_review(f\"{feedback_prefix}_examples_of_concept_shift\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"---\n",
"# Summary\n",
"\n",
"*Estimated timing of tutorial: 30 minutes*\n",
"\n",
"Here's what we learned:\n",
"\n",
"1. Covariate and concept shifts are two different types of data distribution shifts.\n",
"2. Distribution shifts negatively impact model performance.\n",
"\n",
"In the next tutorials, we are going to address the question of generalization—what are the techniques and methods to deal with poor generalization performance due to distribution shifts."
]
}
],
"metadata": {
"colab": {
"collapsed_sections": [],
"include_colab_link": true,
"name": "W2D4_Tutorial1",
"provenance": [],
"toc_visible": true
},
"kernel": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.19"
}
},
"nbformat": 4,
"nbformat_minor": 4
}