{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"execution": {},
"id": "view-in-github"
},
"source": [
"
"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"# Tutorial 1: Depth vs width\n",
"\n",
"**Week 2, Day 1: Macrocircuits**\n",
"\n",
"**By Neuromatch Academy**\n",
"\n",
"__Content creators:__ Gabriel Mel de Fontenay\n",
"\n",
"__Content reviewers:__ Surya Ganguli, Xaq Pitkow, Hlib Solodzhuk, Aakash Agrawal, Alish Dipani, Hossein Rezaei, Yousef Ghanbari, Mostafa Abdollahi, Patrick Mineault\n",
"\n",
"__Production editors:__ Konstantine Tsafatinos, Ella Batty, Spiros Chavlis, Samuele Bolotta, Hlib Solodzhuk\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"___\n",
"\n",
"\n",
"# Tutorial Objectives\n",
"\n",
"*Estimated timing of tutorial: 1 hour*\n",
"\n",
"In this tutorial we will take a closer look at the expressivity of neural networks by observing the following:\n",
"\n",
"- The **universal approximator theorem** guarantees that we can approximate any complex function using a network with a single hidden layer. The catch is that the approximating network might need to be extremely *wide*.\n",
"- We will explore this issue by constructing a complex function and attempting to fit it with shallow networks of varying widths.\n",
"- To create this complex function, we'll build a random deep neural network. This is an example of the **student-teacher setting**, where we attempt to fit a known *teacher* function (the deep network) using a *student* model (the shallow/wide network).\n",
"- We will find that the deep teacher network can be either very easy or very hard to approximate and that the difficulty level is related to a form of **chaos** in the network activities.\n",
"- Each layer of a neural network can effectively expand and fold the input it receives from the previous layer. This repeated expansion and folding grants deep neural networks models high **expressivity** - ie. allows them to implement a large number of different functions.\n",
"\n",
"Let's get started!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"remove-input"
]
},
"outputs": [],
"source": [
"# @markdown\n",
"from IPython.display import IFrame\n",
"from ipywidgets import widgets\n",
"out = widgets.Output()\n",
"with out:\n",
" print(f\"If you want to download the slides: https://osf.io/download/9n4fj/\")\n",
" display(IFrame(src=f\"https://mfr.ca-1.osf.io/render?url=https://osf.io/9n4fj/?direct%26mode=render%26action=download%26mode=render\", width=730, height=410))\n",
"display(out)"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"---\n",
"# Setup\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Install and import feedback gadget\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"# @title Install and import feedback gadget\n",
"\n",
"!pip install vibecheck datatops --quiet\n",
"\n",
"from vibecheck import DatatopsContentReviewContainer\n",
"def content_review(notebook_section: str):\n",
" return DatatopsContentReviewContainer(\n",
" \"\", # No text prompt\n",
" notebook_section,\n",
" {\n",
" \"url\": \"https://pmyvdlilci.execute-api.us-east-1.amazonaws.com/klab\",\n",
" \"name\": \"neuromatch_neuroai\",\n",
" \"user_key\": \"wb2cxze8\",\n",
" },\n",
" ).render()\n",
"\n",
"\n",
"feedback_prefix = \"W2D1_T1\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Imports\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"# @title Imports\n",
"\n",
"#working with data\n",
"import numpy as np\n",
"\n",
"#plotting\n",
"import matplotlib.pyplot as plt\n",
"import logging\n",
"import matplotlib.patheffects as path_effects\n",
"\n",
"#interactive display\n",
"import ipywidgets as widgets\n",
"from tqdm.notebook import tqdm as tqdm\n",
"\n",
"#modeling\n",
"import torch\n",
"import torch.nn as nn"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Figure settings\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"# @title Figure settings\n",
"\n",
"logging.getLogger('matplotlib.font_manager').disabled = True\n",
"\n",
"%matplotlib inline\n",
"%config InlineBackend.figure_format = 'retina' # perfrom high definition rendering for images and plots\n",
"plt.style.use(\"https://raw.githubusercontent.com/NeuromatchAcademy/course-content/main/nma.mplstyle\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Plotting functions\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"# @title Plotting functions\n",
"\n",
"def plot_loss(Es):\n",
" \"\"\"\n",
" Plot loss progression over time.\n",
"\n",
" Inputs:\n",
" - Es (np.ndarray): sequence of loss values during training.\n",
" \"\"\"\n",
" with plt.xkcd():\n",
" plt.semilogy(Es_deep)\n",
" plt.xlabel('Epochs')\n",
" plt.ylabel('Error')\n",
" plt.title(\"Loss\")\n",
" plt.show()\n",
"\n",
"def plot_loss_as_function_of_width(Ws_student, Es_test, Es_train):\n",
" \"\"\"\n",
" Plot final loss of training as the function of the width of the network.\n",
" \"\"\"\n",
" with plt.xkcd():\n",
" plt.loglog(Ws_student, Es_test, '.-')\n",
" plt.loglog(Ws_student, Es_train[:,-1], '.-')\n",
" plt.legend(['Test', 'Train'])\n",
" plt.xlabel('Width')\n",
" plt.ylabel('Error')\n",
" plt.title(\"Loss\")\n",
" plt.show()\n",
"\n",
"def plot_students_predictions_vs_teacher_values(Es_train, X_test, y_test):\n",
" \"\"\"\n",
" Plot loss progression over the time and predicted values of student after training versus true ones generated from teacher.\n",
"\n",
" Inputs:\n",
" - Es_train (np.ndarray): loss values.\n",
" - X_test (np.ndarray): test input data.\n",
" - y_test (np.ndarray): test outpu data.\n",
" \"\"\"\n",
" with plt.xkcd():\n",
" fig, axes = plt.subplots(1,2,figsize=(10,5))\n",
" plt.locator_params(nbins=3)\n",
"\n",
" axes[0].semilogy(Es_train/float(y_test.var()))\n",
" axes[0].set_xlabel('Epochs')\n",
" axes[0].set_ylabel('Error')\n",
"\n",
" axes[1].scatter(y_test.detach(),student(X_test).detach())\n",
" axes[1].set_xlabel('Teacher')\n",
" axes[1].set_ylabel('Student')\n",
"\n",
" axes[1].tick_params(axis='y', labelrotation=90)\n",
" axes[1].set_yticks([-0.01,0,0.01])\n",
" axes[1].set_xticks([-0.01,0,0.01])\n",
"\n",
"def expressivity_visualization(layer, projected_traj_1, projected_traj_2, colors):\n",
" \"\"\"\n",
" Plot projected trajectories for points in the given layer for two different networks.\n",
"\n",
" Inputs:\n",
" - layer (int): layer of networks to visualize.\n",
" - projected_traj_1 (np.ndarray): standard network projections.\n",
" - projected_traj_2 (np.ndarray): quasilinear network projections.\n",
" - colors (np.ndarray): colors to use in plotting.\n",
" \"\"\"\n",
"\n",
" with plt.xkcd():\n",
"\n",
" fig = plt.figure()\n",
" fig.suptitle(f'Layer {layer}', fontsize=16)\n",
"\n",
" #standard net\n",
" ax1 = fig.add_subplot(121, projection='3d')\n",
" specific_layer_1 = projected_traj_1[layer]\n",
"\n",
" for i in range(len(specific_layer_1) - 1):\n",
" ax1.plot([specific_layer_1[i, 0], specific_layer_1[i + 1, 0]], [specific_layer_1[i, 1], specific_layer_1[i + 1, 1]], [specific_layer_1[i, 2], specific_layer_1[i + 1, 2]], color=colors[i])\n",
"\n",
" for line in ax1.get_lines():\n",
" line.set_path_effects([path_effects.Normal()])\n",
"\n",
" ax1.set_title('Standard Net')\n",
" ax1.set_xlabel('X')\n",
" ax1.set_ylabel('Y')\n",
" ax1.set_zlabel('Z')\n",
"\n",
" ax2 = fig.add_subplot(122, projection='3d')\n",
" specific_layer_2 = projected_traj_2[layer]\n",
"\n",
" for i in range(len(specific_layer_2) - 1):\n",
" ax2.plot([specific_layer_2[i, 0], specific_layer_2[i + 1, 0]], [specific_layer_2[i, 1], specific_layer_2[i + 1, 1]], [specific_layer_2[i, 2], specific_layer_2[i + 1, 2]], color=colors[i])\n",
"\n",
" for line in ax2.get_lines():\n",
" line.set_path_effects([path_effects.Normal()])\n",
"\n",
" ax2.set_title('Quasi-Linear Net')\n",
" ax2.set_xlabel('X')\n",
" ax2.set_ylabel('Y')\n",
" ax2.set_zlabel('Z')\n",
"\n",
" plt.tight_layout()\n",
" plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Helper functions\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"# @title Helper functions\n",
"\n",
"def generate_trajectories(W, D, P, sigma_1, sigma_2):\n",
" \"\"\"\n",
" Generate trajectories for evenly spaced points from unit circle through networks and project them to 3D space.\n",
"\n",
" Inputs:\n",
" - W (int): width of each layer.\n",
" - D (int): depth of each layer.\n",
" - P (int): number of points from unit circle.\n",
" - sigma_1 (float): standard net standard deviation.\n",
" - sigma_2 (float): quasi-linear net standard deviation.\n",
" \"\"\"\n",
" #initialize nets\n",
" standard_net = make_MLP(2, W, D)\n",
" initialize_layers(standard_net, sigma_1)\n",
"\n",
" quasilinear_net = make_MLP(2, W, D)\n",
" initialize_layers(quasilinear_net, sigma_2)\n",
"\n",
" #sample points from unit circle\n",
" theta = np.linspace(0, 2 * np.pi, P)\n",
" points = np.array([np.cos(theta), np.sin(theta)]).T\n",
"\n",
" #generate trajectories for first net\n",
" parameters = [param for param in standard_net.parameters()]\n",
"\n",
" for index in range(len(parameters)):\n",
" if not index:\n",
" traj_1 = [np.tanh(points @ parameters[index].detach().numpy().T)]\n",
" else:\n",
" traj_1.append(np.tanh(traj_1[-1] @ parameters[index].detach().numpy().T))\n",
"\n",
" #generate trajectories for second net\n",
" parameters = [param for param in quasilinear_net.parameters()]\n",
"\n",
" for index in range(0, len(parameters)):\n",
" if not index:\n",
" traj_2 = [np.tanh(points @ parameters[index].detach().numpy().T)]\n",
" else:\n",
" traj_2.append(np.tanh(traj_2[-1] @ parameters[index].detach().numpy().T))\n",
"\n",
" return np.array(traj_1[:-1]), np.array(traj_2[:-1])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Set random seed\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"# @title Set random seed\n",
"\n",
"import random\n",
"import numpy as np\n",
"\n",
"def set_seed(seed=None, seed_torch=True):\n",
" if seed is None:\n",
" seed = np.random.choice(2 ** 32)\n",
" random.seed(seed)\n",
" np.random.seed(seed)\n",
" if seed_torch:\n",
" torch.manual_seed(seed)\n",
" torch.cuda.manual_seed_all(seed)\n",
" torch.cuda.manual_seed(seed)\n",
" torch.backends.cudnn.benchmark = False\n",
" torch.backends.cudnn.deterministic = True\n",
"\n",
"set_seed(seed = 42)"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"---\n",
"\n",
"# Section 1: Introduction\n",
"\n",
"In this section we will create functions to capture the snippets of code that we will use repeatedly in what follows."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Video 1: Introduction\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"remove-input"
]
},
"outputs": [],
"source": [
"# @title Video 1: Introduction\n",
"\n",
"from ipywidgets import widgets\n",
"from IPython.display import YouTubeVideo\n",
"from IPython.display import IFrame\n",
"from IPython.display import display\n",
"\n",
"class PlayVideo(IFrame):\n",
" def __init__(self, id, source, page=1, width=400, height=300, **kwargs):\n",
" self.id = id\n",
" if source == 'Bilibili':\n",
" src = f'https://player.bilibili.com/player.html?bvid={id}&page={page}'\n",
" elif source == 'Osf':\n",
" src = f'https://mfr.ca-1.osf.io/render?url=https://osf.io/download/{id}/?direct%26mode=render'\n",
" super(PlayVideo, self).__init__(src, width, height, **kwargs)\n",
"\n",
"def display_videos(video_ids, W=400, H=300, fs=1):\n",
" tab_contents = []\n",
" for i, video_id in enumerate(video_ids):\n",
" out = widgets.Output()\n",
" with out:\n",
" if video_ids[i][0] == 'Youtube':\n",
" video = YouTubeVideo(id=video_ids[i][1], width=W,\n",
" height=H, fs=fs, rel=0)\n",
" print(f'Video available at https://youtube.com/watch?v={video.id}')\n",
" else:\n",
" video = PlayVideo(id=video_ids[i][1], source=video_ids[i][0], width=W,\n",
" height=H, fs=fs, autoplay=False)\n",
" if video_ids[i][0] == 'Bilibili':\n",
" print(f'Video available at https://www.bilibili.com/video/{video.id}')\n",
" elif video_ids[i][0] == 'Osf':\n",
" print(f'Video available at https://osf.io/{video.id}')\n",
" display(video)\n",
" tab_contents.append(out)\n",
" return tab_contents\n",
"\n",
"video_ids = [('Youtube', 'KgsFMiF1Uh0'), ('Bilibili', 'BV1YD421M78R')]\n",
"tab_contents = display_videos(video_ids, W=730, H=410)\n",
"tabs = widgets.Tab()\n",
"tabs.children = tab_contents\n",
"for i in range(len(tab_contents)):\n",
" tabs.set_title(i, video_ids[i][0])\n",
"display(tabs)"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"\n",
"The [universal approximator theorem](https://en.wikipedia.org/wiki/Universal_approximation_theorem) (UAT) guarantees that we can approximate any function arbitrarily well using a shallow network - ie. a network with a single hidden layer (figure below, left). So why do we need depth? The \"catch\" in the UAT is that approximating a complex function with a shallow network can require a very large number of hidden units - ie. the network must be very wide. The inability of shallow networks to efficiently implement certain functions suggests that network depth may be one of the brain's computational \"secret sauces\".\n",
"\n",
"
\n",
"\n",
"To illustrate this fact, we'll create a complex function and then attempt to fit it with single-hidden-layer neural networks of different widths. What we'll find is that although the UAT guarantees that sufficiently wide networks can approximate our function, the performance will actually not be very good for our shallow nets of modest width.\n",
"\n",
"One easy way to create a complex function is to build a random deep neural network (figure above, right). We then have a teacher network which generates the ground truth outputs, and a student network whose goal is to learn the mapping implemented by the teacher. This approach - known as the **student-teacher setting** - is useful for both computational and mathematical study of neural networks since it gives us complete control of the data generation process. Unlike with real-world data, we know the exact distribution of inputs and correct outputs.\n",
"\n",
"Finally, we will show that depending on the distribution of the weights, a random deep neural network can be either very difficult or very easy to approximate with a shallow network. The \"complexity\" of the function computed by a random deep network thus depends crucially on the weight distribution. One can actually understand the boundary between hard and easy cases as a kind of boundary between chaos and non-chaos in a certain dynamical system. We will confirm that on the non-chaotic side, a random deep neural network can be effectively approximated by a shallow net. This demonstration will be based on ideas from the paper:\n",
"\n",
"[*Exponential expressivity in deep neural networks through transient chaos*](https://papers.nips.cc/paper_files/paper/2016/hash/148510031349642de5ca0c544f31b2ef-Abstract.html) Poole et al. Neurips (2016)."
]
},
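{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"Before fitting deep teachers, here is a quick feel for the width dependence in the UAT. The following is a minimal sketch of our own (not part of the exercises; the target function and the random-feature construction are illustrative assumptions): we approximate $\sin(3x)$ with random tanh features plus a least-squares readout, which stands in for a trained single-hidden-layer network.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"rng = np.random.default_rng(0)\n",
"x = np.linspace(-np.pi, np.pi, 500)[:, None]\n",
"y_target = np.sin(3 * x).ravel()\n",
"\n",
"for width in [3, 10, 100]:\n",
"    # random, fixed first-layer weights; only the linear readout is fit\n",
"    features = np.tanh(x @ rng.normal(size=(1, width)) + rng.normal(size=width))\n",
"    coef, *_ = np.linalg.lstsq(features, y_target, rcond=None)\n",
"    rel_mse = np.mean((features @ coef - y_target) ** 2) / y_target.var()\n",
"    print(f\"width {width:4d}: relative MSE {rel_mse:.3f}\")\n",
"```\n",
"\n",
"The relative error should fall as the width grows - exactly the width-for-accuracy trade-off the UAT describes."
]
},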
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Submit your feedback\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"# @title Submit your feedback\n",
"content_review(f\"{feedback_prefix}_introduction\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Video 2: Setup\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"remove-input"
]
},
"outputs": [],
"source": [
"# @title Video 2: Setup\n",
"\n",
"from ipywidgets import widgets\n",
"from IPython.display import YouTubeVideo\n",
"from IPython.display import IFrame\n",
"from IPython.display import display\n",
"\n",
"class PlayVideo(IFrame):\n",
" def __init__(self, id, source, page=1, width=400, height=300, **kwargs):\n",
" self.id = id\n",
" if source == 'Bilibili':\n",
" src = f'https://player.bilibili.com/player.html?bvid={id}&page={page}'\n",
" elif source == 'Osf':\n",
" src = f'https://mfr.ca-1.osf.io/render?url=https://osf.io/download/{id}/?direct%26mode=render'\n",
" super(PlayVideo, self).__init__(src, width, height, **kwargs)\n",
"\n",
"def display_videos(video_ids, W=400, H=300, fs=1):\n",
" tab_contents = []\n",
" for i, video_id in enumerate(video_ids):\n",
" out = widgets.Output()\n",
" with out:\n",
" if video_ids[i][0] == 'Youtube':\n",
" video = YouTubeVideo(id=video_ids[i][1], width=W,\n",
" height=H, fs=fs, rel=0)\n",
" print(f'Video available at https://youtube.com/watch?v={video.id}')\n",
" else:\n",
" video = PlayVideo(id=video_ids[i][1], source=video_ids[i][0], width=W,\n",
" height=H, fs=fs, autoplay=False)\n",
" if video_ids[i][0] == 'Bilibili':\n",
" print(f'Video available at https://www.bilibili.com/video/{video.id}')\n",
" elif video_ids[i][0] == 'Osf':\n",
" print(f'Video available at https://osf.io/{video.id}')\n",
" display(video)\n",
" tab_contents.append(out)\n",
" return tab_contents\n",
"\n",
"video_ids = [('Youtube', 'skw2TLi9oa8'), ('Bilibili', 'BV1yi421v72o')]\n",
"tab_contents = display_videos(video_ids, W=730, H=410)\n",
"tabs = widgets.Tab()\n",
"tabs.children = tab_contents\n",
"for i in range(len(tab_contents)):\n",
" tabs.set_title(i, video_ids[i][0])\n",
"display(tabs)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Submit your feedback\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"# @title Submit your feedback\n",
"content_review(f\"{feedback_prefix}_setup\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"## Coding Exercise 1: Create an MLP\n",
"\n",
"The code below implements a function that takes in an input dimension, a layer width, and a number of layers and creates a simple MLP in pytorch. In between each layer, we insert a hyperbolic tangent nonlinearity layer (`nn.Tanh()`).\n",
"\n",
"Convention: Because we will count the input as a layer, a depth of 2 will mean a network with just one hidden layer, followed by the output neuron. A depth of 3 will mean 2 hidden layers, and so on."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Network Implementation\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"# @title Network Implementation\n",
"\n",
"def make_MLP(n_in, W, D, nonlin = 'tanh'):\n",
" \"\"\"\n",
" Create `nn.Sequnetial()` fully-connected model in pytorch with the given parameters.\n",
"\n",
" Inputs:\n",
" - n_in (int): input dimension.\n",
" - W (int): width of the network.\n",
" - D (int): depth if the network.\n",
" - nonlin (str, default = \"tanh\"): activation function to use.\n",
"\n",
" Outputs:\n",
" - net (nn.Sequential): network.\n",
" \"\"\"\n",
"\n",
" #activation function\n",
" if nonlin == 'tanh':\n",
" nonlin = nn.Tanh()\n",
" elif nonlin == 'relu':\n",
" nonlin == nn.ReLU()\n",
" else:\n",
" assert(False)\n",
"\n",
" # Assemble D-1 hidden layers and one output layer\n",
"\n",
" #input layer\n",
" layers = [nn.Linear(n_in, W, bias = False), nonlin]\n",
" for i in range(D - 2):\n",
" #linear layer\n",
" layers.append(nn.Linear(W, W, bias = False))\n",
" #activation function\n",
" layers.append(nonlin)\n",
" #output layer\n",
" layers.append(nn.Linear(W, 1, bias = False))\n",
"\n",
" return nn.Sequential(*layers)\n",
"\n",
"net = make_MLP(n_in = 10, W = 3, D = 2)"
]
},
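{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"To check the architecture, we can print the model. For `n_in = 10`, `W = 3`, `D = 2`, `nn.Sequential` should report a single hidden layer followed by the scalar output layer, along the lines of:\n",
"\n",
"```python\n",
"print(net)\n",
"# Sequential(\n",
"#   (0): Linear(in_features=10, out_features=3, bias=False)\n",
"#   (1): Tanh()\n",
"#   (2): Linear(in_features=3, out_features=1, bias=False)\n",
"# )\n",
"```"
]
},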
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"Now, we implement an auxiliary function which calculates the number of parameters in the MLP. "
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"execution": {}
},
"source": [
"```python\n",
"def get_num_params(n_in,W,D):\n",
" \"\"\"\n",
" Simple function to compute number of learned parameters in an MLP with given dimensions.\n",
"\n",
" Inputs:\n",
" - n_in (int): input dimension.\n",
" - W (int): width of the network.\n",
" - D (int): depth if the network.\n",
"\n",
" Outputs:\n",
" - num_params (int): number of parameters in the network.\n",
" \"\"\"\n",
" ###################################################################\n",
" ## Fill out the following then remove\n",
" raise NotImplementedError(\"Student exercise: complete function which calculates the number of parameters in the defined architecture of MLP.\")\n",
" ###################################################################\n",
"\n",
" input_params = ... * ...\n",
" hidden_layers_params = (...) * ...**2\n",
" output_params = ...\n",
" return input_params + hidden_layers_params + output_params\n",
"\n",
"np.testing.assert_allclose(get_num_params(10, 3, 2), 33, err_msg = \"Expected value of parameters number is different!\")\n",
"\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"execution": {}
},
"outputs": [],
"source": [
"#to_remove solution\n",
"\n",
"def get_num_params(n_in,W,D):\n",
" \"\"\"\n",
" Simple function to compute number of learned parameters in an MLP with given dimensions.\n",
"\n",
" Inputs:\n",
" - n_in (int): input dimension.\n",
" - W (int): width of the network.\n",
" - D (int): depth if the network.\n",
"\n",
" Outputs:\n",
" - num_params (int): number of parameters in the network.\n",
" \"\"\"\n",
" input_params = n_in * W\n",
" hidden_layers_params = (D-2) * W**2\n",
" output_params = W\n",
" return input_params + hidden_layers_params + output_params\n",
"\n",
"np.testing.assert_allclose(get_num_params(10, 3, 2), 33, err_msg = \"Expected value of parameters number is different!\")"
]
},
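{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"As a sanity check of the formula, take `n_in = 10`, `W = 3`, `D = 2`: the input layer contributes $n_{in} W = 30$ weights, the $D - 2 = 0$ hidden-to-hidden layers contribute nothing, and the output layer contributes $W = 3$, for a total of $33$ - matching the assertion above (recall that the layers have no biases)."
]
},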
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Submit your feedback\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"# @title Submit your feedback\n",
"content_review(f\"{feedback_prefix}_create_mlp\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"## Coding Exercise 2: Initialize model weights\n",
"\n",
"Write a function that, given a model and a $\\sigma$, initializes all weights in the model according to a normal distribution with mean $0$ and standard deviation\n",
" \n",
" $$\\frac{\\sigma}{\\sqrt{n_{in}}},$$\n",
" \n",
" where $n_{in}$ is the number of inputs to the layer."
]
},
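{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"Why this scaling? If a unit receives $n_{in}$ roughly unit-variance inputs, its pre-activation sums $n_{in}$ terms, so dividing the weight scale by $\sqrt{n_{in}}$ keeps the pre-activation variance at about $\sigma^2$, independent of the layer width (the same idea as LeCun initialization)."
]
},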
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"execution": {}
},
"source": [
"```python\n",
"set_seed(42)\n",
"\n",
"def initialize_layers(net,sigma):\n",
" \"\"\"\n",
" Set weight to each of the parameters in the model of value sigma/sqrt(n_in), where n_in is the number of inputs to the layer.\n",
"\n",
" Inputs:\n",
" - net (nn.Sequential): network.\n",
" - sigma (float): standard deviation.\n",
" \"\"\"\n",
" ###################################################################\n",
" ## Fill out the following then remove\n",
" raise NotImplementedError(\"Student exercise: set initial values to the weights of MLP.\")\n",
" ###################################################################\n",
" for param in ...:\n",
" n_in = param.shape[1]\n",
" nn.init.normal_(param, std = ...)\n",
"\n",
"initialize_layers(net, 1)\n",
"np.testing.assert_allclose(next(net.parameters())[0][0].item(), 0.609, err_msg = \"Expected value of parameter is different!\", atol = 1e-3)\n",
"\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"execution": {}
},
"outputs": [],
"source": [
"# to_remove solution\n",
"set_seed(42)\n",
"\n",
"def initialize_layers(net,sigma):\n",
" \"\"\"\n",
" Set weight to each of the parameters in the model of value sigma/sqrt(n_in), where n_in is the number of inputs to the layer.\n",
"\n",
" Inputs:\n",
" - net (nn.Sequential): network.\n",
" - sigma (float): standard deviation.\n",
" \"\"\"\n",
" for param in net.parameters():\n",
" n_in = param.shape[1]\n",
" nn.init.normal_(param, std = sigma/np.sqrt(n_in))\n",
"\n",
"initialize_layers(net, 1)\n",
"np.testing.assert_allclose(next(net.parameters())[0][0].item(), 0.609, err_msg = \"Expected value of parameter is different!\", atol = 1e-3)"
]
},
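{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"As an optional check that the initialization behaves as intended, we can compare the empirical standard deviation of the first layer's weights to the target $\sigma / \sqrt{n_{in}}$. A minimal sketch, assuming `net` was built with `n_in = 10` as above (the layer is small, so expect some sampling noise):\n",
"\n",
"```python\n",
"first_layer = next(net.parameters())  # weight matrix of shape (W, n_in)\n",
"n_in_layer = first_layer.shape[1]\n",
"print(float(first_layer.std()), 1 / np.sqrt(n_in_layer))  # roughly equal for sigma = 1\n",
"```"
]
},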
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Submit your feedback\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"# @title Submit your feedback\n",
"content_review(f\"{feedback_prefix}_initialize_model_weights\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"## Coding Exercise 3: Generate a dataset"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"Given a network, generate the input data by sampling from a multivariate Gaussian distribution and output data by passing the inputs through the network. Don't forget to `.detach()` the outputs - otherwise, gradients will be computed for these (with respect to the teacher weights, which we don't want)."
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"execution": {}
},
"source": [
"```python\n",
"set_seed(42)\n",
"\n",
"def make_data(net, n_in, n_examples):\n",
" \"\"\"\n",
" Generate data by sampling from a multivariate gaussian distribution, and output data by passing the inputs through the network.\n",
"\n",
" Inputs:\n",
" - net (nn.Sequential): network.\n",
" - n_in (int): input dimension.\n",
" - n_examples (int): number of data examples to generate.\n",
"\n",
" Outputs:\n",
" - X (torch.tensor): input data.\n",
" - y (torch.tensor): output data.\n",
" \"\"\"\n",
" ###################################################################\n",
" ## Fill out the following then remove\n",
" raise NotImplementedError(\"Student exercise: complete data generation.\")\n",
" ###################################################################\n",
" X = torch.randn(..., ...)\n",
" y = net(...).detach()\n",
" return X, ...\n",
"\n",
"X, y = make_data(net, 10, 10000000)\n",
"np.testing.assert_allclose(X[0][0].item(), 1.927, err_msg = \"Expected value of data is different!\", atol = 1e-3)\n",
"\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"execution": {}
},
"outputs": [],
"source": [
"# to_remove solution\n",
"set_seed(42)\n",
"\n",
"def make_data(net, n_in, n_examples):\n",
" \"\"\"\n",
" Generate data by sampling from a multivariate gaussian distribution, and output data by passing the inputs through the network.\n",
"\n",
" Inputs:\n",
" - net (nn.Sequential): network.\n",
" - n_in (int): input dimension.\n",
" - n_examples (int): number of data examples to generate.\n",
"\n",
" Outputs:\n",
" - X (torch.tensor): input data.\n",
" - y (torch.tensor): output data.\n",
" \"\"\"\n",
" X = torch.randn(n_examples, n_in)\n",
" y = net(X).detach()\n",
" return X, y\n",
"\n",
"X, y = make_data(net, 10, 10000000)\n",
"np.testing.assert_allclose(X[0][0].item(), 1.927, err_msg = \"Expected value of data is different!\", atol = 1e-3)"
]
},
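{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"A quick shape check (an optional sketch): with `net` defined as above, the inputs should have shape `(n_examples, n_in)`, the outputs shape `(n_examples, 1)`, and `y` should carry no gradient information.\n",
"\n",
"```python\n",
"print(X.shape, y.shape)  # torch.Size([10000000, 10]) torch.Size([10000000, 1])\n",
"print(y.requires_grad)   # False, thanks to .detach()\n",
"```"
]
},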
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Submit your feedback\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"# @title Submit your feedback\n",
"content_review(f\"{feedback_prefix}_generate_dataset\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"## Coding Exercise 4: Train model and compute loss"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"In this coding exercise, write a function that will train a given net on a given dataset. Function parameters include the network, the training inputs and outputs, the number of steps, and the learning rate. Set up loss function as MSE."
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"execution": {}
},
"source": [
"```python\n",
"set_seed(42)\n",
"\n",
"def train_model(net, X, y, n_epochs, lr, progressbar=True):\n",
" \"\"\"\n",
" Perform training of the network.\n",
"\n",
" Inputs:\n",
" - net (nn.Sequential): network.\n",
" - X (torch.tensor): input data.\n",
" - y (torch.tensor): output data.\n",
" - n_epochs (int): number of epochs to train the model for.\n",
" - lr (float): learning rate for optimizer (we will use `Adam` by default).\n",
" - progressbar (bool, default = True): whether to use additional bar for displaying training progress.\n",
"\n",
" Outputs:\n",
" - Es (np.ndarray): array which contains loss for each epoch.\n",
" \"\"\"\n",
" ###################################################################\n",
" ## Fill out the following then remove\n",
" raise NotImplementedError(\"Student exercise: complete training of the network.\")\n",
" ###################################################################\n",
"\n",
" # Set up optimizer\n",
" loss_fn = ...\n",
" optimizer = torch.optim.Adam(..., lr = ...)\n",
"\n",
" # Run training loop\n",
" Es = np.zeros(...)\n",
" for n in (tqdm(range(n_epochs)) if progressbar else range(n_epochs)):\n",
" y_pred = net(...)\n",
" loss = loss_fn(..., y)\n",
" optimizer.zero_grad()\n",
" loss.backward()\n",
" optimizer.step()\n",
" Es[n] = float(...)\n",
"\n",
" return Es\n",
"\n",
"Es = train_model(net, X, y, 10, 1e-3)\n",
"np.testing.assert_allclose(Es[0], 0.0, err_msg = \"Expected value of loss is different!\", atol = 1e-3)\n",
"\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"execution": {}
},
"outputs": [],
"source": [
"#to_remove solution\n",
"\n",
"set_seed(42)\n",
"\n",
"def train_model(net, X, y, n_epochs, lr, progressbar=True):\n",
" \"\"\"\n",
" Perform training of the network.\n",
"\n",
" Inputs:\n",
" - net (nn.Sequential): network.\n",
" - X (torch.tensor): input data.\n",
" - y (torch.tensor): output data.\n",
" - n_epochs (int): number of epochs to train the model for.\n",
" - lr (float): learning rate for optimizer (we will use `Adam` by default).\n",
" - progressbar (bool, default = True): whether to use additional bar for displaying training progress.\n",
"\n",
" Outputs:\n",
" - Es (np.ndarray): array which contains loss for each epoch.\n",
" \"\"\"\n",
"\n",
" # Set up optimizer\n",
" loss_fn = nn.MSELoss()\n",
" optimizer = torch.optim.Adam(net.parameters(), lr = lr)\n",
"\n",
" # Run training loop\n",
" Es = np.zeros(n_epochs)\n",
" for n in (tqdm(range(n_epochs)) if progressbar else range(n_epochs)):\n",
" y_pred = net(X)\n",
" loss = loss_fn(y_pred, y)\n",
" optimizer.zero_grad()\n",
" loss.backward()\n",
" optimizer.step()\n",
" Es[n] = float(loss.detach())\n",
"\n",
" return Es\n",
"\n",
"Es = train_model(net, X, y, 10, 1e-3)\n",
"np.testing.assert_allclose(Es[0], 0.0, err_msg = \"Expected value of loss is different!\", atol = 1e-3)"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"### Coding Exercise 4 Discussion\n",
"\n",
"Why do you think we obtain zero error right away (on the first epoch)?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"execution": {}
},
"outputs": [],
"source": [
"#to_remove explanation\n",
"\n",
"\"\"\"\n",
"Discussion: Why do you think we obtain zero error right away (on the first epoch)?\n",
"\n",
"The network we are training also generates the data. Thus, there is no\n",
"need to change weights at all, the gradient is zero.\n",
"\"\"\""
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"Now, write a helper function that computes the loss of a net on a dataset. It takes the following parameters: the network and the dataset inputs and outputs."
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"execution": {}
},
"source": [
"```python\n",
"def compute_loss(net, X, y):\n",
" \"\"\"\n",
" Calculate loss on given network and data.\n",
"\n",
" Inputs:\n",
" - net (nn.Sequential): network.\n",
" - X (torch.tensor): input data.\n",
" - y (torch.tensor): output data.\n",
"\n",
" Outputs:\n",
" - loss (float): computed loss.\n",
" \"\"\"\n",
" ###################################################################\n",
" ## Fill out the following then remove\n",
" raise NotImplementedError(\"Student exercise: complete loss calculation.\")\n",
" ###################################################################\n",
" loss_fn = ...\n",
"\n",
" y_pred = ...\n",
" loss = loss_fn(..., ...)\n",
" loss = float(...)\n",
" return loss\n",
"\n",
"loss = compute_loss(net, X, y)\n",
"np.testing.assert_allclose(loss, 0.0, err_msg = \"Expected value of loss is different!\", atol = 1e-3)\n",
"\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"execution": {}
},
"outputs": [],
"source": [
"#to_remove solution\n",
"\n",
"def compute_loss(net, X, y):\n",
" \"\"\"\n",
" Calculate loss on given network and data.\n",
"\n",
" Inputs:\n",
" - net (nn.Sequential): network.\n",
" - X (torch.tensor): input data.\n",
" - y (torch.tensor): output data.\n",
"\n",
" Outputs:\n",
" - loss (float): computed loss.\n",
" \"\"\"\n",
" loss_fn = nn.MSELoss()\n",
"\n",
" y_pred = net(X)\n",
" loss = loss_fn(y_pred, y)\n",
" loss = float(loss.detach())\n",
" return loss\n",
"\n",
"loss = compute_loss(net, X, y)\n",
"np.testing.assert_allclose(loss, 0.0, err_msg = \"Expected value of loss is different!\", atol = 1e-3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Submit your feedback\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"# @title Submit your feedback\n",
"content_review(f\"{feedback_prefix}_train_model_and_compute_loss\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"---\n",
"\n",
"# Section 2: Fitting a deep network with a shallow network\n",
"\n",
"Estimated timing to here from start of tutorial: 20 minutes\n",
"\n",
"We will now use the functions we've created to experiment with deep network fitting. In particular, we will see to what extent it is possible to fit a deep net using a shallow net. Specifically, we will fix a deep teacher and then fit it with a single-hidden-layer net with varying width value. In principle, if the number of hidden units is large enough, the error should be low. Let's see!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Video 3: Deep network fit with a shallow network\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"remove-input"
]
},
"outputs": [],
"source": [
"# @title Video 3: Deep network fit with a shallow network\n",
"\n",
"from ipywidgets import widgets\n",
"from IPython.display import YouTubeVideo\n",
"from IPython.display import IFrame\n",
"from IPython.display import display\n",
"\n",
"class PlayVideo(IFrame):\n",
" def __init__(self, id, source, page=1, width=400, height=300, **kwargs):\n",
" self.id = id\n",
" if source == 'Bilibili':\n",
" src = f'https://player.bilibili.com/player.html?bvid={id}&page={page}'\n",
" elif source == 'Osf':\n",
" src = f'https://mfr.ca-1.osf.io/render?url=https://osf.io/download/{id}/?direct%26mode=render'\n",
" super(PlayVideo, self).__init__(src, width, height, **kwargs)\n",
"\n",
"def display_videos(video_ids, W=400, H=300, fs=1):\n",
" tab_contents = []\n",
" for i, video_id in enumerate(video_ids):\n",
" out = widgets.Output()\n",
" with out:\n",
" if video_ids[i][0] == 'Youtube':\n",
" video = YouTubeVideo(id=video_ids[i][1], width=W,\n",
" height=H, fs=fs, rel=0)\n",
" print(f'Video available at https://youtube.com/watch?v={video.id}')\n",
" else:\n",
" video = PlayVideo(id=video_ids[i][1], source=video_ids[i][0], width=W,\n",
" height=H, fs=fs, autoplay=False)\n",
" if video_ids[i][0] == 'Bilibili':\n",
" print(f'Video available at https://www.bilibili.com/video/{video.id}')\n",
" elif video_ids[i][0] == 'Osf':\n",
" print(f'Video available at https://osf.io/{video.id}')\n",
" display(video)\n",
" tab_contents.append(out)\n",
" return tab_contents\n",
"\n",
"video_ids = [('Youtube', 'blzAxXqh1EU'), ('Bilibili', 'BV13i421e73V')]\n",
"tab_contents = display_videos(video_ids, W=730, H=410)\n",
"tabs = widgets.Tab()\n",
"tabs.children = tab_contents\n",
"for i in range(len(tab_contents)):\n",
" tabs.set_title(i, video_ids[i][0])\n",
"display(tabs)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Submit your feedback\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"# @title Submit your feedback\n",
"content_review(f\"{feedback_prefix}_deep_network_fit_with_a_shallow_network\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"## Coding Exercise 5: Create learning problem\n",
"\n",
"Create a \"deep\" teacher network that accepts inputs of size 5. Give the network a width of 5 and a depth of 5. Use this to generate both a training and test set with 4000 examples for training and 1000 for testing. Initialize weights with a standard deviation of 2.0."
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"execution": {}
},
"source": [
"```python\n",
"###################################################################\n",
"## Fill out the following then remove\n",
"raise NotImplementedError(\"Student exercise: complete set up.\")\n",
"###################################################################\n",
"torch.manual_seed(-1)\n",
"\n",
"# Create teacher\n",
"n_in = ... # input dimension\n",
"W_teacher, D_teacher = ..., ... # teacher width, depth\n",
"sigma_teacher = ... # teacher weight variance\n",
"teacher = make_MLP(..., ..., ...)\n",
"initialize_layers(..., ...)\n",
"\n",
"# generate train and test set\n",
"N_train, N_test = ..., ...\n",
"X_train, y_train = make_data(..., ..., ...)\n",
"X_test, y_test = make_data(..., ..., ...)\n",
"\n",
"np.testing.assert_allclose(X_test[0][0].item(), 0.19076240062713623, err_msg = \"Expected value of data is different!\")\n",
"\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"execution": {}
},
"outputs": [],
"source": [
"#to_remove solution\n",
"torch.manual_seed(-1)\n",
"\n",
"# Create teacher\n",
"n_in = 5 # input dimension\n",
"W_teacher, D_teacher = 5, 5 # teacher width, depth\n",
"sigma_teacher = 2 # teacher weight variance\n",
"teacher = make_MLP(n_in, W_teacher, D_teacher)\n",
"initialize_layers(teacher, sigma_teacher)\n",
"\n",
"# generate train and test set\n",
"N_train, N_test = 4000, 1000\n",
"X_train, y_train = make_data(teacher, n_in, N_train)\n",
"X_test, y_test = make_data(teacher, n_in, N_test)\n",
"\n",
"np.testing.assert_allclose(X_test[0][0].item(), 0.19076240062713623, err_msg = \"Expected value of data is different!\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"### Coding Exercise 5 Discussion\n",
"\n",
"1. What is the minimum error achievable by an MLP on the generated problem?\n",
"2. What is the minimum error achievable by a 1-hidden-layer MLP?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"execution": {}
},
"outputs": [],
"source": [
"#to_remove explanation\n",
"\n",
"\"\"\"\n",
"Discussion: 1. What is the minimum error achievable by an MLP on the generated problem?\n",
"2. What is the minimum error achievable by a 1-hidden-layer MLP?\n",
"\n",
"1. This is a trick question! We generated the data ourselves; the teacher network is an MLP. In principle, a student network with the same architecture could learn the exact weights of the teacher and achieve exactly 0 error.\n",
"2. By the universal approximator theorem, we can approximate the teacher network arbitrarily well with a 1-hidden-layer MLP, as long as there is not limit on the number of hidden units. So the answer is technically 0. In practice, however, when fitting a complex function, for example a deep teacher network, the number of hidden units required for low error can be impractical.\n",
"\"\"\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Submit your feedback\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"# @title Submit your feedback\n",
"content_review(f\"{feedback_prefix}_create_learning_problem\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"## Coding Exercise 6: Train net with the same architecture\n",
"\n",
"Create a student network with the same architecture as the teacher network - that is, the same width and depth. Train it and confirm that a network with the same architecture can indeed achieve low test error. You may need to train for a large number of iterations, and you may need to adjust the learning rate as learning proceeds.\n",
"\n",
"First, let's confirm that the number of training examples is greater than 3 times the number of parameters, so we have enough data to train the network."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"execution": {}
},
"outputs": [],
"source": [
"n_in = 5\n",
"W_student, D_student = 5, 5\n",
"student = make_MLP(n_in, W_student, D_student)\n",
"\n",
"# make sure we have enough data\n",
"P = get_num_params(n_in, W_student, D_student)\n",
"assert(N_train > 3*P)"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"Now, let's train the student and observe the loss on a semi-log plot (the y-axis is logarithmic)! Your task is to complete the missing parts of the code. While the model is training training, you can go to the next coding exercise and return back to observe the results (it will take approximately 5 minutes)."
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"execution": {}
},
"source": [
"```python\n",
"###################################################################\n",
"## Fill out the following then remove\n",
"raise NotImplementedError(\"Student exercise: train student on the generated data from teacher.\")\n",
"###################################################################\n",
"lr = 0.003\n",
"Es_deep = []\n",
"for i in range(4):\n",
" Es_deep.append(train_model(..., ..., ..., 50000, ...))\n",
" #observe we reduce learning rate\n",
" lr /= 3\n",
"Es_deep = np.array(Es_deep)\n",
"Es_deep = Es_deep.ravel()\n",
"\n",
"# evaluate test error\n",
"loss_deep = compute_loss(..., ..., ...) / float(y_test.var())\n",
"print(\"Loss of deep student: \",loss_deep)\n",
"plot_loss(Es_deep)\n",
"\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"execution": {}
},
"outputs": [],
"source": [
"#to_remove solution\n",
"lr = 0.003\n",
"Es_deep = []\n",
"for i in range(4):\n",
" Es_deep.append(train_model(student, X_train, y_train, 50000, lr))\n",
" #observe we reduce learning rate\n",
" lr /= 3\n",
"Es_deep = np.array(Es_deep)\n",
"Es_deep = Es_deep.ravel()\n",
"\n",
"# evaluate test error\n",
"loss_deep = compute_loss(student, X_test, y_test) / float(y_test.var())\n",
"print(\"Loss of deep student: \",loss_deep)\n",
"plot_loss(Es_deep)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Submit your feedback\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"# @title Submit your feedback\n",
"content_review(f\"{feedback_prefix}_train_net_with_the_same_architecture\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"## Coding Exercise 7: Train a 2 layer neural net with varying width"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"Let us now try to fit the deep teacher network with a shallow student network. Let's give the student a single hidden layer, and let's study the error as a function of the student width $W_s$. For a range of widths between, say, 5 and 200, create a student network, train it on the training set, and compute its test error. The training time will take approximately 2 minutes.\n",
"\n",
"Then, plot the training and testing errors as a function of width on a log-log plot. How does the error of the shallow network compare to that of the deep network? "
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"execution": {}
},
"source": [
"```python\n",
"D_student = 2 # student depth\n",
"Ws_student = np.array([5, 15, 45, 135]) # widths\n",
"\n",
"lr = 1e-3\n",
"n_epochs = 20000\n",
"Es_shallow_train = np.zeros((len(Ws_student), n_epochs))\n",
"Es_shallow_test = np.zeros(len(Ws_student))\n",
"\n",
"###################################################################\n",
"## Fill out the following then remove\n",
"raise NotImplementedError(\"Student exercise: train different students on the already generated data from teacher.\")\n",
"###################################################################\n",
"\n",
"for index, W_student in enumerate(tqdm(Ws_student)):\n",
"\n",
" student = make_MLP(..., ..., ...)\n",
"\n",
" # make sure we have enough data\n",
" P = get_num_params(n_in, W_student, D_student)\n",
" assert(N_train > 3*P)\n",
"\n",
" # train\n",
" Es_shallow_train[index] = train_model(..., ..., ..., ..., lr, progressbar=False)\n",
" Es_shallow_train[index] /= y_test.var()\n",
"\n",
" # evaluate test error\n",
" loss = compute_loss(..., ..., ...)/y_test.var()\n",
" Es_shallow_test[index] = ...\n",
"\n",
"plot_loss_as_function_of_width(Ws_student, Es_shallow_test, Es_shallow_train)\n",
"\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"execution": {}
},
"outputs": [],
"source": [
"#to_remove solution\n",
"D_student = 2 # student depth\n",
"Ws_student = np.array([5, 15, 45, 135]) # widths\n",
"\n",
"lr = 1e-3\n",
"n_epochs = 20000\n",
"Es_shallow_train = np.zeros((len(Ws_student), n_epochs))\n",
"Es_shallow_test = np.zeros(len(Ws_student))\n",
"\n",
"\n",
"for index, W_student in enumerate(tqdm(Ws_student)):\n",
"\n",
" student = make_MLP(n_in, W_student, D_student)\n",
"\n",
" # make sure we have enough data\n",
" P = get_num_params(n_in, W_student, D_student)\n",
" assert(N_train > 3*P)\n",
"\n",
" # train\n",
" Es_shallow_train[index] = train_model(student, X_train, y_train, n_epochs, lr, progressbar=False)\n",
" Es_shallow_train[index] /= y_test.var()\n",
"\n",
" # evaluate test error\n",
" loss = compute_loss(student, X_test, y_test)/y_test.var()\n",
" Es_shallow_test[index] = loss\n",
"\n",
"plot_loss_as_function_of_width(Ws_student, Es_shallow_test, Es_shallow_train)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Submit your feedback\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"# @title Submit your feedback\n",
"content_review(f\"{feedback_prefix}_train_two_layer_net_with_varying_width\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"## Coding Exercise 8: Network size prediction"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"Let's suppose that the test error will continue to improve with increasing width according to the same trend in the previous plot - which is probably too optimistic but will let us do some back-of-the-envelope calculations. Specifically, let us assume there is a linear relationship\n",
"\n",
"$$ \\log E=m \\log W+b$$\n",
"between the log of the width and the log of the error. Fit this linear model from our experiment and use it to predict the number of hidden units needed to achieve a relative error of, say, $10^{-6}$."
]
},
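{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"Solving the linear model for the width gives\n",
"\n",
"$$W = \exp\left(\frac{\log E - b}{m}\right),$$\n",
"\n",
"which is the expression to evaluate at the target error once $m$ and $b$ have been fit."
]
},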
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"execution": {}
},
"source": [
"```python\n",
"error_target = 1e-6\n",
"\n",
"###################################################################\n",
"## Fill out the following then remove\n",
"raise NotImplementedError(\"Student exercise: fit linear model and predict the number of hidden units.\")\n",
"###################################################################\n",
"\n",
"m,b = np.polyfit(np.log(...), np.log(...), 1)\n",
"print('Predicted width: ', np.exp((np.log(...) - ...) / ...))\n",
"\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"execution": {}
},
"outputs": [],
"source": [
"# to_remove solution\n",
"error_target = 1e-6\n",
"\n",
"m,b = np.polyfit(np.log(Ws_student), np.log(Es_shallow_test), 1)\n",
"print('Predicted width: ', np.exp((np.log(error_target) - b) / m))"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"Based on this, do you think that a reasonably sized shallow network could learn this task with low error? "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Submit your feedback\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"# @title Submit your feedback\n",
"content_review(f\"{feedback_prefix}_network_size_prediction\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"---\n",
"\n",
"# Section 3: Deep networks in the quasilinear regime\n",
"\n",
"Estimated timing to here from start of tutorial: 45 minutes\n",
"\n",
"We've just shown that certain deep networks are difficult to fit. In this section, we will discuss a regime in which a shallow network is able to approximate a deep teacher relatively well."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Video 4: Deep networks in the quasilinear regime\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"remove-input"
]
},
"outputs": [],
"source": [
"# @title Video 4: Deep networks in the quasilinear regime\n",
"\n",
"from ipywidgets import widgets\n",
"from IPython.display import YouTubeVideo\n",
"from IPython.display import IFrame\n",
"from IPython.display import display\n",
"\n",
"class PlayVideo(IFrame):\n",
" def __init__(self, id, source, page=1, width=400, height=300, **kwargs):\n",
" self.id = id\n",
" if source == 'Bilibili':\n",
" src = f'https://player.bilibili.com/player.html?bvid={id}&page={page}'\n",
" elif source == 'Osf':\n",
" src = f'https://mfr.ca-1.osf.io/render?url=https://osf.io/download/{id}/?direct%26mode=render'\n",
" super(PlayVideo, self).__init__(src, width, height, **kwargs)\n",
"\n",
"def display_videos(video_ids, W=400, H=300, fs=1):\n",
" tab_contents = []\n",
" for i, video_id in enumerate(video_ids):\n",
" out = widgets.Output()\n",
" with out:\n",
" if video_ids[i][0] == 'Youtube':\n",
" video = YouTubeVideo(id=video_ids[i][1], width=W,\n",
" height=H, fs=fs, rel=0)\n",
" print(f'Video available at https://youtube.com/watch?v={video.id}')\n",
" else:\n",
" video = PlayVideo(id=video_ids[i][1], source=video_ids[i][0], width=W,\n",
" height=H, fs=fs, autoplay=False)\n",
" if video_ids[i][0] == 'Bilibili':\n",
" print(f'Video available at https://www.bilibili.com/video/{video.id}')\n",
" elif video_ids[i][0] == 'Osf':\n",
" print(f'Video available at https://osf.io/{video.id}')\n",
" display(video)\n",
" tab_contents.append(out)\n",
" return tab_contents\n",
"\n",
"video_ids = [('Youtube', 'XuAcOiqJuDs'), ('Bilibili', 'BV1CT421e7Q7')]\n",
"tab_contents = display_videos(video_ids, W=730, H=410)\n",
"tabs = widgets.Tab()\n",
"tabs.children = tab_contents\n",
"for i in range(len(tab_contents)):\n",
" tabs.set_title(i, video_ids[i][0])\n",
"display(tabs)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Submit your feedback\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"# @title Submit your feedback\n",
"content_review(f\"{feedback_prefix}_deep_networks_in_the_quasilinear_regime\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"One of the reasons that shallow nets cannot fit deep nets, in general, is that random deep nets, in certain regimes, behave like chaotic systems: each layer can be thought of as a single step of a dynamical system, and the number of layers plays the role of the number of time steps. A deep network, therefore, effectively subjects its input to long-time chaotic dynamics, which are, almost by definition, very difficult to predict accurately. In particular, *shallow* nets simply cannot capture the complex mapping implemented by deeper networks without resorting to an astronomical number of hidden units. Another way to interpret this behavior is that the many layers of a deep network repeatedly stretch and fold their inputs, allowing the network to implement a large number of complex functions - an idea known as **expressivity** ([Poole et al. 2016](https://papers.nips.cc/paper_files/paper/2016/hash/148510031349642de5ca0c544f31b2ef-Abstract.html)).\n",
"\n",
"However, in other regimes, for example, when the weights of the teacher network are small, the dynamics implemented by the teacher network are no longer chaotic. In fact, for small enough weights, they are nearly linear. In this regime, we'd expect a shallow network to be able to approximate a deep teacher relatively well.\n",
"\n",
"For more on these ideas, see the paper\n",
"\n",
"[*Exponential expressivity in deep neural networks through transient chaos*](https://papers.nips.cc/paper_files/paper/2016/hash/148510031349642de5ca0c544f31b2ef-Abstract.html) Poole et al. Neurips (2016).\n",
"\n",
"To test this idea, we'll repeat the exercise above, this time initializing the teacher weights with a small $\\sigma$, say, $0.4$, so that the teacher network is quasi-linear."
]
},
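{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"To see the chaotic and non-chaotic regimes directly, here is a minimal sketch of our own (the width, depth, and perturbation size are illustrative assumptions): we push two nearby inputs through the same random network, built with the `make_MLP` and `initialize_layers` functions from above, and track the distance between their hidden activities layer by layer, once with $\sigma = 2$ and once with $\sigma = 0.4$.\n",
"\n",
"```python\n",
"set_seed(0)\n",
"for sigma in [2.0, 0.4]:\n",
"    net_chaos = make_MLP(5, 200, 10)  # a wide net with many layers\n",
"    initialize_layers(net_chaos, sigma)\n",
"    h1 = torch.randn(1, 5)\n",
"    h2 = h1 + 1e-3 * torch.randn(1, 5)  # tiny perturbation of the same input\n",
"    dists = []\n",
"    with torch.no_grad():\n",
"        for layer in list(net_chaos.children())[:-1]:  # stop before the scalar readout\n",
"            h1, h2 = layer(h1), layer(h2)\n",
"            if isinstance(layer, nn.Tanh):\n",
"                dists.append(float((h1 - h2).norm()))\n",
"    print(f\"sigma = {sigma}:\", [f\"{d:.1e}\" for d in dists])\n",
"```\n",
"\n",
"With $\sigma = 2$, the distance between the two trajectories should grow across layers (chaos); with $\sigma = 0.4$, it should shrink (the quasi-linear regime)."
]
},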
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"## Coding Exercise 9: Create dataset & Train a student network\n",
"\n",
"Create training and test sets. Initialize the teacher network with $\\sigma_{t} = 0.4$."
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"execution": {}
},
"source": [
"```python\n",
"###################################################################\n",
"## Fill out the following then remove\n",
"raise NotImplementedError(\"Student exercise: complete set up.\")\n",
"###################################################################\n",
"torch.manual_seed(-1)\n",
"\n",
"# Create teacher\n",
"n_in = 5 # input dimension\n",
"W_teacher, D_teacher = 5, 5 # teacher width, depth\n",
"sigma_teacher = ... # teacher weight variance\n",
"teacher = make_MLP(..., ..., ...)\n",
"initialize_layers(..., ...)\n",
"\n",
"# generate train and test set\n",
"N_train, N_test = 4000, 1000\n",
"X_train, y_train = make_data(..., ..., ...)\n",
"X_test, y_test = make_data(..., ..., ...)\n",
"\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"execution": {}
},
"outputs": [],
"source": [
"#to_remove solution\n",
"torch.manual_seed(-1)\n",
"\n",
"# Create teacher\n",
"n_in = 5 # input dimension\n",
"W_teacher, D_teacher = 5, 5 # teacher width, depth\n",
"sigma_teacher = 0.4 # teacher weight variance\n",
"teacher = make_MLP(n_in, W_teacher, D_teacher)\n",
"initialize_layers(teacher, sigma_teacher)\n",
"\n",
"# generate train and test set\n",
"N_train, N_test = 4000, 1000\n",
"X_train, y_train = make_data(teacher, n_in, N_train)\n",
"X_test, y_test = make_data(teacher, n_in, N_test)"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"Give the student network a single hidden layer with $10$ units. Train it for a similar amount of time as before. Determine the relative MSE."
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"execution": {}
},
"source": [
"```python\n",
"###################################################################\n",
"## Fill out the following then remove\n",
"raise NotImplementedError(\"Student exercise: train student on the generated data from special teacher.\")\n",
"###################################################################\n",
"\n",
"W_student, D_student = ..., ... # student width, depth\n",
"\n",
"lr = 1e-3\n",
"n_epochs = 20000\n",
"Es_shallow_train = np.zeros((len(Ws_student),n_epochs))\n",
"Es_shallow_test = np.zeros(len(Ws_student))\n",
"\n",
"student = make_MLP(..., ..., ...)\n",
"initialize_layers(student, sigma_teacher)\n",
"\n",
"# make sure we have enough data\n",
"P = get_num_params(n_in, W_student, D_student)\n",
"assert(N_train > 3*P)\n",
"\n",
"# train\n",
"Es_shallow_train = train_model(..., ..., ..., n_epochs, lr, progressbar=True)\n",
"\n",
"# # evaluate test error\n",
"Es_shallow_test = compute_loss(..., ..., ...)/float(y_test.var())\n",
"print('Shallow student loss: ',Es_shallow_test)\n",
"plot_students_predictions_vs_teacher_values(Es_shallow_train, X_test, y_test)\n",
"\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"execution": {}
},
"outputs": [],
"source": [
"#to_remove solution\n",
"W_student, D_student = 10, 2 # student width, depth\n",
"\n",
"lr = 1e-3\n",
"n_epochs = 20000\n",
"Es_shallow_train = np.zeros((len(Ws_student),n_epochs))\n",
"Es_shallow_test = np.zeros(len(Ws_student))\n",
"\n",
"student = make_MLP(n_in, W_student, D_student)\n",
"initialize_layers(student, sigma_teacher)\n",
"\n",
"# make sure we have enough data\n",
"P = get_num_params(n_in, W_student, D_student)\n",
"assert(N_train > 3*P)\n",
"\n",
"# train\n",
"Es_shallow_train = train_model(student, X_train, y_train, n_epochs, lr, progressbar=True)\n",
"\n",
"# # evaluate test error\n",
"Es_shallow_test = compute_loss(student, X_test, y_test)/float(y_test.var())\n",
"print('Shallow student loss: ',Es_shallow_test)\n",
"plot_students_predictions_vs_teacher_values(Es_shallow_train, X_test, y_test)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Submit your feedback\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"# @title Submit your feedback\n",
"content_review(f\"{feedback_prefix}_create_dataset_train_student_network\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Video 5: Conclusion & Interactive Demo\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"remove-input"
]
},
"outputs": [],
"source": [
"# @title Video 5: Conclusion & Interactive Demo\n",
"\n",
"from ipywidgets import widgets\n",
"from IPython.display import YouTubeVideo\n",
"from IPython.display import IFrame\n",
"from IPython.display import display\n",
"\n",
"class PlayVideo(IFrame):\n",
" def __init__(self, id, source, page=1, width=400, height=300, **kwargs):\n",
" self.id = id\n",
" if source == 'Bilibili':\n",
" src = f'https://player.bilibili.com/player.html?bvid={id}&page={page}'\n",
" elif source == 'Osf':\n",
" src = f'https://mfr.ca-1.osf.io/render?url=https://osf.io/download/{id}/?direct%26mode=render'\n",
" super(PlayVideo, self).__init__(src, width, height, **kwargs)\n",
"\n",
"def display_videos(video_ids, W=400, H=300, fs=1):\n",
" tab_contents = []\n",
" for i, video_id in enumerate(video_ids):\n",
" out = widgets.Output()\n",
" with out:\n",
" if video_ids[i][0] == 'Youtube':\n",
" video = YouTubeVideo(id=video_ids[i][1], width=W,\n",
" height=H, fs=fs, rel=0)\n",
" print(f'Video available at https://youtube.com/watch?v={video.id}')\n",
" else:\n",
" video = PlayVideo(id=video_ids[i][1], source=video_ids[i][0], width=W,\n",
" height=H, fs=fs, autoplay=False)\n",
" if video_ids[i][0] == 'Bilibili':\n",
" print(f'Video available at https://www.bilibili.com/video/{video.id}')\n",
" elif video_ids[i][0] == 'Osf':\n",
" print(f'Video available at https://osf.io/{video.id}')\n",
" display(video)\n",
" tab_contents.append(out)\n",
" return tab_contents\n",
"\n",
"video_ids = [('Youtube', 'LLLXnOqUeoM'), ('Bilibili', 'BV1Zx4y1t7iN')]\n",
"tab_contents = display_videos(video_ids, W=730, H=410)\n",
"tabs = widgets.Tab()\n",
"tabs.children = tab_contents\n",
"for i in range(len(tab_contents)):\n",
" tabs.set_title(i, video_ids[i][0])\n",
"display(tabs)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Submit your feedback\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"# @title Submit your feedback\n",
"content_review(f\"{feedback_prefix}_conclusion_interactive_demo\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"## Interactive Demo 1: Deep networks expressivity\n",
"\n",
"In this demo, we invite you to explore the expressivity of two distinct deep networks already introduced earlier: one with $\\sigma = 2$ and another (quasi-linear) with $\\sigma = 0.4$. \n",
"\n",
"We initialize two deep networks with $D=20$ layers with $W = 100$ hidden units each but different variances in their random parameters. Then, 400 input data points are generated on a unit circle. We will examine how these points are propagated through the networks.\n",
"\n",
"To visualize each layer's activity, we randomly project it into 3 dimensions. The slider below controls which layer you are seeing. On the left, you'll see how a standard network processes its inputs, and on the right, how a quasi-linear network does so. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" Execute the cell to observe interactive widget\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"# @markdown Execute the cell to observe interactive widget\n",
"\n",
"set_seed(42)\n",
"\n",
"W = 100 #width\n",
"D = 20 #depth\n",
"P = 400 #number of points\n",
"sigma_1 = 2 #standard net\n",
"sigma_2 = 0.4 #quasi-linear net\n",
"\n",
"colors = plt.cm.hsv(np.linspace(0, 1, P)) #color\n",
"random_projection = np.random.normal(size = (W, 3)) #random projection\n",
"\n",
"traj_1, traj_2 = generate_trajectories(W, D, P, sigma_1, sigma_2)\n",
"\n",
"#project trajectories from 100-D to 3-D\n",
"projected_traj_1 = traj_1 @ random_projection\n",
"projected_traj_2 = traj_2 @ random_projection\n",
"\n",
"@widgets.interact\n",
"def expressivity_interactive_visualization(layer = widgets.IntSlider(description=\"Layer\", min=0, max=18, step=1, value=0)):\n",
" expressivity_visualization(layer, projected_traj_1, projected_traj_2, colors)"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"### Interactive Demo 1 Discussion\n",
"\n",
"1. What is the qualitative difference between trajectories propagation through these networks? Does it fit what we have seen earlier with wide student approximation?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"execution": {}
},
"outputs": [],
"source": [
"#to_remove explanation\n",
"\n",
"\"\"\"\n",
"Discussion: What is the qualitative difference between trajectories propagation through these networks? Does it fit what we have seen earlier with wide student approximation?\n",
"\n",
"Indeed, a standard network (with sigma = 2) is much more expressive; it folds the space here and there, creating vivid and tangled representations with each additional layer, whereas the quasi-linear network preserves the original structure.\n",
"It is in line with the experiments on wide student approximation as shallow and wide networks cannot express the tangled representation which a standard net creates.\n",
"\"\"\""
]
},
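{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"To quantify the folding seen in the demo, here is a minimal sketch (plain NumPy, independent of tutorial helpers such as `generate_trajectories`): we propagate the same unit circle through random $\\tanh$ networks with both weight scales and measure the Euclidean length of the resulting curve every few layers. Poole et al. (2016) show that in the chaotic regime this length grows exponentially with depth, while with $\\sigma = 0.4$ it stays modest (here it actually shrinks, since $\\sigma < 1$)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"execution": {}
},
"outputs": [],
"source": [
"# Minimal sketch (not the tutorial's helper code): length of the propagated\n",
"# circle as a function of depth, for the chaotic and quasi-linear scales.\n",
"import numpy as np\n",
"\n",
"def curve_length(points):\n",
"    # sum of distances between consecutive points along the curve\n",
"    return np.linalg.norm(np.diff(points, axis=0), axis=1).sum()\n",
"\n",
"rng = np.random.default_rng(42)\n",
"W, D, P = 100, 20, 400\n",
"theta = np.linspace(0, 2 * np.pi, P)\n",
"\n",
"# embed the unit circle in the first two of the W input dimensions\n",
"h = {2.0: np.zeros((P, W)), 0.4: np.zeros((P, W))}\n",
"for sigma in h:\n",
"    h[sigma][:, 0], h[sigma][:, 1] = np.cos(theta), np.sin(theta)\n",
"\n",
"for layer in range(1, D + 1):\n",
"    for sigma in h:\n",
"        A = rng.normal(scale=sigma / np.sqrt(W), size=(W, W))\n",
"        h[sigma] = np.tanh(h[sigma] @ A)\n",
"    if layer % 5 == 0:\n",
"        print(f\"layer {layer:2d}: length(sigma=2.0) = {curve_length(h[2.0]):10.1f}   \"\n",
"              f\"length(sigma=0.4) = {curve_length(h[0.4]):.3f}\")"
]
},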
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Submit your feedback\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"# @title Submit your feedback\n",
"content_review(f\"{feedback_prefix}_deep_network_expressivity\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"---\n",
"# Summary\n",
"\n",
"*Estimated timing of tutorial: 1 hour*"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"In this tutorial:\n",
"- We discussed the **universal approximator theorem**, which guarantees that we can approximate any complex function using a network with a single hidden layer.\n",
"- To test this idea, we built a deep *teacher* network and attempted to fit it with a shallow *student* network.\n",
"- We found that achieving good performance requires a very wide network - i.e., a very large number of hidden units.\n",
"- We found that if the teacher network is initialized with very small weights, the fitting becomes very easy.\n",
"- We discussed how the fitting difficulty is related to whether the teacher is initialized in the **chaotic** regime.\n",
"- Chaotic behavior is related to network **expressivity**, the network's ability to implement a large number of complex functions."
]
}
],
"metadata": {
"colab": {
"collapsed_sections": [],
"include_colab_link": true,
"name": "W2D1_Tutorial1",
"provenance": [],
"toc_visible": true
},
"kernel": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.19"
}
},
"nbformat": 4,
"nbformat_minor": 4
}