{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "191ddb10-b898-4fd0-9b91-0c46ca631d70",
   "metadata": {
    "execution": {}
   },
   "source": [
    "<a href=\"https://colab.research.google.com/github/neuromatch/NeuroAI_Course/blob/main/tutorials/W1D3_ComparingArtificialAndBiologicalNetworks/student/W1D3_Tutorial4.ipynb\" target=\"_blank\"><img alt=\"Open In Colab\" src=\"https://colab.research.google.com/assets/colab-badge.svg\"/></a>   <a href=\"https://kaggle.com/kernels/welcome?src=https://raw.githubusercontent.com/neuromatch/NeuroAI_Course/main/tutorials/W1D3_ComparingArtificialAndBiologicalNetworks/student/W1D3_Tutorial4.ipynb\" target=\"_blank\"><img alt=\"Open in Kaggle\" src=\"https://kaggle.com/static/images/open-in-kaggle.svg\"/></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5cb666b5-7c0a-4486-a687-a8b0650f00a5",
   "metadata": {
    "execution": {}
   },
   "source": [
    "# Tutorial 4: Representational geometry & noise\n",
    "\n",
    "**Week 1, Day 3: Comparing Artificial And Biological Networks**\n",
    "\n",
    "**By Neuromatch Academy**\n",
    "\n",
    "__Content creators:__ Wenxuan Guo, Heiko Schütt\n",
    "\n",
    "__Content reviewers:__ Alish Dipani, Samuele Bolotta, Yizhou Chen, RyeongKyung Yoon, Ruiyi Zhang, Lily Chamakura, Hlib Solodzhuk\n",
    "\n",
    "__Production editors:__ Konstantine Tsafatinos, Ella Batty, Spiros Chavlis, Samuele Bolotta, Hlib Solodzhuk, Patrick Mineault\n",
    "\n",
    "Acknowledgments: the tutorial outline was written by Heiko Schütt. The content was greatly improved by discussions with Heiko, Hlib, and Alish, and the insightful illustrations presented in the paper by Walther et al. (2016)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2dd91100-1262-458d-995b-4634f1feb2ea",
   "metadata": {
    "execution": {}
   },
   "source": [
    "___\n",
    "\n",
    "\n",
    "# Tutorial Objectives\n",
    "\n",
    "Estimated timing of tutorial: 45 minutes.\n",
    "\n",
    "This tutorial is about measuring the distances that define geometries. In many fields, distances must be estimated from limited and possibly noisy samples.\n",
    "In AI, we may want to compare models on the basis of samples of their units and inputs, and models may be stochastic. In computational neuroscience, we usually have recordings for samples of neurons, and data are affected by noise.\n",
    "\n",
    "By completing this tutorial, you will gain insights into:\n",
    "\n",
    "1. Generating simulated neural data with different noise distributions. This section will guide you through the process of creating simulated neural data and introducing variability in noise distributions. This step will illustrate how different noise levels can affect data representation and subsequent analyses.\n",
    "\n",
    "2. The Euclidean distance and Mahalanobis distance. Each of these metrics offers unique insights into the geometry of the data.\n",
    "\n",
    "3. The relationship between distance metrics and binary classification performance. This part of the tutorial emphasizes the relationship between distance measurements and the performance of binary classification tasks on a given pair of stimuli. Understanding this relationship will help us develop more accurate and robust classification models.\n",
    "\n",
    "4. The positive bias that arises when measuring distances from noisy pattern estimates and how cross-validation can correct this bias and give more accurate estimates of the underlying noise-free distance.\n",
    "\n",
    "5. Using random projections to estimate distances. This section introduces the Johnson–Lindenstrauss Lemma, which states that random projections maintain the integrity of distance estimates in a lower-dimensional space. This concept is crucial for reducing dimensionality while preserving the relational structure of the data.\n",
    "\n",
    "We will adhere to the notational conventions established by [Walther et al. (2016)](https://pubmed.ncbi.nlm.nih.gov/26707889/) for all discussed distance measures. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b7cdfb17-5f87-4bcf-ac34-5248b9ef755e",
   "metadata": {
    "cellView": "form",
    "execution": {},
    "tags": [
     "remove-input"
    ]
   },
   "outputs": [],
   "source": [
    "# @markdown\n",
    "from IPython.display import IFrame\n",
    "from ipywidgets import widgets\n",
    "out = widgets.Output()\n",
    "with out:\n",
    "    print(f\"If you want to download the slides: https://osf.io/download/uwn2g/\")\n",
    "    display(IFrame(src=f\"https://mfr.ca-1.osf.io/render?url=https://osf.io/uwn2g/?direct%26mode=render%26action=download%26mode=render\", width=730, height=410))\n",
    "display(out)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "14bde731-fa99-434e-9831-01b15ac5579b",
   "metadata": {
    "execution": {}
   },
   "source": [
    "---\n",
    "# Setup"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##  Install dependencies and import feedback gadget\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6fd30101-61a3-4f72-a9f3-2ca16586464d",
   "metadata": {
    "cellView": "form",
    "execution": {},
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "# @title Install dependencies and import feedback gadget\n",
    "\n",
    "!pip install numpy xarray scipy scikit-learn matplotlib seaborn tqdm vibecheck datatops --quiet\n",
    "!pip install rsatoolbox==0.1.5 --quiet\n",
    "\n",
    "from vibecheck import DatatopsContentReviewContainer\n",
    "def content_review(notebook_section: str):\n",
    "    return DatatopsContentReviewContainer(\n",
    "        \"\",  # No text prompt\n",
    "        notebook_section,\n",
    "        {\n",
    "            \"url\": \"https://pmyvdlilci.execute-api.us-east-1.amazonaws.com/klab\",\n",
    "            \"name\": \"neuromatch_neuroai\",\n",
    "            \"user_key\": \"wb2cxze8\",\n",
    "        },\n",
    "    ).render()\n",
    "\n",
    "feedback_prefix = \"W1D3_T4\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##  Import dependencies\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "62ce1d24-edd9-487a-8f5d-4506c05a50a1",
   "metadata": {
    "cellView": "form",
    "execution": {},
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "# @title Import dependencies\n",
    "\n",
    "import logging\n",
    "from tqdm import tqdm\n",
    "from itertools import combinations\n",
    "import numpy as np\n",
    "import xarray as xr\n",
    "from scipy.stats import multivariate_normal\n",
    "from sklearn.model_selection import StratifiedKFold\n",
    "from sklearn.discriminant_analysis import LinearDiscriminantAnalysis\n",
    "\n",
    "import rsatoolbox.data as rsd\n",
    "import rsatoolbox.rdm as rsr\n",
    "import scipy\n",
    "from scipy.spatial.distance import squareform\n",
    "\n",
    "import matplotlib\n",
    "import matplotlib.pyplot as plt\n",
    "import matplotlib.pylab as pl\n",
    "from matplotlib.colors import ListedColormap\n",
    "from mpl_toolkits.axes_grid1 import make_axes_locatable\n",
    "import seaborn as sns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##  Figure settings\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "90cf8ee2-9e0a-46c3-aca9-69974e66455f",
   "metadata": {
    "cellView": "form",
    "execution": {},
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "# @title Figure settings\n",
    "\n",
    "logging.getLogger('matplotlib.font_manager').disabled = True\n",
    "\n",
    "%matplotlib inline\n",
    "%config InlineBackend.figure_format = 'retina' # perfrom high definition rendering for images and plots\n",
    "plt.style.use(\"https://raw.githubusercontent.com/NeuromatchAcademy/course-content/main/nma.mplstyle\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##  Helper functions\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "127ad2ef-8b00-48c4-b8f2-1ecc7d58733d",
   "metadata": {
    "cellView": "form",
    "execution": {},
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "# @title Helper functions\n",
    "\n",
    "def compute_classifier_acc(classifier, neural_data, cov):\n",
    "    \"\"\"\n",
    "    Compute the accuracy of a classifier for all combinations of stimulus pairs in the neural data.\n",
    "\n",
    "    Parameters:\n",
    "    - neural_data (xarray.DataArray): neural data with dimensions \"stim\" and \"neuron\".\n",
    "\n",
    "    Returns:\n",
    "    - acc (xarray.DataArray): Accuracy matrix with dimensions \"stim1\" and \"stim2\".\n",
    "    \"\"\"\n",
    "    n_stimuli = len(np.unique(neural_data.stim.values))\n",
    "    coords = {\"stim1\": np.arange(n_stimuli), \"stim2\": np.arange(n_stimuli)}\n",
    "    acc_init = np.zeros([n_stimuli, n_stimuli])\n",
    "    acc = np2xr(acc_init, coords)\n",
    "    whitening_matrix = np.linalg.inv(np.linalg.cholesky(cov))\n",
    "    whitened_data = neural_data.values @ whitening_matrix\n",
    "    neural_data = xr.DataArray(whitened_data, dims=neural_data.dims, coords=neural_data.coords)\n",
    "\n",
    "    for i_stim_idx, j_stim_idx in tqdm(combinations(np.arange(n_stimuli),2)):\n",
    "        i_stim_pattern = neural_data.sel({\"stim\":i_stim_idx}) # xarray cannot select two non-unique indices at the same time\n",
    "        j_stim_pattern = neural_data.sel({\"stim\":j_stim_idx})\n",
    "        X = xr.concat([i_stim_pattern, j_stim_pattern], dim=\"stim\")\n",
    "        y = X.stim.values\n",
    "\n",
    "        cv = StratifiedKFold()\n",
    "        for i, (train_index, test_index) in enumerate(cv.split(X, y)):\n",
    "            classifier.fit(X[train_index].values, y[train_index])\n",
    "            pred = classifier.predict(X[test_index])\n",
    "            acc.loc[i_stim_idx, j_stim_idx] += np.sum((pred==y[test_index]))\n",
    "        acc.loc[i_stim_idx, j_stim_idx] = acc.loc[i_stim_idx, j_stim_idx]/len(y)\n",
    "        acc.loc[j_stim_idx, i_stim_idx] = acc.loc[i_stim_idx, j_stim_idx]\n",
    "\n",
    "    return acc\n",
    "\n",
    "def np2xr(data, coords):\n",
    "    \"\"\"\n",
    "    Convert numpy arrays to labelled xarrays.\n",
    "\n",
    "    Parameters:\n",
    "    - data (numpy.ndarray): The data array.\n",
    "    - coords (dict): A dictionary mapping dimension names to coordinate arrays.\n",
    "\n",
    "    Returns:\n",
    "    - xarray.DataArray: The labelled xarray.\n",
    "    \"\"\"\n",
    "    dims = list(coords.keys())\n",
    "    xarray = xr.DataArray(data, dims=dims, coords=coords)\n",
    "    return xarray\n",
    "\n",
    "def generate_activity_patterns(n_stimuli, n_neurons):\n",
    "    \"\"\"\n",
    "    Generate a stimulus x neural response matrix with reasonable Euclidean distances between different stimuli.\n",
    "\n",
    "    Parameters:\n",
    "    - n_stimuli (int): The number of stimuli.\n",
    "    - n_neurons (int): The number of neurons.\n",
    "\n",
    "    Returns:\n",
    "    - numpy.ndarray: The generated activity patterns of shape (n_stimuli, n_neurons).\n",
    "    \"\"\"\n",
    "    activity_patterns = np.random.default_rng(seed=0).uniform(low=0.0, high=2.0, size=(n_stimuli, n_neurons))\n",
    "    if n_neurons > 2:\n",
    "        # scale the neural response for each stimulus to make the stimuli discriminable for the 100-neuron case\n",
    "        scale = np.arange(1, n_stimuli+1) * 0.5\n",
    "        np.random.shuffle(scale)\n",
    "        activity_patterns *= scale.reshape(-1, 1)\n",
    "    return activity_patterns\n",
    "\n",
    "def repeat_first_row(arr):\n",
    "    \"\"\"repeat the activity patterns for the first stimuli\"\"\"\n",
    "    return np.vstack((arr,arr[0]))\n",
    "\n",
    "def generate_activity_patterns_wrapper(n_stimuli, n_neurons):\n",
    "    \"\"\"a wrapper for generating the activity patterns for a set of stimuli and labeling the data with xarray\"\"\"\n",
    "    activity_patterns = generate_activity_patterns(n_stimuli-1, n_neurons)\n",
    "    if n_neurons == 2:\n",
    "        cov = get_correlated_covariance(n_neurons)\n",
    "        v = find_contour_direction(cov, p=0.06) * 2\n",
    "        activity_patterns[1] = activity_patterns[0] + v\n",
    "        l = np.linalg.norm(v)\n",
    "        activity_patterns[2] = activity_patterns[0] + np.array([0, l])\n",
    "    activity_patterns = repeat_first_row(activity_patterns)\n",
    "    # convert to xarray\n",
    "    coords={\"stim\": np.arange(n_stimuli), \"neuron\": np.arange(n_neurons)}\n",
    "    activity_patterns = np2xr(activity_patterns, coords)\n",
    "    return activity_patterns\n",
    "\n",
    "def get_isotropic_covariance(n_neurons):\n",
    "    \"\"\"Generate an isotropic covariance matrix.\n",
    "\n",
    "    Parameters:\n",
    "    - n_neurons (int): The number of neurons.\n",
    "    - noise_std (float): The standard deviation of the noise.\n",
    "\n",
    "    Returns:\n",
    "    - numpy.ndarray: The isotropic covariance matrix of shape (n_neurons, n_neurons).\n",
    "    \"\"\"\n",
    "    noise_std = 1.0 if n_neurons==2 else 15.0\n",
    "    return np.identity(n_neurons) * noise_std**2\n",
    "\n",
    "def get_correlated_covariance(n_neurons):\n",
    "    \"\"\"\n",
    "    Generate a correlated covariance matrix using a Radial Basis Function (RBF) kernel.\n",
    "\n",
    "    Parameters:\n",
    "    - n_neurons (int): The number of neurons.\n",
    "    - length_scale (float): The length scale parameter for the RBF kernel.\n",
    "    - noise_amplitude (float, optional): The amplitude of the noise. Default is 1.0.\n",
    "\n",
    "    Returns:\n",
    "    - numpy.ndarray: The correlated covariance matrix of shape (n_neurons, n_neurons).\n",
    "    \"\"\"\n",
    "    if n_neurons == 2:\n",
    "        cov = np.array([1, 0.7, 0.7, 1]).reshape(2,2)*1.4 # make the neurons correlated\n",
    "    elif n_neurons > 2:\n",
    "        from sklearn.gaussian_process.kernels import RBF\n",
    "        noise_amplitude = 50.0\n",
    "        neuron_idx = np.arange(n_neurons).reshape(-1,1)\n",
    "        kernel = RBF(length_scale=30)\n",
    "        cov = kernel(neuron_idx)\n",
    "        cov = cov*noise_amplitude + np.identity(n_neurons) * 60\n",
    "    return cov\n",
    "\n",
    "def find_contour_direction(cov, p):\n",
    "    \"\"\"find the ellipse direction for a particular covariance matrix.\"\"\"\n",
    "    eigenvalues, eigenvectors = np.linalg.eig(cov)\n",
    "    assert eigenvalues[0]>np.max(eigenvalues[1:])\n",
    "    v0=eigenvectors[:,0]\n",
    "    return v0\n",
    "\n",
    "def add_correlated_noise(activity_patterns, cov, repetitions=50):\n",
    "    \"\"\"\n",
    "    Add correlated noise to the activity patterns.\n",
    "\n",
    "    Parameters:\n",
    "    - activity_patterns (numpy.ndarray): The activity patterns of shape (n_stimuli, n_neurons).\n",
    "    - cov (numpy.ndarray): The covariance matrix of shape (n_neurons, n_neurons).\n",
    "    - repetitions (int, optional): The number of repetitions. Default is 5.\n",
    "\n",
    "    Returns:\n",
    "    - numpy.ndarray: The activity patterns with added correlated noise of shape\n",
    "      (n_stimuli * repetitions, n_neurons).\n",
    "    \"\"\"\n",
    "\n",
    "    n_stimuli, n_neurons=activity_patterns.shape\n",
    "    activity_patterns = np.repeat(activity_patterns, repetitions, axis=0)\n",
    "    noise = np.random.multivariate_normal(mean=np.zeros(n_neurons), cov=cov, size=n_stimuli * repetitions)\n",
    "    activity_patterns += noise\n",
    "\n",
    "    return activity_patterns\n",
    "\n",
    "def generate_noisy_activity_patterns_wrapper(activity_patterns, repetitions):\n",
    "    \"\"\"Generate noisy activity patterns by adding isotropic and correlated noise to the given activity patterns.\n",
    "\n",
    "    Parameters:\n",
    "    - activity_patterns (numpy.ndarray): The original activity patterns of shape (n_stimuli, n_neurons).\n",
    "    - repetitions (int): The number of measurement repetitions.\n",
    "\n",
    "    Returns:\n",
    "    - isotropic_noised_data (xarray.DataArray): The activity patterns with added isotropic noise.\n",
    "    - isotropic_cov (numpy.ndarray): The isotropic covariance matrix.\n",
    "    - corr_noised_data (xarray.DataArray): The activity patterns with added correlated noise.\n",
    "    - correlated_cov (numpy.ndarray): The correlated covariance matrix.\n",
    "    \"\"\"\n",
    "    n_stimuli, n_neurons = activity_patterns.shape\n",
    "    isotropic_cov = get_isotropic_covariance(n_neurons)\n",
    "    correlated_cov = get_correlated_covariance(n_neurons, length_scale=n_neurons/20)\n",
    "\n",
    "    isotropic_noised_data = add_correlated_noise(activity_patterns, cov=isotropic_cov, repetitions = repetitions)\n",
    "    corr_noised_data = add_correlated_noise(activity_patterns, cov=correlated_cov, repetitions = repetitions)\n",
    "\n",
    "    coords = {\"stim\": np.arange(n_stimuli).repeat(repetitions), \"neuron\": np.arange(n_neurons)}\n",
    "    isotropic_noised_data = np2xr(isotropic_noised_data, coords)\n",
    "    corr_noised_data = np2xr(corr_noised_data, coords)\n",
    "\n",
    "    return isotropic_noised_data, isotropic_cov, corr_noised_data, correlated_cov\n",
    "\n",
    "def calc_rdm(neural_data, method='euclidean', noise=None, normalize_by_channels=True):\n",
    "    \"\"\"\n",
    "    Calculate the representational dissimilarity matrix (RDM) from neural data.\n",
    "\n",
    "    Parameters:\n",
    "    - neural_data (xarray.DataArray): Neural data with dimensions \"stim\" and \"neuron\".\n",
    "    - method (str): Dissimilarity measure to use for calculating the RDM. Default is 'euclidean'.\n",
    "    - noise (float or None): Noise level to add to the dissimilarities. Default is None.\n",
    "    - normalize_by_channels (bool): rsatoolbox normalize (divide) the distances by the number of channels.\n",
    "                                    set to False if raw squared euclidean distance is desired.\n",
    "\n",
    "    Returns:\n",
    "    - rdm (pyrsa.rdm.rdms.RDMs): representational dissimilarity matrix.\n",
    "    \"\"\"\n",
    "\n",
    "    dataset = rsd.Dataset(measurements=neural_data.values,\n",
    "                          obs_descriptors={\"stim\": neural_data.stim.values},\n",
    "                          channel_descriptors={\"neuron\": neural_data.neuron.values})\n",
    "    rdm = rsr.calc_rdm(dataset, method=method, noise=noise, descriptor='stim')\n",
    "    if not normalize_by_channels:\n",
    "        n_neurons=len(neural_data.neuron.values)\n",
    "        rdm.dissimilarities *= n_neurons\n",
    "    return rdm\n",
    "\n",
    "def vectorize_matrix(matrix):\n",
    "    \"\"\"\n",
    "    Extract the upper triangular part of a symmetric matrix.\n",
    "\n",
    "    Parameters:\n",
    "    - matrix (xarray.DataArray): a symmetric matrix.\n",
    "\n",
    "    Returns:\n",
    "    - numpy.ndarray: Vectorized values.\n",
    "    \"\"\"\n",
    "    n=matrix.shape[0]\n",
    "    return matrix.values[np.triu_indices(n, k = 1)]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##  Plotting functions\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "28b61a80-ceff-4af5-b122-d161b8ed8724",
   "metadata": {
    "cellView": "form",
    "execution": {},
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "# @title Plotting functions\n",
    "\n",
    "# visualize noise distributions\n",
    "def visualize_2d_noise(clean_dataset, isotropic_cov, correlated_cov, isotropic_noised_data, correlated_noised_data):\n",
    "    noise_dists = [\"isotropic\", \"correlated\"]\n",
    "    n_rows, n_cols = len(noise_dists), 2\n",
    "    fig, ax=plt.subplots(n_rows, n_cols, figsize=(10,8), sharex=True, sharey=True, dpi=135, layout='constrained')\n",
    "    n_neurons = 2\n",
    "    stim_idx = [0,1,2]\n",
    "    marker_types = [\"o\", \"X\", \"d\"]\n",
    "\n",
    "    alpha = 1\n",
    "    cmap = pl.cm.Blues\n",
    "    my_cmap = cmap(np.arange(cmap.N))\n",
    "    my_cmap[:,-1] = np.linspace(0, alpha, cmap.N)\n",
    "    my_cmap = ListedColormap(my_cmap)\n",
    "\n",
    "    for i_noise, noise_dist in enumerate(noise_dists):\n",
    "        cov = isotropic_cov if noise_dist=='isotropic' else correlated_cov\n",
    "        r = 3\n",
    "        x = np.arange(-r, r, 0.025)\n",
    "        y = np.arange(-r, r, 0.025)\n",
    "        X, Y = np.meshgrid(x, y)\n",
    "        pos = np.dstack((X, Y))\n",
    "        Z = multivariate_normal(mean=[0,0], cov=cov).pdf(pos)\n",
    "\n",
    "        for i_stim,stim in enumerate(clean_dataset.loc[dict(stim=stim_idx)]):\n",
    "            zorder=2 if i_stim==0 else 1\n",
    "            ax[i_noise,0].contourf(stim[0].values+X, stim[1].values+Y, Z, cmap=my_cmap,zorder=zorder)\n",
    "            ax[i_noise,0].scatter(stim[0], stim[1], color=\"darkorange\", marker=marker_types[i_stim],s=50,zorder=2)\n",
    "            x, y = clean_dataset.loc[i_stim].values\n",
    "            ax[i_noise,0].annotate(rf\"$s_{i_stim}$\", xy=(x-0.4,y+0.2), color=\"darkorange\", fontsize=10, zorder=2)\n",
    "            if i_stim==0:\n",
    "                cs=ax[i_noise,1].contourf(stim[0].values+X, stim[1].values+Y, Z, cmap=my_cmap, zorder=1)\n",
    "                ax[i_noise,1].scatter(stim[0], stim[1], color=\"darkorange\", marker=marker_types[i_stim],s=50,zorder=1)\n",
    "\n",
    "        x, y = clean_dataset.loc[0].values\n",
    "        dx1, dy1 = clean_dataset.loc[1].values - clean_dataset.loc[0].values\n",
    "        dx2, dy2 = clean_dataset.loc[2].values - clean_dataset.loc[0].values\n",
    "        ax[i_noise,0].arrow(x=x, y=y, dx=dx1*0.9, dy=dy1*0.9, linewidth=2, color=\"dimgray\", head_width=0.1, zorder=2)\n",
    "        ax[i_noise,0].arrow(x=x, y=y, dx=dx2*0.9, dy=dy2*0.9, linewidth=2, color=\"dimgray\", head_width=0.1, zorder=2)\n",
    "        ax[i_noise,0].set_xlim(-2,6)\n",
    "        ax[i_noise,0].set_ylim(-2,6)\n",
    "\n",
    "\n",
    "        ax[i_noise,0].annotate(f\"{noise_dist}\\nnoise\", xy=(-0.3, 0.5),\n",
    "                xycoords='axes fraction', ha='center', va='baseline',fontsize=10)\n",
    "\n",
    "        # noised_data\n",
    "        data = isotropic_noised_data if noise_dist=='isotropic' else correlated_noised_data\n",
    "        stim_data = data.loc[dict(stim=0)]\n",
    "        ax[i_noise,1].scatter(stim_data[:,0], stim_data[:,1], color=\"darkblue\", marker=marker_types[0], alpha=0.3, edgecolor='none')\n",
    "\n",
    "        for i_col in range(n_cols):\n",
    "            ax[i_noise, i_col].set_xlabel(\"neuron 1\", fontsize=9)\n",
    "            ax[i_noise, i_col].set_ylabel(\"neuron 2\", fontsize=9)\n",
    "\n",
    "    ax[0, 0].annotate(\"distance between two stimulus pairs\", xy=(0.5, 1.05),\n",
    "                    xycoords='axes fraction', ha='center', va='baseline',fontsize=10)\n",
    "    ax[0, 1].annotate(r\"noisy samples of neural response to $s_0$\", xy=(0.5, 1.05),\n",
    "                    xycoords='axes fraction', ha='center', va='baseline',fontsize=10)\n",
    "\n",
    "\n",
    "    # we only plot the colorbar for the whole plot\n",
    "    # because we set the covariance matrices for both isotropic and correlated noise such that their maximum probability density is around 0.16\n",
    "    cbar=fig.colorbar(cs, ax=ax.ravel().tolist(),fraction=0.02, pad=0.04)\n",
    "    cbar.set_label('probability density', rotation=270, labelpad=13)\n",
    "\n",
    "def visualize_100d_noise(isotropic_cov, correlated_cov):\n",
    "\n",
    "    with plt.xkcd():\n",
    "        fig,ax=plt.subplots(2, 2, sharey='row', figsize=(12,8), dpi=120)\n",
    "        noise_dists = [\"isotropic\", \"correlated\"]\n",
    "        n_samples=5\n",
    "\n",
    "        for i_noise, noise_dist in enumerate(noise_dists):\n",
    "            cov = isotropic_cov if noise_dist=='isotropic' else correlated_cov\n",
    "            noise = np.random.multivariate_normal(mean=np.zeros(100), cov=cov, size=n_samples)\n",
    "\n",
    "            # plot the covariance matrix\n",
    "            divider = make_axes_locatable(ax[0, i_noise])\n",
    "            cax = divider.append_axes('right', size='5%', pad=0.05)\n",
    "            cov = ax[0, i_noise].matshow(cov, cmap='bone')\n",
    "            fig.colorbar(cov, cax=cax, orientation='vertical')\n",
    "            ax[0,i_noise].set_title(noise_dist+\" noise\")\n",
    "\n",
    "            # plot the noise samples\n",
    "            for n in noise:\n",
    "                ax[1, i_noise].plot(n, color=sns.color_palette()[0], linewidth=1)\n",
    "            ax[1, i_noise].set_box_aspect(0.8)\n",
    "\n",
    "            ax[0, i_noise].set_xlabel(\"neuron index\")\n",
    "            ax[0, i_noise].set_ylabel(\"neuron index\")\n",
    "            ax[1, i_noise].set_xlabel(\"neuron index\")\n",
    "            ax[1, i_noise].set_ylabel(\"noise amplitude\")\n",
    "        ax[0,0].annotate(\"covariance\\nmatrix\", xy=(-0.4, 0.5), xycoords='axes fraction', ha='center', va='baseline',fontsize=10)\n",
    "        ax[1,0].annotate(f\"{n_samples} samples\\nof noise\", xy=(-0.35, 0.5), xycoords='axes fraction', ha='center', va='baseline',fontsize=10)\n",
    "\n",
    "def plot_accuracy_against_distance(acc, rdm_euclidean, rdm_mahalanobis):\n",
    "    with plt.xkcd():\n",
    "        fig,ax=plt.subplots(2, 2, figsize=(12,8), sharex='col', sharey=True, dpi=120)\n",
    "        for i_noise, (noise_type, acc_val) in enumerate(acc.items()):\n",
    "            for i_dist, distance in enumerate([\"Euclidean\", \"Mahalanobis\"]):\n",
    "                rdm = rdm_euclidean if distance == 'Euclidean' else rdm_mahalanobis\n",
    "                # make sure the rdm matrix and accuracy matrix organize the stimuli in the same order\n",
    "                assert rdm.pattern_descriptors['stim'] == acc_val.stim1.values.tolist()\n",
    "                x = rdm.dissimilarities.squeeze()\n",
    "                y = vectorize_matrix(acc_val)\n",
    "                sns.regplot(x=x, y=y, ax=ax[i_noise, i_dist], ci=None, scatter_kws={'alpha':0.7}, color=f\"C{i_dist}\")\n",
    "                r, p = scipy.stats.pearsonr(x, y)\n",
    "                ax[i_noise, i_dist].annotate('r={:.2f}, p={:.2g}'.format(r, p), xy=(0.2,0.9), xycoords='axes fraction', ha='center', va='baseline',fontsize=12)\n",
    "                ax[1, i_dist].set_xlabel(f\"squared {distance} distance\")\n",
    "                if i_dist == 0:\n",
    "                    ax[i_noise,0].annotate(noise_type + \"\\nnoise\", xy=(-0.25, 0.5), xycoords='axes fraction', ha='center', va='baseline',fontsize=12)\n",
    "\n",
    "            ax[i_noise,0].set_ylabel(\"decoding accuracy\")\n",
    "        plt.tight_layout()\n",
    "        return fig\n",
    "\n",
    "def plot_estimated_distance(ground_truth_rdm, noisy_rdm_euclidean, noisy_rdm_crossclidean, n_neurons=100):\n",
    "    with plt.xkcd():\n",
    "        fig, ax=plt.subplots(2, 2, figsize=(6.5,6), sharey='row', sharex='row',dpi=150)\n",
    "\n",
    "        for i, noise_dist in enumerate(['isotropic', 'correlated']):\n",
    "            for j, rdm in enumerate([noisy_rdm_euclidean, noisy_rdm_crossclidean]):\n",
    "                ylabel=\"squared Euclidean\" if j==0 else \"cross-validated squared Euclidean distance\"\n",
    "                # check the order of the stimuli in both RDMs is matched (note: rsatoolbox automatically sort stimuli based on values)\n",
    "                assert rdm[n_neurons][noise_dist].pattern_descriptors['stim'] == ground_truth_rdm[n_neurons].pattern_descriptors['stim']\n",
    "                ax[i,j].scatter(ground_truth_rdm[n_neurons].dissimilarities.squeeze(),\n",
    "                                rdm[n_neurons][noise_dist].dissimilarities.squeeze(),\n",
    "                                color=f\"C{i}\",s=20\n",
    "                            )\n",
    "                max_dist=np.ceil(rdm[n_neurons][noise_dist].dissimilarities.max())\n",
    "                ax[i,j].plot(np.arange(-1,100), np.arange(-1,100), linestyle='dashed', color='gray')\n",
    "                ax[i,j].set_ylabel(\"estimated squared Euclidean distance\\n(from noisy patterns)\", fontsize=7)\n",
    "                ax[i,j].set_xlabel(\"ground truth euclidean distance\\n(no noise)\", fontsize=7)\n",
    "                ax[i,j].set_xlim(0, max_dist+3)\n",
    "                ax[i,j].set_ylim(0, max_dist+3)\n",
    "                ax[i,j].tick_params(axis='both', which='major', labelsize=6)\n",
    "\n",
    "                title=\"squared Euclidean\" if j==0 else \"cross-validated squared Euclidean\"\n",
    "\n",
    "                if i==0:\n",
    "                    # ax[i,j].annotate(noise_dist, xy=(0.5, 1.05),\n",
    "                    #             xycoords='axes fraction', ha='center', va='baseline',fontsize=9)\n",
    "                    ax[i,j].set_title(title, fontsize=12)\n",
    "\n",
    "            ax[i,0].annotate(f\"{noise_dist}\\nnoise\", xy=(-0.4, 0.5),\n",
    "                        xycoords='axes fraction', ha='center', va='baseline',fontsize=9)\n",
    "        plt.tight_layout()\n",
    "\n",
    "def plot_distance_after_projection(true_dist, projected_dist, n_neurons_list, n_dims_list):\n",
    "    with plt.xkcd():\n",
    "        fig, ax=plt.subplots(1, 2, figsize=(6,2.5), dpi=200)\n",
    "        for i, n_neurons in enumerate(n_neurons_list):\n",
    "            if n_neurons == 100:\n",
    "                projected_dist[n_neurons] = projected_dist[n_neurons]/100\n",
    "                true_dist[n_neurons] = true_dist[n_neurons]/100\n",
    "            ax[i].scatter(x = n_dims_list, y=projected_dist[n_neurons],s=15)\n",
    "            ax[i].set_ylabel(\"squared euclidean distance\\nafter random projection\", fontsize=6)\n",
    "            ax[i].set_xlabel(\"dimensionality\", fontsize=7)\n",
    "            ax[i].tick_params(axis='both', which='major', labelsize=5)\n",
    "            ax[i].axhline(y=true_dist[n_neurons], linestyle=\"dashed\", color=\"gray\")\n",
    "            ax[i].text(n_dims_list[-1], true_dist[n_neurons], 'true euclidean distance', color='gray', ha='right', va='top', fontsize=4)\n",
    "            title = \"two neurons\" if n_neurons == 2 else \"100 neurons\"\n",
    "            ax[i].set_title(title, fontsize=7)\n",
    "        plt.tight_layout()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##  Video 1: Tutorial Introduction\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d478eccc-37b8-4cc8-b80b-4647688e0139",
   "metadata": {
    "cellView": "form",
    "execution": {},
    "tags": [
     "remove-input"
    ]
   },
   "outputs": [],
   "source": [
    "# @title Video 1: Tutorial Introduction\n",
    "\n",
    "from ipywidgets import widgets\n",
    "from IPython.display import YouTubeVideo\n",
    "from IPython.display import IFrame\n",
    "from IPython.display import display\n",
    "\n",
    "class PlayVideo(IFrame):\n",
    "  def __init__(self, id, source, page=1, width=400, height=300, **kwargs):\n",
    "    self.id = id\n",
    "    if source == 'Bilibili':\n",
    "      src = f'https://player.bilibili.com/player.html?bvid={id}&page={page}'\n",
    "    elif source == 'Osf':\n",
    "      src = f'https://mfr.ca-1.osf.io/render?url=https://osf.io/download/{id}/?direct%26mode=render'\n",
    "    super(PlayVideo, self).__init__(src, width, height, **kwargs)\n",
    "\n",
    "def display_videos(video_ids, W=400, H=300, fs=1):\n",
    "  tab_contents = []\n",
    "  for i, video_id in enumerate(video_ids):\n",
    "    out = widgets.Output()\n",
    "    with out:\n",
    "      if video_ids[i][0] == 'Youtube':\n",
    "        video = YouTubeVideo(id=video_ids[i][1], width=W,\n",
    "                             height=H, fs=fs, rel=0)\n",
    "        print(f'Video available at https://youtube.com/watch?v={video.id}')\n",
    "      else:\n",
    "        video = PlayVideo(id=video_ids[i][1], source=video_ids[i][0], width=W,\n",
    "                          height=H, fs=fs, autoplay=False)\n",
    "        if video_ids[i][0] == 'Bilibili':\n",
    "          print(f'Video available at https://www.bilibili.com/video/{video.id}')\n",
    "        elif video_ids[i][0] == 'Osf':\n",
    "          print(f'Video available at https://osf.io/{video.id}')\n",
    "      display(video)\n",
    "    tab_contents.append(out)\n",
    "  return tab_contents\n",
    "\n",
    "video_ids = [('Youtube', '2cMfsTw7YsI'), ('Bilibili', 'BV1Tx4y1t7nL')]\n",
    "tab_contents = display_videos(video_ids, W=730, H=410)\n",
    "tabs = widgets.Tab()\n",
    "tabs.children = tab_contents\n",
    "for i in range(len(tab_contents)):\n",
    "  tabs.set_title(i, video_ids[i][0])\n",
    "display(tabs)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##  Submit your feedback\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fc876284-c738-4556-ad15-f3cd25650a7f",
   "metadata": {
    "cellView": "form",
    "execution": {},
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "# @title Submit your feedback\n",
    "content_review(f\"{feedback_prefix}_tutorial_introduction\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "54d5e296-9674-4260-a242-9026cd131114",
   "metadata": {
    "execution": {}
   },
   "source": [
    "---\n",
    "# Section 1: Simulate neural data and visualize noise distributions\n",
    "\n",
    "Estimated timing to here from start of tutorial: 15 minutes"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7c316afe-1ee7-46eb-959b-3a9e0d7fc1c9",
   "metadata": {
    "execution": {}
   },
   "source": [
    "<img src=\"https://github.com/neuromatch/NeuroAI_Course/blob/main/tutorials/W1D3_ComparingArtificialAndBiologicalNetworks/static/response_matrix.png?raw=true\" width=600 />"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "aa392b7f-ed8f-44e7-87de-27aa0ae2a083",
   "metadata": {
    "execution": {}
   },
   "source": [
    "We will start by generating two neural datasets in response to a set of stimuli, represented as a stimulus-response matrix (stimuli x neurons). The first dataset will be low-dimensional, containing only two neurons, to help with illustration and visualization. The second dataset will be high-dimensional, containing 100 neurons. Each will have 10 stimuli. \n",
    "\n",
    "The mean activity pattern of each stimulus (each row of the matrix) is generated by randomly sampling from a uniform distribution. To ensure that the differences between the mean activity of different stimuli are distinct for later illustration, we scale each activity pattern by a constant for the 100-neuron dataset. This adds some structure to the simulated representational geometry as observed in neural data, whereas, for randomly sampled patterns in a high-dimensional space, the patterns of each pair tend to be similarly far apart. The implementations can be found above in the `generate_activity_patterns` function. \n",
    "\n",
    "Then, we add two types of additive noise to the mean activity patterns. The first type of noise is Gaussian and independent between neurons (**isotropic**) and stimuli (**homoscedastic**). The second type of noise is Gaussian and independent between stimuli (**homoscedastic**) but correlated between neurons (**nonisotropic**). *Note that we are not aiming to simulate biologically plausible neural activity patterns, which would have different noise distributions.* The noise we add may lead to negative response values in the neural data.\n",
    "\n",
    "We will provide various visualizations of the noise to help build intuitions. If you are curious about the implementations, please refer to the `get_isotropic_covariance` and `get_correlated_covariance` functions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f3299e47-ee31-4152-a59b-22ddd582fbf2",
   "metadata": {
    "execution": {}
   },
   "outputs": [],
   "source": [
    "n_stimuli=10\n",
    "n_neurons_list=[2, 100]\n",
    "\n",
    "clean_dataset={}\n",
    "for n_neurons in n_neurons_list:\n",
    "    clean_dataset[n_neurons] = generate_activity_patterns_wrapper(n_stimuli=n_stimuli, n_neurons=n_neurons)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f329c122-077d-4d58-87e0-31f07915a465",
   "metadata": {
    "execution": {}
   },
   "source": [
    "To simulate noisy measurements, we introduce isotropic or correlated noise to the clean (true) activity patterns and repeat this process to obtain multiple noisy simulated neural measurements of responses to the same stimulus. This is analogous to a neurophysiological experiment where we repeatedly measure the response pattern to each stimulus: each measurement reflects the same true pattern but is affected by new noise.\n",
    "\n",
    "## Coding Exercise 1\n",
    "\n",
    "How would you implement isotropic or correlated Gaussian noise for two neurons? Create a covariance matrix for each type of noise:\n",
    "\n",
    "* For the isotropic Gaussian, let the variance of each neuron be 1.\n",
    "* For the correlated Gaussian, let the variance of each neuron be 1 and the covariance be 0.6."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "74c14bed-0207-4274-b341-3116d5a05583",
   "metadata": {
    "execution": {}
   },
   "source": [
    "**Hint**: Let our covariance matrix be $\\Sigma$. The diagonal entry $\\Sigma_{i,i}$ represents the variance of element $i$, and the off-diagonal entry $\\Sigma_{i,j}$ (where $i\\neq j$) represents the covariance between two elements $i$ and $j$. As a result, isotropic Gaussian noise can be represented by a diagonal matrix (where off-diagonal entries are 0). The correlated covariance matrix can be represented by a symmetric non-diagonal matrix ($\\Sigma_{i,j}=\\Sigma_{j,i})$."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "69495198-8849-4357-b874-5a652aa0613d",
   "metadata": {
    "colab_type": "text",
    "execution": {}
   },
   "source": [
    "```python\n",
    "#################################################\n",
    "## TODO for students: fill in the missing variables ##\n",
    "raise NotImplementedError(\"Student exercise\")\n",
    "#################################################\n",
    "n_neurons = 2\n",
    "isotropic_cov_2d = ...\n",
    "correlated_cov_2d = ...\n",
    "\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a06fd163-7975-47f7-ae5a-075ca750a2ea",
   "metadata": {
    "execution": {}
   },
   "outputs": [],
   "source": [
    "# to_remove solution\n",
    "n_neurons = 2\n",
    "isotropic_cov_2d = np.identity(n_neurons)\n",
    "correlated_cov_2d = np.array([[1.,0.6],[0.6,1.0]])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "82b0fd5e-559f-4376-a670-056bba042d62",
   "metadata": {
    "execution": {}
   },
   "source": [
    "We have implemented the covariance matrices for you. Run the following code block to generate 100 noisy neural responses to the 10 stimuli for different correlation structures."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##  Generate data\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1fa28f6b-f47a-48e6-b029-7169d1325241",
   "metadata": {
    "cellView": "form",
    "execution": {},
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "# @title Generate data\n",
    "\n",
    "repetitions=100\n",
    "\n",
    "isotropic_noised_data = {}\n",
    "isotropic_cov = {} # covariance matrix for the isotropic noise\n",
    "correlated_noised_data = {}\n",
    "correlated_cov = {}\n",
    "for n_neurons in n_neurons_list:\n",
    "    coords = {\"stim\": np.arange(n_stimuli).repeat(repetitions), \"neuron\": np.arange(n_neurons)}\n",
    "\n",
    "    # add isotropic noise\n",
    "    isotropic_cov[n_neurons] = get_isotropic_covariance(n_neurons)\n",
    "    isotropic_noised_data[n_neurons] = add_correlated_noise(clean_dataset[n_neurons],\n",
    "                                                            cov=isotropic_cov[n_neurons],\n",
    "                                                            repetitions = repetitions)\n",
    "    isotropic_noised_data[n_neurons] = np2xr(isotropic_noised_data[n_neurons], coords)\n",
    "\n",
    "    # add correlated noise\n",
    "    correlated_cov[n_neurons] = get_correlated_covariance(n_neurons)\n",
    "    correlated_noised_data[n_neurons] = add_correlated_noise(clean_dataset[n_neurons],\n",
    "                                                             cov=correlated_cov[n_neurons],\n",
    "                                                             repetitions = repetitions)\n",
    "    correlated_noised_data[n_neurons] = np2xr(correlated_noised_data[n_neurons], coords)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3fe90774-d313-44e1-b82e-7d37e0daff3b",
   "metadata": {
    "execution": {}
   },
   "source": [
    "Let's visualize the effect of the noise distribution on the distances between the stimuli using the two-neuron dataset. \n",
    "\n",
    "We can represent the noise distribution using a contour plot centered around the mean stimulus response. The contour plot represents a three-dimensional surface on a two-dimensional plane. In this case, the two dimensions represent the responses of the two neurons, and the contours indicate the probability density of the noise distribution. Darker colors indicate higher probability density. As we move further away from the mean response for a particular stimulus, the probability density decreases, meaning that the likelihood of observing a neural response at that point is lower. \n",
    "\n",
    "We have generated the clean activity patterns such that the Euclidean distances between stimulus pairs [$s_0$, $s_1$] and [$s_0$, $s_2$] are the same (see the arrows in the left panels). However, depending on the noise distribution, the discriminability of the stimulus pairs is different:\n",
    "\n",
    "- **Isotropic noise** (first row): the likelihood of observing the mean neural response of $s_1$ and $s_2$, given that the stimulus shown is $s_0$, is the same. The discriminability between a pair of stimuli is precisely defined by the Euclidean distance (Kriegeskorte & Diedrichsen, 2019).\n",
    "- **Correlated noise** (second row): even though the Euclidean distance between the two stimulus pairs is the same, the discriminability between $s_0$ and $s_1$ is lower than between $s_0$ and $s_2$. This is because the direction of the noise aligns with the signal direction for $s_0$ and $s_1$ (the line connecting their mean responses). As a result, the mean stimulus response of $s_1$ lies within the colored contours of stimulus $s_0$ (the probability density lies roughly between 0.06 and 0.08), while the mean stimulus response of $s_2$ lies outside the contours (which represent the regions where the probability density is close to 0). Namely, when the stimulus shown is $s_0$, the likelihood of observing a neural response that coincides with the mean response of $s_1$ is much higher than the likelihood of observing a neural response that coincides with the mean response of $s_2$. As we will see later, the discriminability depends on the Mahalanobis distance that takes into account the noise covariances between the neurons."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1ee3ebd2-9600-40d6-8579-4f987672ca52",
   "metadata": {
    "execution": {}
   },
   "outputs": [],
   "source": [
    "n_neurons = 2\n",
    "visualize_2d_noise(clean_dataset[n_neurons], isotropic_cov[n_neurons], correlated_cov[n_neurons], isotropic_noised_data[n_neurons], correlated_noised_data[n_neurons])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ad553536-db37-4e89-9d38-e62f47b4f254",
   "metadata": {
    "execution": {}
   },
   "source": [
    "Let's create covariance matrices for the 100 neuron dataset. For isotropic noise, the covariance between different neurons remains 0, resulting in a diagonal covariance matrix. To create a correlated covariance matrix, we imagine that the neurons are spatially arranged in a line, and their correlation decays with distance:\n",
    "\n",
    "$$\\Sigma_{i,j}=\\exp\\Big(-\\dfrac{||i-j||^2_2}{2l^2}\\Big)$$\n",
    "\n",
    "Here $||i-j||_2$ represents the Euclidean distance between indices $i$ and $j$, and $l$ is a length scale parameter. Assuming that the matrix indices indicate the neurons' positions in the brain, neighboring neurons tend to be correlated. To ensure that the correlated covariance matrix is well-conditioned and not singular, we added a small constant value to its diagonal entries.\n",
    "\n",
    "Note: we intentionally set the noise magnitude to be large so that decoding the neural responses would not be perfect. This allows us to better illustrate the relationship between decoding accuracy and the distances between stimuli in Section 2."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0a173d51-2cf9-4e40-8957-6d4cc8095c2c",
   "metadata": {
    "execution": {}
   },
   "outputs": [],
   "source": [
    "n_neurons = 100\n",
    "visualize_100d_noise(isotropic_cov[n_neurons], correlated_cov[n_neurons])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##  Submit your feedback\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1ba96089-e909-4c36-8768-d1706e5fb3dd",
   "metadata": {
    "cellView": "form",
    "execution": {},
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "# @title Submit your feedback\n",
    "content_review(f\"{feedback_prefix}_noise_representation\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c220c664-10a5-4ff4-a8db-b394506e7225",
   "metadata": {
    "execution": {}
   },
   "source": [
    "---\n",
    "# Section 2: Distances and discriminability between a pair of stimuli\n",
    "\n",
    "Estimated timing to here from start of tutorial: 30 minutes"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "685cb4ad-6806-406f-9372-72e74d6f2c99",
   "metadata": {
    "execution": {}
   },
   "source": [
    "As we alluded to earlier, for a pair of stimuli, there is a strong dependence between their distance and their discriminability (defined by the performance of a binary classifier for the stimulus pair). Moreover, matching the type of distance computed (Euclidean vs. Mahalanobis) and the noise distribution (isotropic vs. correlated) is important. <be>\n",
    "\n",
    "Let's briefly review how to compute the Euclidean and Mahalanobis distance. Let the activity patterns of stimuli $j$ and $k$ be $\\mathbf{b_j}$ and $\\mathbf{b_k}$ respectively.\n",
    "- The *squared* Euclidean distance is:\n",
    "  \n",
    "$$d^2_{\\text{Euclidean}}=||\\mathbf{b_i}-\\mathbf{b_j}||^2=(\\mathbf{b_i} - \\mathbf{b_j})(\\mathbf{b_i} - \\mathbf{b_j})^T$$\n",
    "\n",
    "- The *squared* mahalanobis distance is:\n",
    "\n",
    "$$d^2_{\\text{Mahalanobis}}=(\\mathbf{b_i} - \\mathbf{b_j})\\Sigma^{-1}(\\mathbf{b_i} - \\mathbf{b_j})^T$$\n",
    "\n",
    "where $\\Sigma$ is the covariance matrix across the neurons. Taking the inverse of the covariance matrix makes the noise approximately independent and identically distributed. Intuitively, the noisier a neuron is, the more we down-weight its response.\n",
    "Note: If we have repeated measurements for some stimulus $j$ ($\\mathbf{b_j^1}, \\mathbf{b_j^2}, ..., \\mathbf{b_j^n}$), we take the average and compute $\\mathbf{b_j}=\\dfrac{1}{n}\\sum_{i=1}^n \\mathbf{b_j^i}$"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9cef12b2-f6a7-4a63-ad6f-ac65ce2572b2",
   "metadata": {
    "execution": {}
   },
   "source": [
    "Let's first train a binary classifier for each pair of stimuli and calculate the decoding accuracy. We will use the **Fisher Linear Discriminant Analysis**. \n",
    "\n",
    "To classify two stimuli $i$ and $j$, the Fisher discriminant decision criterion is a threshold on the dot product $\\mathbf{wb}^T > c$, where $\\mathbf{w}=(\\mathbf{b_i}-\\mathbf{b_j})_{\\text{train}}\\Sigma_{\\text{train}}^{-1}$ is the weight vector and $\\mathbf{b}$ is the stimulus pattern we want to classify.\n",
    "\n",
    "Execute this cell below to calculate the classifier performance on the noisy data (generated by adding either isotropic or correlated noise)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e8828f5b-d844-46c5-933d-488bcbd834e3",
   "metadata": {
    "execution": {}
   },
   "outputs": [],
   "source": [
    "classifier = LinearDiscriminantAnalysis()\n",
    "acc = {}\n",
    "for n_neurons in n_neurons_list:\n",
    "    print(f\"Fitting for {n_neurons} neurons\")\n",
    "    acc[n_neurons]={}\n",
    "    acc[n_neurons]['isotropic'] = compute_classifier_acc(classifier, isotropic_noised_data[n_neurons], cov=isotropic_cov[n_neurons])\n",
    "    acc[n_neurons]['correlated'] = compute_classifier_acc(classifier, correlated_noised_data[n_neurons], cov=correlated_cov[n_neurons])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d950a4d4-6fd0-47de-b13c-030142347396",
   "metadata": {
    "execution": {}
   },
   "source": [
    "Let's calculate the *squared* Euclidean and Mahalanobis distances between the **clean neural activity** patterns to study the relationship between these distances and classification accuracy.\n",
    "\n",
    "## Coding exercise 2: Distance-metrics comparison\n",
    "\n",
    "In this coding exercise, you'll compute the *squared* Euclidean and Mahalanobis distance for *one pair* of stimuli in the 2-neuron dataset.\n",
    "\n",
    "You might be wondering why we are calculating the Mahalanobis distance for clean neural patterns, given that they are inherently noiseless. The reason is that our binary decoders are trained to classify noisy data. By computing the Mahalanobis distance between the clean neural activity patterns, we take the covariances into account, which allows us to better predict decoding accuracy (as you will see below). Consider the clean neural activity patterns as the mean neural response obtained from repeated measurements. If we collect many noisy measurements, our mean neural response will converge to the clean neural pattern."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8023e4a2-403c-488a-b296-10e077d6fbc1",
   "metadata": {
    "execution": {}
   },
   "source": [
    " The equations are provided again below:\n",
    " \n",
    "$$d^2_{\\text{Euclidean}}=||\\mathbf{b_j}-\\mathbf{b_k}||^2=(\\mathbf{b_i} - \\mathbf{b_j})(\\mathbf{b_i} - \\mathbf{b_j})^T$$\n",
    "\n",
    "$$d^2_{\\text{Mahalanobis}}=(\\mathbf{b_i} - \\mathbf{b_j})\\Sigma^{-1}(\\mathbf{b_i} - \\mathbf{b_j})^T$$ \n",
    "\n",
    "where $\\Sigma$ is the covariance matrix across the neurons."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1af582aa-d340-459f-8595-ce62f72fb3c4",
   "metadata": {
    "colab_type": "text",
    "execution": {}
   },
   "source": [
    "```python\n",
    "n_neurons = 2\n",
    "stimulus_idx = 0,1 # choose two stimuli\n",
    "#################################################\n",
    "raise NotImplementedError(\"Student exercise: complete Euclidean distance calculation by the formula provided above\")\n",
    "#################################################\n",
    "b_j = clean_dataset[n_neurons].loc[stimulus_idx[0]].values # select the stimulus response\n",
    "b_k = clean_dataset[n_neurons].loc[stimulus_idx[1]].values\n",
    "# compute the squared euclidean and mahalanobis distance, and then divide the distance by the number of neurons (2)\n",
    "euclidean_dist = ...\n",
    "mahalanobis_dist = ((b_j-b_k) @ np.linalg.inv(correlated_cov[n_neurons]) @ (b_j-b_k).T) / n_neurons\n",
    "\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3037225a-9294-47a2-835f-35d498a234e1",
   "metadata": {
    "execution": {}
   },
   "outputs": [],
   "source": [
    "#to_remove solution\n",
    "n_neurons = 2\n",
    "stimulus_idx = 0,1 # choose two stimuli\n",
    "b_j = clean_dataset[n_neurons].loc[stimulus_idx[0]].values # select the stimulus response\n",
    "b_k = clean_dataset[n_neurons].loc[stimulus_idx[1]].values\n",
    "# compute the squared euclidean and mahalanobis distance, and then divide the distance by the number of neurons (2)\n",
    "euclidean_dist = ((b_j-b_k) @ (b_j-b_k).T) / n_neurons\n",
    "mahalanobis_dist = ((b_j-b_k) @ np.linalg.inv(correlated_cov[n_neurons]) @ (b_j-b_k).T) / n_neurons"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "437ba224-0871-46e6-983d-efe688fdbbbb",
   "metadata": {
    "execution": {}
   },
   "source": [
    "### Coding Exercise 2 Discussion"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b7c68e31-5354-43bd-b74a-8674a556faf6",
   "metadata": {
    "execution": {}
   },
   "source": [
    "1. For isotropic Gaussian noise, what is the relationship between Euclidean and Mahalanobis distance?\n",
    "**Hint**: Check the equations above. In the special case of an identity covariance matrix, i.e., $\\Sigma=I$, what is the inverse of $\\Sigma$?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9f65bc8b-0b5a-4902-9a01-33ea96dcbbe9",
   "metadata": {
    "execution": {}
   },
   "outputs": [],
   "source": [
    "#to_remove explanation\n",
    "\n",
    "\"\"\"\n",
    "Discussion: For isotropic Gaussian noise, what is the relationship between Euclidean and Mahalanobis distance?\n",
    "\n",
    "For isotropic Gaussian noise, the Euclidean distance and Mahalanobis distance are equivalent up to a constant factor. When the covariance matrix is an identity matrix, the Euclidean and Mahalanobis distances are exactly equal.\n",
    "\"\"\";"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "78e372a1-9531-48c4-9218-5dad31c37131",
   "metadata": {
    "execution": {}
   },
   "source": [
    "The distances for all stimulus pairs in the 2-neuron and 100-neuron datasets are computed below. The`calc_rdm` function is a wrapper around the `rsatoolbox.data.Dataset` and `rsatoolbox.rdm.calc_rdm` modules in the [rsatoolbox](https://github.com/rsagroup/rsatoolbox) package. The rsatoolbox automatically computes distances for all possible pairs of stimuli. Namely, if our `clean_dataset` is an array of $k$ stimuli by $n$ neurons, then `rsatoolbox.rdm.calc_rdm` computes $k(k-1)/2$ distances."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5bc66507-3eff-4e28-9855-81ecdc5f7b26",
   "metadata": {
    "execution": {}
   },
   "outputs": [],
   "source": [
    "rdm_euclidean, rdm_mahalanobis = {}, {}\n",
    "for n_neurons in n_neurons_list:\n",
    "    rdm_euclidean[n_neurons] = calc_rdm(clean_dataset[n_neurons], method='euclidean')\n",
    "    rdm_mahalanobis[n_neurons] = calc_rdm(clean_dataset[n_neurons], method='mahalanobis', noise=np.linalg.inv(correlated_cov[n_neurons])) # plotting decoding accuracy from isotropic noise data"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "498fde1d-8b94-4cb9-9508-d8e53420a024",
   "metadata": {
    "execution": {}
   },
   "source": [
    "Verify that the distances you just computed correspond to the ones calculated by rsatoolbox. Note that the `calc_rdm` function normalizes the distance by the number of channels (e.g., divides the distance by 2 for the 2-neuron dataset), so please divide your distance by the number of neurons (2) as well.\n",
    "\n",
    "You can obtain the distances by accessing the `dissimilarities` property in the namespace (`rdm_euclidean[n_neurons].dissimilarities`). The `dissimilarities` are ordered by the `pattern_descriptors`. If the pattern descriptor shows [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], then the $9(9-1)/2=45$ distances are ordered as $d_{0,1}$, $d_{0,2}$, $d_{0,3}$,..., $d_{i,i+1}$, $d_{i,i+2}$, ..., $d_{8,9}$."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "878777ae-93f0-4667-afdc-ccf4aa15bd87",
   "metadata": {
    "execution": {}
   },
   "outputs": [],
   "source": [
    "rdm_euclidean[2] # access dissimilarities by rdm_euclidean[2].dissimilarities"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "19a94f09-c82a-43a6-9524-9899ff38e4ff",
   "metadata": {
    "execution": {}
   },
   "source": [
    "For each pair of stimuli, we can plot the decoding accuracy and the distance between them. We will generate four plots for each dataset, two noise distributions (isotropic or correlated) $\\times$ two distance measures (Euclidean or mahalanobis)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "865f440b-26f1-4edf-b981-2b6c17fdceb2",
   "metadata": {
    "execution": {}
   },
   "outputs": [],
   "source": [
    "n_neurons = 2 # change to 100 to visualize the relationship between distance and decoding accuracy for the 100-neuron dataset\n",
    "fig = plot_accuracy_against_distance(acc[n_neurons], rdm_euclidean[n_neurons], rdm_mahalanobis[n_neurons])\n",
    "fig.suptitle(f\"{n_neurons} neurons\", fontsize=15)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5ecb68bb-6301-4f99-a6a8-9be2f454f027",
   "metadata": {
    "execution": {}
   },
   "source": [
    "Notice that:\n",
    "- For a decoder trained to classify data with isotropic noise, the Euclidean distance predicts decoding accuracy better (higher correlation). The Mahalanobis distance predictions are less accurate because this distance uses the correlated covariance, which doesn't reflect the isotropic noise.\n",
    "- For a decoder trained to classify data with correlated noise, the Mahalanobis distance that takes into account the correct noise covariance predicts decoding accuracy better."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "####  Submit your feedback\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c87fcf79-413f-45f9-bb3c-5d80dc57171e",
   "metadata": {
    "cellView": "form",
    "execution": {},
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "# @title Submit your feedback\n",
    "content_review(f\"{feedback_prefix}_distance_metrics_comparison\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dbfbc0a1-ed25-4d7d-8b56-48d19aa85aef",
   "metadata": {
    "execution": {}
   },
   "source": [
    "---\n",
    "# Section 3: Cross-validated distances prevent the inflation of distance estimates by noise\n",
    "\n",
    "Estimated timing to here from start of tutorial: 40 minutes\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "25607d79-4f70-4af5-8b63-50decfa5277b",
   "metadata": {
    "execution": {}
   },
   "source": [
    "If we calculate the Euclidean distance between the **noisy** activity patterns of each pair of stimuli, we will observe that it's higher than in the no-noise condition. This is especially visible in the 100-neuron dataset. \n",
    "\n",
    "To understand this positive bias of distances, imagine two activity patterns that are, in truth, identical. With the addition of the noise, the estimated distance between the noisy data will always be larger than 0. The noise makes the patterns dissimilar and inflates the distance (Walther et al. 2016). This effect is particularly pronounced in high-dimensional data."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b843623c-9160-4887-9979-3efa7f358d11",
   "metadata": {
    "execution": {}
   },
   "source": [
    "Let's first calculate the squared Euclidean distances between the noisy activity patterns of each stimulus pair."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e311449f-719d-4dcb-863d-915b16d13443",
   "metadata": {
    "execution": {}
   },
   "outputs": [],
   "source": [
    "noisy_rdm_euclidean = {}\n",
    "for n_neurons in n_neurons_list:\n",
    "    noisy_rdm_euclidean[n_neurons] = {}\n",
    "    noisy_rdm_euclidean[n_neurons]['isotropic'] = calc_rdm(isotropic_noised_data[n_neurons], method='euclidean')\n",
    "    noisy_rdm_euclidean[n_neurons]['correlated'] = calc_rdm(correlated_noised_data[n_neurons], method='euclidean')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "69fac648-16b7-498b-8c81-9225e12c147e",
   "metadata": {
    "execution": {}
   },
   "source": [
    "To obtain an unbiased estimate, we can split the data into independent sets and cross-validate the difference between patterns across the two sets (Allefeld and Haynes, 2014; Nili et al. 2014).\n",
    "\n",
    "The cross-validated squared Euclidean distance–the so-called *crossclidian*–between two activity patterns $\\mathbf{b_i}$ and $\\mathbf{b_j}$ can be computed as: \n",
    "\n",
    "$$d^2_{\\text{Euclidean, cross-validated}}=(\\mathbf{b_i} - \\mathbf{b_j})_\\text{train}(\\mathbf{b_i} - \\mathbf{b_j})_\\text{test}^T$$\n",
    "\n",
    "\n",
    "where we partition the repeated measurements of the activity patterns into a training and testing set before computing the difference vectors independently.\n",
    "\n",
    "`rsatoolbox` has an implementation of the cross-validated distance. The general distance measure is called *crossnobis*, short for *cross-validated mahalanobis distance*. \n",
    "\n",
    "$$d^2_{\\text{Mahalanobis, cross-validated}}=(\\mathbf{b_i} - \\mathbf{b_j})_\\text{train}\\Sigma_{\\text{train}}^{-1}(\\mathbf{b_i} - \\mathbf{b_j})_\\text{test}^T$$\n",
    "\n",
    "If we assume the covariance noise structure is an identity matrix, then the crossnobis distance is equivalent to the cross-validated Euclidean distance."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9afb58f4-27f0-420d-b16d-28e6d11d4794",
   "metadata": {
    "execution": {}
   },
   "outputs": [],
   "source": [
    "noisy_rdm_crossclidean = {}\n",
    "for n_neurons in n_neurons_list:\n",
    "    noisy_rdm_crossclidean[n_neurons] = {}\n",
    "    noisy_rdm_crossclidean[n_neurons]['isotropic'] = calc_rdm(isotropic_noised_data[n_neurons], method='crossnobis', noise=None)\n",
    "    noisy_rdm_crossclidean[n_neurons]['correlated'] = calc_rdm(correlated_noised_data[n_neurons], method='crossnobis', noise=None)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d8de110a-155c-4ca2-aee7-5aacb1205ed0",
   "metadata": {
    "execution": {}
   },
   "source": [
    "Let's now plot the squared Euclidean distance and the cross-validated squared Euclidean distance against the true Euclidean distance for the 100-neuron dataset. Points falling on the diagonal line indicate an unbiased estimate of the distance; points above the diagonal line indicate an overestimation of the distance."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f44f812e-849f-4278-b887-4045112ceff6",
   "metadata": {
    "execution": {}
   },
   "outputs": [],
   "source": [
    "plot_estimated_distance(rdm_euclidean, noisy_rdm_euclidean, noisy_rdm_crossclidean, n_neurons=100)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cc2375c6-f5fb-4c7c-b60d-6d7e87d91b2b",
   "metadata": {
    "execution": {}
   },
   "source": [
    "In section 2, we used the Fisher Linear Discriminant Analysis (LDA) to decode stimuli and calculated decoding accuracy. The weight vector of the Fisher linear discriminant $\\mathbf{w}=(\\mathbf{b_i}-\\mathbf{b_j})_{\\text{train}}\\Sigma_{\\text{train}}^{-1}$. Do you notice any similarity with the cross-validated Mahalanobis distance $(\\mathbf{b_i} - \\mathbf{b_j})_\\text{train}\\Sigma_{\\text{train}}^{-1}(\\mathbf{b_i} - \\mathbf{b_j})_\\text{test}^T$?\n",
    "In fact, if the test dataset only consists of one observation from each of the two classes ($\\mathbf{b_i^{\\textbf{test}}}$ and $\\mathbf{b_j}^{\\textbf{test}}$) and we subtract the mean pattern from both the training and the test dataset, then both observations in the test set will be correctly classified if $(\\mathbf{b_i} - \\mathbf{b_j})_\\text{train}\\Sigma_{\\text{train}}^{-1}(\\mathbf{b_i} - \\mathbf{b_j})_\\text{test}^T>0$. \n",
    "The cross-validated Mahalanobis (crossnobis) distance is closely related to the Fisher linear discriminant and is also known as the *linear discriminant contrast* (*LDC*). In LDA, the discriminant makes a binary classification for each stimulus-related response measurement. Computing the accuracy requires thresholding, a form of discretization, and the accuracy will saturate at 100% if the two means are far apart. The crossnobis estimator (=LDC), by contrast, provides a continuous quantification of the discriminability between stimulus classes, avoiding discretization and saturation (Walther et al. 2016)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##  Submit your feedback\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c2ae60cc-e477-4b10-8326-b7cb60ec33aa",
   "metadata": {
    "cellView": "form",
    "execution": {},
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "# @title Submit your feedback\n",
    "content_review(f\"{feedback_prefix}_cross_validated_distances_prevent_the_inflation_of_distance_estimates_by_noise\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5b29dfe6-dec4-4514-b4c3-4bda9055a480",
   "metadata": {
    "execution": {}
   },
   "source": [
    "---\n",
    "# Section 4: The Johnson-Lindenstrauss Lemma\n",
    "\n",
    "The Johnson-Lindenstrauss Lemma says that random projections approximately preserve Euclidean distances. In particular, to approximately capture the distances of points in a very high-dimensional space, such as a neural response space (many neurons!), we do not necessarily need to measure all neurons. We can estimate the neural-population distances from a random sample of neurons. In fact, the number of random projections that suffices to achieve a given distortion $\\epsilon$ is not dependent on the number of original dimensions (neurons here), but only on the number of points (stimuli here).\n",
    "\n",
    "We will see that we can preserve the distances quite well when embedding high-dimensional points in a space of much lower dimension.\n",
    "\n",
    "This is a very important property of distances that makes them practically useable in neuroscience applications, where we get much fewer measurement dimensions than the number of neurons in the cortical area we are studying. For example, neural recordings capture a very small fraction of neurons but can be construed as a particular kind of random sample of projections. In functional MRI, voxels contain many neurons and give average responses. However, to the extent that voxels average random sets of neurons, they provide a set of random projections of the neural responses. The Johnson-Lindenstrauss Lemma gives us the hope that these few randomly chosen measurement dimensions will allow a meaningful approximation of the distances and, thus, meaningful comparisons between models and brains.\n",
    "\n",
    "In data science, random projections are an approach to handling big, high-dimensional data. The Johnson-Lindenstrauss Lemma is important across many disciplines."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1c0933d1-6ab1-4114-b1b4-05215ac74dfa",
   "metadata": {
    "execution": {}
   },
   "source": [
    "We choose one pair of stimuli for illustration purposes. After we project the data onto a dimension $m$ (where $m=2, 4, 8,\\ldots, 512$), we calculate the Euclidean distance between the stimulus pair in the projected space and compare it with the distance in the original space."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "10065087-fb1f-4e62-852a-b08e59a2069b",
   "metadata": {
    "execution": {}
   },
   "source": [
    "## Coding exercise 3: Random projections\n",
    "\n",
    "Generate a random projection matrix $A$ from the original space to the $d$ dimensional space inside the for loop. The entries of $A$ can be filled with random normal variables."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "11151d07-f096-40d2-8768-d85693fb23fc",
   "metadata": {
    "colab_type": "text",
    "execution": {}
   },
   "source": [
    "```python\n",
    "stim_idx = [0,1] # change stimulus index to visualize another pair of stimuli\n",
    "m_dims_list = np.power(2, np.arange(1,10))\n",
    "true_dist, projected_dist = {}, {}\n",
    "for i, n_neurons in enumerate(n_neurons_list):\n",
    "    data = clean_dataset[n_neurons].sel({\"stim\": stim_idx})\n",
    "    # Let's first recalculate the ground truth euclidean rdm again, without normalization by the number of neurons this time.\n",
    "    true_dist[n_neurons] = calc_rdm(data, method='euclidean', noise=None, normalize_by_channels=False).dissimilarities.item()\n",
    "\n",
    "    projected_dist[n_neurons]=[]\n",
    "    for m_dims in m_dims_list:\n",
    "        #################################################\n",
    "        raise NotImplementedError(\"Student exercise: generate matrix A which projects from dimensionality neurons's amount to d-dimensional space\")\n",
    "        #################################################\n",
    "        A = np.random.normal(loc=..., scale=..., size=(..., ...))\n",
    "        A *= np.sqrt(1/m_dims)\n",
    "        transformed_data = (data.values @ A)\n",
    "        transformed_data = np2xr(transformed_data, coords={'stim': data.stim.values, 'neuron': np.arange(m_dims)})\n",
    "        rdm = calc_rdm(transformed_data, method='euclidean', noise=None, normalize_by_channels=False)\n",
    "        projected_dist[n_neurons].append(rdm.dissimilarities.item())\n",
    "    projected_dist[n_neurons] = np.array(projected_dist[n_neurons])\n",
    "\n",
    "plot_distance_after_projection(true_dist, projected_dist, n_neurons_list, m_dims_list)\n",
    "\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4eeb51b9-2bb6-4e3a-9169-bf3bd837cdd8",
   "metadata": {
    "execution": {}
   },
   "outputs": [],
   "source": [
    "# to_remove solution\n",
    "\n",
    "stim_idx = [0,1] # change stimulus index to visualize another pair of stimuli\n",
    "m_dims_list = np.power(2, np.arange(1,10))\n",
    "true_dist, projected_dist = {}, {}\n",
    "for i, n_neurons in enumerate(n_neurons_list):\n",
    "    data = clean_dataset[n_neurons].sel({\"stim\": stim_idx})\n",
    "    # Let's first recalculate the ground truth euclidean rdm again, without normalization by the number of neurons this time.\n",
    "    true_dist[n_neurons] = calc_rdm(data, method='euclidean', noise=None, normalize_by_channels=False).dissimilarities.item()\n",
    "\n",
    "    projected_dist[n_neurons]=[]\n",
    "    for m_dims in m_dims_list:\n",
    "        A = np.random.normal(loc=0, scale=1, size=(n_neurons, m_dims))\n",
    "        A *= np.sqrt(1/m_dims)\n",
    "        transformed_data = (data.values @ A)\n",
    "        transformed_data = np2xr(transformed_data, coords={'stim': data.stim.values, 'neuron': np.arange(m_dims)})\n",
    "        rdm = calc_rdm(transformed_data, method='euclidean', noise=None, normalize_by_channels=False)\n",
    "        projected_dist[n_neurons].append(rdm.dissimilarities.item())\n",
    "    projected_dist[n_neurons] = np.array(projected_dist[n_neurons])\n",
    "\n",
    "plot_distance_after_projection(true_dist, projected_dist, n_neurons_list, m_dims_list)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "eb865b24-c894-4546-a66b-829c9b1d6f6f",
   "metadata": {
    "execution": {}
   },
   "source": [
    "Notice that the distance in the transformed space is close to the original distance for relatively small $k$."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1bbadb99-7fcd-4057-bc1e-e95fd44786eb",
   "metadata": {
    "execution": {}
   },
   "source": [
    "## Discussion\n",
    "\n",
    "1. Does the amount of distortion after projection depend on the dimension $d$ of the original space? Observe the dimension $k$ that preserves Euclidean distance up to a small distortion for both the 2-neuron and 100-neuron datasets.\n",
    "\n",
    "2. What is the distance between two identical stimuli after random projection?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "29ad7ab2-dd80-46df-b087-db086f5ca87e",
   "metadata": {
    "execution": {}
   },
   "outputs": [],
   "source": [
    "#to_remove explanation\n",
    "\n",
    "\"\"\"\n",
    "Discussion: 1. Does the amount of distortion after projection depend on the dimension $d$ of the original space? Observe the dimension $k$ that preserves Euclidean distance up to a small distortion for both the 2-neuron and 100-neuron datasets.\n",
    "\n",
    "2. What is the distance between two identical stimuli after random projection?\n",
    "\n",
    "1. No. Empirically, the dimension that preserves Euclidean distance up to a small distortion for the 100-neuron dataset is similar to the 2-neuron dataset. Theoretically, the distortion bound is independent of the original dimension (https://en.wikipedia.org/wiki/Johnson%E2%80%93Lindenstrauss_lemma).\n",
    "\n",
    "2. The distance is always 0.\n",
    "\"\"\";"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "###  Submit your feedback\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1cabc7df-6889-48d7-bc7b-12f1885a4df9",
   "metadata": {
    "cellView": "form",
    "execution": {},
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "# @title Submit your feedback\n",
    "content_review(f\"{feedback_prefix}_random_projections\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "42ae5467-7a54-4250-826d-98f9472a9b47",
   "metadata": {
    "execution": {}
   },
   "source": [
    "---\n",
    "# Summary\n",
    "\n",
    "*Estimated timing of tutorial: 1 hour*"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "94c4d830-db7c-40df-8b03-57102b35e184",
   "metadata": {
    "execution": {}
   },
   "source": [
    "In this tutorial, we have learned:\n",
    "\n",
    "1. The differences between the (squared) Euclidean and Mahalanobis distance measures. The Mahalanobis distance takes into account the noise covariances between neurons, while the Euclidean distance assumes isotropic noise.\n",
    "2. Representational distance reflects discriminability (decodability) between stimulus pairs (Kriegeskorte & Diedrichsen, 2019).\n",
    "   - If we assume additive Gaussian noise that is independent and identically distributed across neurons (isotropic) and stimuli (homoscedastic), then the Euclidean distance in the multivariate response space precisely defines the discriminability of a pair of stimuli in the representation.\n",
    "   - If we assume that the noise is correlated across neurons (nonisotropic) and i.i.d across stimuli (homoscedastic), then the Mahalanobis distance defines the discriminability.\n",
    "3. Cross-validated distance estimators (cross-validated Euclidean or Mahalanobis distance) can remove the positive bias introduced by noise.\n",
    "4. The Johnson–Lindenstrauss Lemma shows that random projections preserve the Euclidean distance with some distortions. Crucially, the distortion does not depend on the dimensionality of the original space."
   ]
  }
 ],
 "metadata": {
  "colab": {
   "collapsed_sections": [],
   "include_colab_link": true,
   "name": "W1D3_Tutorial4",
   "provenance": [],
   "toc_visible": true
  },
  "kernel": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.19"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}