{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "execution": {},
    "id": "view-in-github"
   },
   "source": [
    "<a href=\"https://colab.research.google.com/github/neuromatch/NeuroAI_Course/blob/main/tutorials/W2D4_Macrolearning/student/W2D4_Tutorial5.ipynb\" target=\"_blank\"><img alt=\"Open In Colab\" src=\"https://colab.research.google.com/assets/colab-badge.svg\"/></a>   <a href=\"https://kaggle.com/kernels/welcome?src=https://raw.githubusercontent.com/neuromatch/NeuroAI_Course/main/tutorials/W2D4_Macrolearning/student/W2D4_Tutorial5.ipynb\" target=\"_blank\"><img alt=\"Open in Kaggle\" src=\"https://kaggle.com/static/images/open-in-kaggle.svg\"/></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "execution": {}
   },
   "source": [
    "# Tutorial 5: Replay\n",
    "\n",
    "**Week 2, Day 4: Macro-Learning**\n",
    "\n",
    "**By Neuromatch Academy**\n",
    "\n",
    "__Content creators:__ Hlib Solodzhuk, Ximeng Mao, Grace Lindsay\n",
    "\n",
    "__Content reviewers:__ Aakash Agrawal, Alish Dipani, Hossein Rezaei, Yousef Ghanbari, Mostafa Abdollahi, Hlib Solodzhuk, Ximeng Mao, Grace Lindsay\n",
    "\n",
    "__Production editors:__ Konstantine Tsafatinos, Ella Batty, Spiros Chavlis, Samuele Bolotta, Hlib Solodzhuk\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "execution": {}
   },
   "source": [
    "___\n",
    "\n",
    "\n",
    "# Tutorial Objectives\n",
    "\n",
    "*Estimated timing of tutorial: 40 minutes*\n",
    "\n",
    "In this tutorial, you will discover what replay is and how it helps with continual learning."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "cellView": "form",
    "execution": {},
    "tags": [
     "remove-input"
    ]
   },
   "outputs": [],
   "source": [
    "# @markdown\n",
    "from IPython.display import IFrame\n",
    "from ipywidgets import widgets\n",
    "out = widgets.Output()\n",
    "with out:\n",
    "    print(f\"If you want to download the slides: https://osf.io/download/t36w8/\")\n",
    "    display(IFrame(src=f\"https://mfr.ca-1.osf.io/render?url=https://osf.io/t36w8/?direct%26mode=render%26action=download%26mode=render\", width=730, height=410))\n",
    "display(out)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "execution": {}
   },
   "source": [
    "---\n",
    "# Setup\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##  Install and import feedback gadget\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "cellView": "form",
    "execution": {},
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "# @title Install and import feedback gadget\n",
    "\n",
    "!pip install numpy matplotlib scikit-learn ipywidgets jupyter-ui-poll torch vibecheck --quiet\n",
    "\n",
    "from vibecheck import DatatopsContentReviewContainer\n",
    "def content_review(notebook_section: str):\n",
    "    return DatatopsContentReviewContainer(\n",
    "        \"\",  # No text prompt\n",
    "        notebook_section,\n",
    "        {\n",
    "            \"url\": \"https://pmyvdlilci.execute-api.us-east-1.amazonaws.com/klab\",\n",
    "            \"name\": \"neuromatch_neuroai\",\n",
    "            \"user_key\": \"wb2cxze8\",\n",
    "        },\n",
    "    ).render()\n",
    "\n",
    "\n",
    "feedback_prefix = \"W2D4_T5\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##  Imports\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "cellView": "form",
    "execution": {},
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "# @title Imports\n",
    "\n",
    "#working with data\n",
    "import numpy as np\n",
    "import random\n",
    "\n",
    "#plotting\n",
    "import matplotlib.pyplot as plt\n",
    "import logging\n",
    "from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix\n",
    "\n",
    "#interactive display\n",
    "import ipywidgets as widgets\n",
    "from IPython.display import display, clear_output\n",
    "from jupyter_ui_poll import ui_events\n",
    "import time\n",
    "from tqdm.notebook import tqdm\n",
    "\n",
    "#modeling\n",
    "import copy\n",
    "import torch\n",
    "import torch.nn as nn\n",
    "import torch.nn.functional as F\n",
    "import torch.optim as optim\n",
    "from torch.autograd import Variable"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##  Figure settings\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "cellView": "form",
    "execution": {},
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "# @title Figure settings\n",
    "\n",
    "logging.getLogger('matplotlib.font_manager').disabled = True\n",
    "\n",
    "%matplotlib inline\n",
    "%config InlineBackend.figure_format = 'retina' # perfrom high definition rendering for images and plots\n",
    "plt.style.use(\"https://raw.githubusercontent.com/NeuromatchAcademy/course-content/main/nma.mplstyle\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##  Plotting functions\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "cellView": "form",
    "execution": {},
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "# @title Plotting functions\n",
    "\n",
    "def plot_rewards(rewards, max_rewards):\n",
    "    \"\"\"\n",
    "    Plot the rewards over time.\n",
    "\n",
    "    Inputs:\n",
    "    - rewards (list): list containing the rewards at each time step.\n",
    "    - max_rewards(list): list containing the maximum rewards at each time step.\n",
    "    \"\"\"\n",
    "    with plt.xkcd():\n",
    "        plt.plot(range(len(rewards)), rewards, marker='o', label = \"Obtained Reward\")\n",
    "        plt.plot(range(len(max_rewards)), max_rewards, marker='*', label = \"Maximum Reward\")\n",
    "        plt.xlabel('Time Step')\n",
    "        plt.ylabel('Reward Value')\n",
    "        plt.title('Reward Over Time')\n",
    "        plt.yticks(np.arange(0, 5, 1))\n",
    "        plt.xticks(np.arange(0, len(rewards), 1))\n",
    "        plt.legend()\n",
    "        plt.show()\n",
    "\n",
    "def plot_confusion_matrix(rewards, max_rewards, mode = 1):\n",
    "    \"\"\"\n",
    "    Plots the confusion matrix for the chosen rewards and the maximum ones.\n",
    "\n",
    "    Inputs:\n",
    "    - rewards (list): list containing the rewards at each time step.\n",
    "    - max_rewards (list): list containing the maximum rewards at each time step.\n",
    "    - mode (int, default = 1): mode of the environment.\n",
    "    \"\"\"\n",
    "    with plt.xkcd():\n",
    "\n",
    "      all_colors = [color for color in mode_colors[mode]]\n",
    "\n",
    "      cm = confusion_matrix(max_rewards, rewards)\n",
    "\n",
    "      missing_classes = np.setdiff1d(np.array([color_names_rewards[color_name] for color_name in all_colors]), np.unique(max_rewards + rewards))\n",
    "      for cls in missing_classes:\n",
    "          cm = np.insert(cm, cls - 1, 0, axis=0)\n",
    "          cm = np.insert(cm, cls - 1, 0, axis=1)\n",
    "\n",
    "      cm = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels = all_colors)\n",
    "      cm.plot()\n",
    "      plt.xlabel(\"Chosen color\")\n",
    "      plt.ylabel(\"Maximum-reward color\")\n",
    "      plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##  Helper functions\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "cellView": "form",
    "execution": {},
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "# @title Helper functions\n",
    "\n",
    "def run_dummy_agent(env):\n",
    "    \"\"\"\n",
    "    Implement dummy agent strategy: chooses random action.\n",
    "\n",
    "    Inputs:\n",
    "    - env (ChangingEnv): An environment.\n",
    "    \"\"\"\n",
    "    action = 0\n",
    "    rewards = [0]\n",
    "    max_rewards = [0]\n",
    "\n",
    "    for _ in (range(num_trials)):\n",
    "        _, reward, max_reward = env.step(action)\n",
    "        rewards.append(reward)\n",
    "        max_rewards.append(max_reward)\n",
    "\n",
    "        #dummy agent\n",
    "        if np.random.random() < 0.5:\n",
    "            action = 1 - action #change action\n",
    "    return rewards, max_rewards\n",
    "\n",
    "color_names_rewards = {\n",
    "    \"red\": 1,\n",
    "    \"yellow\": 2,\n",
    "    \"green\": 3,\n",
    "    \"blue\": 4\n",
    "}\n",
    "\n",
    "color_names_values = {\n",
    "    \"red\": [255, 0, 0],\n",
    "    \"yellow\": [255, 255, 0],\n",
    "    \"green\": [0, 128, 0],\n",
    "    \"blue\": [0, 0, 255]\n",
    "}\n",
    "\n",
    "first_mode = [\"red\", \"yellow\", \"green\"]\n",
    "second_mode = [\"red\", \"green\", \"blue\"]\n",
    "\n",
    "mode_colors = {\n",
    "    1: first_mode,\n",
    "    2: second_mode\n",
    "}\n",
    "\n",
    "def game():\n",
    "    \"\"\"\n",
    "    Create interactive game for this tutorial.\n",
    "    \"\"\"\n",
    "\n",
    "    total_reward = 0\n",
    "    message = \"Start of the game!\"\n",
    "\n",
    "    left_button = widgets.Button(description=\"Left\")\n",
    "    right_button = widgets.Button(description=\"Right\")\n",
    "    button_box = widgets.HBox([left_button, right_button])\n",
    "\n",
    "    def define_choice(button):\n",
    "        \"\"\"\n",
    "        Change `choice` variable with respect to the pressed button.\n",
    "        \"\"\"\n",
    "        nonlocal choice\n",
    "        display(widgets.HTML(f\"<h3>{button.description}</h3>\"))\n",
    "        print(button.description)\n",
    "        if button.description == \"Left\":\n",
    "            choice = 0\n",
    "        else:\n",
    "            choice = 1\n",
    "\n",
    "    left_button.on_click(define_choice)\n",
    "    right_button.on_click(define_choice)\n",
    "\n",
    "    attempt = 0\n",
    "    total_attempts = 30\n",
    "\n",
    "    for mode in [first_mode, second_mode]:\n",
    "        for index in range(15):\n",
    "            attempt += 1\n",
    "            start_time = time.time()\n",
    "            first_color, second_color = np.random.choice(mode, 2, replace=False)\n",
    "            clear_output()\n",
    "            display(widgets.HTML(f\"<h3>{message}</h3>\"))\n",
    "            display(widgets.HTML(f\"<h3>Total reward: {total_reward}</h3>\"))\n",
    "            display(widgets.HTML(f\"<h4>Attempt {attempt} of {total_attempts}</h4>\"))\n",
    "            display(widgets.HTML(f\"<h4>Objects:</h4>\"))\n",
    "\n",
    "            with plt.xkcd():\n",
    "                fig, axs = plt.subplots(1, 2, figsize=(8, 4))\n",
    "\n",
    "                axs[0].add_patch(plt.Circle((0.5, 0.5), 0.3, color=first_color))\n",
    "                axs[0].set_xlim(0, 1)\n",
    "                axs[0].set_ylim(0, 1)\n",
    "                axs[0].axis('off')\n",
    "\n",
    "                axs[1].add_patch(plt.Circle((0.5, 0.5), 0.3, color=second_color))\n",
    "                axs[1].set_xlim(0, 1)\n",
    "                axs[1].set_ylim(0, 1)\n",
    "                axs[1].axis('off')\n",
    "\n",
    "                plt.show()\n",
    "\n",
    "            display(widgets.HTML(\"<h4>Choose Left or Right:</h4>\"))\n",
    "            display(button_box)\n",
    "\n",
    "            choice = -1\n",
    "            with ui_events() as poll:\n",
    "                while choice == -1:\n",
    "                    poll(10)\n",
    "                    time.sleep(0.1)\n",
    "                    if time.time() - start_time > 60:\n",
    "                        return\n",
    "            if choice == 0:\n",
    "                reward = color_names_rewards[first_color]\n",
    "            else:\n",
    "                reward = color_names_rewards[second_color]\n",
    "            total_reward += reward\n",
    "            message = f\"You received a reward of +{reward}.\"\n",
    "    clear_output()\n",
    "    display(widgets.HTML(f\"<h3>Your total reward: {total_reward}. Congratulations! Do you have any idea what you should do to maximize the reward?</h3>\"))\n",
    "\n",
    "class ReplayBufferSolution():\n",
    "    def __init__(self, max_experience = 250, num_trials = 100):\n",
    "        \"\"\"Initialize replay buffer.\n",
    "        Notice that when replay buffer is full of experience and new one should be remembered, it replaces existing ones, starting\n",
    "        from the oldest.\n",
    "\n",
    "        Inputs:\n",
    "        - max_experience (int, default = 250): the maximum number of experience (gradient steps) which can be stored.\n",
    "        - num_trials (int, default = 100): number of times the agent is exposed to the environment per gradient step to be trained.\n",
    "        \"\"\"\n",
    "        self.max_experience = max_experience\n",
    "\n",
    "        #variable which fully describe experience\n",
    "        self.losses = [0 for _ in range(self.max_experience)]\n",
    "\n",
    "        #number of memory cell to point to (write or overwrite experience)\n",
    "        self.writing_pointer = 0\n",
    "        self.reading_pointer = 0\n",
    "\n",
    "        #to keep track how many experience there were\n",
    "        self.num_experience = 0\n",
    "\n",
    "    def write_experience(self, loss):\n",
    "        \"\"\"Write new experience.\"\"\"\n",
    "        self.losses[self.writing_pointer] = loss\n",
    "\n",
    "        #so that pointer is in range of max_experience and will point to the older experience while full\n",
    "        self.writing_pointer = (self.writing_pointer + 1) % self.max_experience\n",
    "        self.num_experience += 1\n",
    "\n",
    "    def read_experience(self):\n",
    "        \"\"\"Read existing experience.\"\"\"\n",
    "        loss = self.losses[self.reading_pointer]\n",
    "\n",
    "        #so that pointer is in range of self.max_experience and will point to the older experience while full\n",
    "        self.reading_pointer = (self.reading_pointer + 1) % min(self.max_experience, self.num_experience)\n",
    "        return loss"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##  Data retrieval\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "cellView": "form",
    "execution": {},
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "# @title Data retrieval\n",
    "\n",
    "import os\n",
    "import requests\n",
    "import hashlib\n",
    "\n",
    "# Variables for file and download URL\n",
    "fnames = [\"FirstModeAgent.pt\", \"SecondModeAgent.pt\"] # The names of the files to be downloaded\n",
    "urls = [\"https://osf.io/zuxc4/download\", \"https://osf.io/j9kht/download\"] # URLs from where the files will be downloaded\n",
    "expected_md5s = [\"eca5aa69751dad8ca06742c819f2dc76\", \"cdd0338d0b40ade20d6433cd615aaa82\"] # MD5 hashes for verifying files integrity\n",
    "\n",
    "for fname, url, expected_md5 in zip(fnames, urls, expected_md5s):\n",
    "    if not os.path.isfile(fname):\n",
    "        try:\n",
    "            # Attempt to download the file\n",
    "            r = requests.get(url) # Make a GET request to the specified URL\n",
    "        except requests.ConnectionError:\n",
    "            # Handle connection errors during the download\n",
    "            print(\"!!! Failed to download data !!!\")\n",
    "        else:\n",
    "            # No connection errors, proceed to check the response\n",
    "            if r.status_code != requests.codes.ok:\n",
    "                # Check if the HTTP response status code indicates a successful download\n",
    "                print(\"!!! Failed to download data !!!\")\n",
    "            elif hashlib.md5(r.content).hexdigest() != expected_md5:\n",
    "                # Verify the integrity of the downloaded file using MD5 checksum\n",
    "                print(\"!!! Data download appears corrupted !!!\")\n",
    "            else:\n",
    "                # If download is successful and data is not corrupted, save the file\n",
    "                with open(fname, \"wb\") as fid:\n",
    "                    fid.write(r.content) # Write the downloaded content to a file"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##  Set random seed\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "cellView": "form",
    "execution": {},
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "# @title Set random seed\n",
    "\n",
    "import random\n",
    "import numpy as np\n",
    "import torch\n",
    "\n",
    "def set_seed(seed=None, seed_torch=True):\n",
    "  if seed is None:\n",
    "    seed = np.random.choice(2 ** 32)\n",
    "  random.seed(seed)\n",
    "  np.random.seed(seed)\n",
    "  if seed_torch:\n",
    "    torch.manual_seed(seed)\n",
    "    torch.cuda.manual_seed_all(seed)\n",
    "    torch.cuda.manual_seed(seed)\n",
    "    torch.backends.cudnn.benchmark = False\n",
    "    torch.backends.cudnn.deterministic = True\n",
    "\n",
    "set_seed(seed = 42)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "execution": {}
   },
   "source": [
    "---\n",
    "\n",
    "# Section 0: Let's play a new game!\n",
    "\n",
    "As in the previous tutorial, this one is going to be focused on an RL setup, thus, we would like you to play a slightly different game to get an idea of what the agent is going to learn. The rules are the same: you need to pick one of two displayed objects. Please watch any exciting patterns and observations and discuss them with your group before going to the video."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##  Make sure you execute this cell to play the game!\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "cellView": "form",
    "execution": {},
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "# @title Make sure you execute this cell to play the game!\n",
    "\n",
    "game()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##  Submit your feedback\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "cellView": "form",
    "execution": {},
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "# @title Submit your feedback\n",
    "content_review(f\"{feedback_prefix}_new_game\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##  Video 1: Replay\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "cellView": "form",
    "execution": {},
    "tags": [
     "remove-input"
    ]
   },
   "outputs": [],
   "source": [
    "# @title Video 1: Replay\n",
    "\n",
    "from ipywidgets import widgets\n",
    "from IPython.display import YouTubeVideo\n",
    "from IPython.display import IFrame\n",
    "from IPython.display import display\n",
    "\n",
    "class PlayVideo(IFrame):\n",
    "  def __init__(self, id, source, page=1, width=400, height=300, **kwargs):\n",
    "    self.id = id\n",
    "    if source == 'Bilibili':\n",
    "      src = f'https://player.bilibili.com/player.html?bvid={id}&page={page}'\n",
    "    elif source == 'Osf':\n",
    "      src = f'https://mfr.ca-1.osf.io/render?url=https://osf.io/download/{id}/?direct%26mode=render'\n",
    "    super(PlayVideo, self).__init__(src, width, height, **kwargs)\n",
    "\n",
    "def display_videos(video_ids, W=400, H=300, fs=1):\n",
    "  tab_contents = []\n",
    "  for i, video_id in enumerate(video_ids):\n",
    "    out = widgets.Output()\n",
    "    with out:\n",
    "      if video_ids[i][0] == 'Youtube':\n",
    "        video = YouTubeVideo(id=video_ids[i][1], width=W,\n",
    "                             height=H, fs=fs, rel=0)\n",
    "        print(f'Video available at https://youtube.com/watch?v={video.id}')\n",
    "      else:\n",
    "        video = PlayVideo(id=video_ids[i][1], source=video_ids[i][0], width=W,\n",
    "                          height=H, fs=fs, autoplay=False)\n",
    "        if video_ids[i][0] == 'Bilibili':\n",
    "          print(f'Video available at https://www.bilibili.com/video/{video.id}')\n",
    "        elif video_ids[i][0] == 'Osf':\n",
    "          print(f'Video available at https://osf.io/{video.id}')\n",
    "      display(video)\n",
    "    tab_contents.append(out)\n",
    "  return tab_contents\n",
    "\n",
    "video_ids = [('Youtube', 'Oc8PpAh9exw'), ('Bilibili', 'BV1UM4m1U7Cy')]\n",
    "tab_contents = display_videos(video_ids, W=730, H=410)\n",
    "tabs = widgets.Tab()\n",
    "tabs.children = tab_contents\n",
    "for i in range(len(tab_contents)):\n",
    "  tabs.set_title(i, video_ids[i][0])\n",
    "display(tabs)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##  Submit your feedback\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "cellView": "form",
    "execution": {},
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "# @title Submit your feedback\n",
    "content_review(f\"{feedback_prefix}_replay\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "execution": {}
   },
   "source": [
    "---\n",
    "\n",
    "# Section 1: Changing Environment\n",
    "\n",
    "As mentioned in the video, to study replay, we need to use a slightly different task inspired by the Harlow task, which creates an incentive to remember past data. In this section, we will introduce this new task environment, which replicates the game you played."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "execution": {}
   },
   "source": [
    "## Exercise 1: Colorful State\n",
    "\n",
    "For this tutorial, each state will be represented by its color (via its RGB values; thus, it is a vector of 3 values), and each color is associated with a stable reward that remains unchanged over time (the rewards will correspond to the position of the color in the rainbow).\n",
    "\n",
    "While the reward associated with each color does not change over time, the colors presented to the agent will change. Specifically, on each trial, the agent is presented with two colors and should choose the one associated with a higher reward. Initially (in 'mode 1'), colors will be chosen from a set of 3 possible colors. Over time, one of these colors will be replaced by another, creating a different set of three possible colors ('mode 2'). This constitutes a covariate distribution shift and may cause the agent to forget the reward associated with the dropped color."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {}
   },
   "outputs": [],
   "source": [
    "color_names_rewards = {\n",
    "    \"red\": 1,\n",
    "    \"yellow\": 2,\n",
    "    \"green\": 3,\n",
    "    \"blue\": 4\n",
    "}\n",
    "\n",
    "color_names_values = {\n",
    "    \"red\": [255, 0, 0],\n",
    "    \"yellow\": [255, 255, 0],\n",
    "    \"green\": [0, 128, 0],\n",
    "    \"blue\": [0, 0, 255]\n",
    "}\n",
    "\n",
    "first_mode = [\"red\", \"yellow\", \"green\"]\n",
    "second_mode = [\"red\", \"green\", \"blue\"]\n",
    "\n",
    "mode_colors = {\n",
    "    1: first_mode,\n",
    "    2: second_mode\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {}
   },
   "outputs": [],
   "source": [
    "class ChangingEnv():\n",
    "    def __init__(self, mode = 1):\n",
    "        \"\"\"Initialize changing environment.\n",
    "\n",
    "        Inputs:\n",
    "        - mode (int, default = 1): defines mode of the enviornment. Should be only 1 or 2.\n",
    "        \"\"\"\n",
    "        if mode not in [1, 2]:\n",
    "            raise ValueError(\"Mode is out of allowed range. Please consider entering 1 or 2 as digit.\")\n",
    "\n",
    "        self.mode = mode\n",
    "        self.colors = mode_colors[self.mode]\n",
    "        self.update_state()\n",
    "\n",
    "    def update_state(self):\n",
    "        \"\"\"Update state which depends on the mode of the environment.\"\"\"\n",
    "        self.first_color, self.second_color = np.random.choice(self.colors, 2, replace = False)\n",
    "        self.color_state = np.array([self.first_color, self.second_color])\n",
    "        self.state = np.array([color_names_values[self.first_color], color_names_values[self.second_color]])\n",
    "\n",
    "    def reset(self, mode = 1):\n",
    "        \"\"\"Reset environment by updating its mode (colors to sample from). Set the first state in the given mode.\"\"\"\n",
    "        self.mode = mode\n",
    "        self.colors = mode_colors[self.mode]\n",
    "        self.update_state()\n",
    "        return self.state\n",
    "\n",
    "    def step(self, action):\n",
    "        \"\"\"Evaluate agent's perfromance, return reward, max reward (for tracking agent's performance) and next observation.\"\"\"\n",
    "        feedback = color_names_rewards[self.color_state[action]]\n",
    "        max_feedback = np.max([color_names_rewards[self.color_state[action]], color_names_rewards[self.color_state[1 - action]]])\n",
    "        self.update_state()\n",
    "        return self.state, feedback, max_feedback"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "execution": {}
   },
   "source": [
    "As in the previous tutorial, let us test the environment with a dummy agent. For this particular environment (in mode 1), we will use a random strategy — just select one of the two colors by tossing a fair coin."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "###  Make sure you execute this cell to observe the plot!\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "cellView": "form",
    "execution": {},
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "# @title Make sure you execute this cell to observe the plot!\n",
    "\n",
    "set_seed(42)\n",
    "num_trials = 20\n",
    "env = ChangingEnv()\n",
    "env.reset()\n",
    "rewards, max_rewards = run_dummy_agent(env)\n",
    "\n",
    "plot_rewards(rewards, max_rewards)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "execution": {}
   },
   "source": [
    "Observe that the maximum reward is always higher than the obtained reward or coincides with it (when the agent luckily chooses a more rewarded color)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "###  Submit your feedback\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "cellView": "form",
    "execution": {},
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "# @title Submit your feedback\n",
    "content_review(f\"{feedback_prefix}_colorful_state\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "execution": {}
   },
   "source": [
    "---\n",
    "\n",
    "# Section 2: A2C Agent in Changing Environment\n",
    "\n",
    "*Estimated timing to here from start of tutorial: 10 minutes*\n",
    "\n",
    "**For now, simply run the following 2 cells (`ActorCritic` class and `train_agent` function) without exploring their content. You can come back to the code if you have time at the end.**\n",
    "\n",
    "Welcome back our friend from the previous tutorial, the A2C agent ;) Here, we have slightly modified the architecture (replacing LSTM cells with a single linear layer with ReLUs on top of it). The variable `num_inputs` has also been changed, as the input is now represented by a 3-dimensional vector instead of a single digit. Moreover, we will separate the training and evaluation functions, as we don't have a \"task\" and \"meta-space of tasks\" notion here, so we don't need to keep track of this."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {}
   },
   "outputs": [],
   "source": [
    "class ActorCritic(nn.Module):\n",
    "    def __init__(self, hidden_size, num_inputs = 9, num_actions = 2):\n",
    "        \"\"\"Initialize Actor-Critic agent.\"\"\"\n",
    "        super(ActorCritic, self).__init__()\n",
    "\n",
    "        #num_actions is 2 because left/right hand\n",
    "        self.num_actions = num_actions\n",
    "\n",
    "        #num_inputs is 9 because one-hot encoding of action (2) + reward (1) + previous state (2*3 = 6)\n",
    "        self.num_inputs = num_inputs\n",
    "\n",
    "        self.hidden_size = hidden_size\n",
    "\n",
    "        #hyperparameters involved in training (important to keep assigned to the agent)\n",
    "        self.learning_rate = 0.00075 #learning rate for optimizer\n",
    "        self.discount_factor = 0.91 #gamma\n",
    "        self.state_value_estimate_cost = 0.4 #beta_v\n",
    "        self.entropy_cost = 0.001 #beta_e\n",
    "\n",
    "        self.emb = nn.Linear(num_inputs, hidden_size)\n",
    "        self.linear1 = nn.Linear(hidden_size, hidden_size)\n",
    "        self.relu1 = nn.ReLU()\n",
    "        self.critic_linear = nn.Linear(hidden_size, 1)\n",
    "        self.actor_linear = nn.Linear(hidden_size, num_actions)\n",
    "\n",
    "    def forward(self, state):\n",
    "        \"\"\"Implement forward pass through agent.\"\"\"\n",
    "        #at first, input goes through embedding\n",
    "        state = F.linear(state.unsqueeze(0), self.emb.weight.clone(), self.emb.bias)\n",
    "        state = self.relu1(F.linear(state, self.linear1.weight.clone(), self.linear1.bias))\n",
    "\n",
    "        #critic -> value\n",
    "        value = F.linear(state, self.critic_linear.weight.clone(), self.critic_linear.bias)\n",
    "\n",
    "        #actor -> policy\n",
    "        policy_logits = F.linear(state, self.actor_linear.weight.clone(), self.actor_linear.bias)\n",
    "\n",
    "        return value, policy_logits"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "execution": {}
   },
   "source": [
    "In the cell below, we define the training procedure for the A2C agent and its evaluation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {}
   },
   "outputs": [],
   "source": [
    "def train_agent(env, agent, optimizer_func, mode = 1, num_gradient_steps = 1000, num_trials = 100):\n",
    "    \"\"\"Training for agent in changing colorful environment.\n",
    "    Observe that training happens for one particular mode.\n",
    "\n",
    "    Inputs:\n",
    "    - env (ChangingEnv): environment.\n",
    "    - agent (ActorCritic): particular instance of Actor Critic agent to train.\n",
    "    - optimizer_func (torch.Optim): optimizer to use for training.\n",
    "    - mode (int, default = 1): mode of the environment.\n",
    "    - num_gradient_steps (int, default = 1000): number of gradient steps to perform.\n",
    "    - num_trials (int, default = 200): number of times the agent is exposed to the environment per gradient step to be trained.\n",
    "    \"\"\"\n",
    "\n",
    "    #reset environment\n",
    "    state = env.reset(mode = mode)\n",
    "\n",
    "    #define optimizer\n",
    "    optimizer = optimizer_func(agent.parameters(), agent.learning_rate, eps = 1e-5)\n",
    "\n",
    "    for _ in range(num_gradient_steps):\n",
    "\n",
    "      #for storing variables for training\n",
    "      log_probs = []\n",
    "      values = []\n",
    "      rewards = []\n",
    "      entropy_term = torch.tensor(0.)\n",
    "\n",
    "      #start conditions\n",
    "      preceding_reward = torch.Tensor([0])\n",
    "      preceding_action = torch.Tensor([0, 0])\n",
    "\n",
    "      for trial in range(num_trials):\n",
    "          #state + reward + one-hot encoding of action; notice that we normalize state before pass to agent!\n",
    "          full_state = torch.cat((torch.from_numpy(state.flatten() / 255).float(), preceding_reward, preceding_action), dim = 0)\n",
    "          value, policy_logits = agent(full_state)\n",
    "          value = value.squeeze(0)\n",
    "\n",
    "          #sample action from policy\n",
    "          dist = torch.distributions.Categorical(logits=policy_logits.squeeze(0))\n",
    "          action = dist.sample()\n",
    "\n",
    "          #perform action to get reward and new state\n",
    "          new_state, reward, _ = env.step(action)\n",
    "\n",
    "          #we normalize reward too\n",
    "          reward /= 4\n",
    "\n",
    "          #update preceding variables\n",
    "          preceding_reward = torch.Tensor([reward])\n",
    "          preceding_action = F.one_hot(action, num_classes=2).float()\n",
    "          state = new_state\n",
    "\n",
    "          #for training\n",
    "          log_prob = dist.log_prob(action)\n",
    "          entropy = dist.entropy()\n",
    "          rewards.append(reward)\n",
    "          values.append(value)\n",
    "          log_probs.append(log_prob)\n",
    "          entropy_term += entropy\n",
    "\n",
    "      #calculataing loss\n",
    "      Qval = 0\n",
    "      Qvals = torch.zeros(len(rewards))\n",
    "      for t in reversed(range(len(rewards))):\n",
    "        Qval = rewards[t] + agent.discount_factor * Qval\n",
    "        Qvals[t] = Qval\n",
    "      values = torch.stack(values)\n",
    "      log_probs = torch.stack(log_probs)\n",
    "      advantage = Qvals - values\n",
    "      actor_loss = (-log_probs * advantage.detach()).mean()\n",
    "      critic_loss = advantage.pow(2).mean()\n",
    "      entropy_term = entropy_term / num_trials\n",
    "\n",
    "      #loss incorporates actor/critic terms + entropy\n",
    "      loss = actor_loss + agent.state_value_estimate_cost * critic_loss - agent.entropy_cost * entropy_term\n",
    "\n",
    "      optimizer.zero_grad()\n",
    "      loss.backward()\n",
    "      optimizer.step()\n",
    "\n",
    "def evaluate_agent(env, agent, mode = 1, num_evaluation_trials = 20):\n",
    "    \"\"\"Evaluation for agent in changing colorful environment.\n",
    "    Observe that evaluation happens for one particular mode which can differ from training one.\n",
    "\n",
    "    Inputs:\n",
    "    - env (ChangingEnv): environment.\n",
    "    - agent (ActorCritic): particular instance of Actor Critic agent to train.\n",
    "    - mode (int, default = 1): mode of the environment.\n",
    "    - num_evaluation_trials (int, default = 20): number of times the agent is exposed to the environment to evaluate it (no training happend during this phase).\n",
    "\n",
    "    Outputs:\n",
    "    - scores (list): rewards over all trials of evaluation.\n",
    "    - max_scores (list): maximum rewards over all trials of evaluation.\n",
    "    \"\"\"\n",
    "    #reset environment\n",
    "    state = env.reset(mode = mode)\n",
    "    scores = []\n",
    "    max_scores = []\n",
    "\n",
    "    #start conditions\n",
    "    preceding_reward = torch.Tensor([0])\n",
    "    preceding_action = torch.Tensor([0, 0])\n",
    "\n",
    "    for _ in range(num_evaluation_trials):\n",
    "\n",
    "      #state + reward + one-hot encoding of action; notice that we normalize state before pass to agent!\n",
    "      full_state = torch.cat((torch.from_numpy(state.flatten() / 255).float(), preceding_reward, preceding_action), dim = 0)\n",
    "      value, policy_logits = agent(full_state)\n",
    "      value = value.squeeze(0)\n",
    "\n",
    "      #sample action from policy\n",
    "      dist = torch.distributions.Categorical(logits=policy_logits.squeeze(0))\n",
    "      action = dist.sample()\n",
    "\n",
    "      #perform action to get reward and new state\n",
    "      new_state, reward, max_reward = env.step(action)\n",
    "\n",
    "      #update preceding variables; we normalize reward too\n",
    "      preceding_reward = torch.Tensor([reward / 4])\n",
    "      preceding_action = F.one_hot(action, num_classes=2).float()\n",
    "      state = new_state\n",
    "\n",
    "      #add reward to the scores of agent\n",
    "      scores.append(reward)\n",
    "      max_scores.append(max_reward)\n",
    "\n",
    "    return scores, max_scores"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "execution": {}
   },
   "source": [
    "In the following code cell, let's observe the agent's performance on the first mode after being trained on it. As the training of the agent takes around 3 minutes, we have provided you with an already trained version (but feel free to uncomment the training code to achieve the same results). You will also have the opportunity to train the agent from scratch in the next section!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "###  Make sure you execute this cell to observe the plot!\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "cellView": "form",
    "execution": {},
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "# @title Make sure you execute this cell to observe the plot!\n",
    "\n",
    "set_seed(42)\n",
    "\n",
    "#define environment\n",
    "env = ChangingEnv()\n",
    "\n",
    "#load agent\n",
    "agent = torch.load(\"FirstModeAgent.pt\")\n",
    "\n",
    "#train agent\n",
    "##UNCOMMENT TO TRAIN\n",
    "\n",
    "# agent = ActorCritic(hidden_size = 100)\n",
    "# optimizer_func = optim.RMSprop\n",
    "# train_agent(env, agent, optimizer_func)\n",
    "\n",
    "##UNCOMMENT TO TRAIN\n",
    "\n",
    "#evaluate agent\n",
    "rewards, max_rewards = evaluate_agent(env, agent)\n",
    "plot_rewards(rewards, max_rewards)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "execution": {}
   },
   "source": [
    "Pretty nice! Let us also observe the confusion matrix. Indeed, it might reveal the weaknesses associated with particular colors. We will increase the number of evaluation trials to obtain more statistically accurate results."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "###  Make sure you execute this cell to observe the plot!\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "cellView": "form",
    "execution": {},
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "# @title Make sure you execute this cell to observe the plot!\n",
    "set_seed(42)\n",
    "\n",
    "rewards, max_rewards = evaluate_agent(env, agent, num_evaluation_trials = 5000)\n",
    "plot_confusion_matrix(rewards, max_rewards)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "execution": {}
   },
   "source": [
    "No specific patterns here; the only notable observation (which is also expected) is that whenever colors are close in their rewards, the agent makes more mistakes with those.\n",
    "\n",
    "Notice that the blue color is missing, as it is indeed excluded from the first mode. Let us evaluate the agent in the second mode."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "###  Make sure you execute this cell to observe the plot!\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "cellView": "form",
    "execution": {},
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "# @title Make sure you execute this cell to observe the plot!\n",
    "set_seed(42)\n",
    "\n",
    "rewards, max_rewards = evaluate_agent(env, agent, mode = 2)\n",
    "plot_rewards(rewards, max_rewards)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "execution": {}
   },
   "source": [
    "Let's check the confusion matrix. We can see that the green color is chosen often when the blue one provides a higher reward (which the agent doesn't know yet)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "###  Make sure you execute this cell to observe the plot!\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "cellView": "form",
    "execution": {},
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "# @title Make sure you execute this cell to observe the plot!\n",
    "set_seed(42)\n",
    "\n",
    "rewards, max_rewards = evaluate_agent(env, agent, mode = 2, num_evaluation_trials = 5000)\n",
    "plot_confusion_matrix(rewards, max_rewards, mode = 2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "execution": {}
   },
   "source": [
    "As expected, the agent doesn't know perfectly how to handle a new color.\n",
    "\n",
    "Let's continue training the same agent in the second mode and see if we can improve this situation. Again, you are provided with a pretrained agent."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "###  Make sure you execute this cell to observe the plot!\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "cellView": "form",
    "execution": {},
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "# @title Make sure you execute this cell to observe the plot!\n",
    "set_seed(42)\n",
    "\n",
    "#load agent\n",
    "agent = torch.load(\"SecondModeAgent.pt\")\n",
    "\n",
    "##UNCOMMENT TO TRAIN\n",
    "\n",
    "# env = ChangingEnv()\n",
    "# optimizer_func = optim.RMSprop\n",
    "# train_agent(env, agent, optimizer_func, mode = 2)\n",
    "\n",
    "##UNCOMMENT TO TRAIN\n",
    "\n",
    "rewards, max_rewards = evaluate_agent(env, agent, mode = 2)\n",
    "plot_rewards(rewards, max_rewards)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "###  Make sure you execute this cell to observe the plot!\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "cellView": "form",
    "execution": {},
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "# @title Make sure you execute this cell to observe the plot!\n",
    "set_seed(42)\n",
    "\n",
    "rewards, max_rewards = evaluate_agent(env, agent, mode = 2, num_evaluation_trials = 5000)\n",
    "plot_confusion_matrix(rewards, max_rewards, mode = 2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "execution": {}
   },
   "source": [
    "Awesome! The agent has improved its ability to perform in the second mode. But what about the first one? Did the agent forget the previously seen colors?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "###  Make sure you execute this cell to observe the plot!\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "cellView": "form",
    "execution": {},
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "# @title Make sure you execute this cell to observe the plot!\n",
    "set_seed(42)\n",
    "\n",
    "rewards, max_rewards = evaluate_agent(env, agent, mode = 1)\n",
    "plot_rewards(rewards, max_rewards)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "###  Make sure you execute this cell to observe the plot!\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "cellView": "form",
    "execution": {},
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "# @title Make sure you execute this cell to observe the plot!\n",
    "set_seed(42)\n",
    "\n",
    "rewards, max_rewards = evaluate_agent(env, agent, mode = 1, num_evaluation_trials = 5000)\n",
    "plot_confusion_matrix(rewards, max_rewards)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "execution": {}
   },
   "source": [
    "Oops! The introduction of the blue color in the second mode disrupted the learned relationships between red and yellow (since we didn't include yellow in the second mode). What should we do? In the next section, you will explore a bio-inspired mechanism that allows for correcting this behavior!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "###  Submit your feedback\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "cellView": "form",
    "execution": {},
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "# @title Submit your feedback\n",
    "content_review(f\"{feedback_prefix}_a2c_agent_in_changing_enviornment\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "execution": {}
   },
   "source": [
    "---\n",
    "\n",
    "# Section 3: Replay Buffer\n",
    "\n",
    "*Estimated timing to here from start of tutorial: 25 minutes*\n",
    "\n",
    "This section discusses the underlying biological reasoning behind the replay buffer and proposes its code implementation."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "execution": {}
   },
   "source": [
    "## Coding Exercise 2: Experience Again\n",
    "\n",
    "A replay buffer is a mechanism that allows an animal to remember certain experiences within an environment, which can be replayed in its mind later. This can be seen as akin to joint training, as it lets information from a past environment impact current learning.\n",
    "\n",
    "Each of the gradient steps in the first mode is going to be an \"experience\" we are going to save, and we will play artificially (train) during training in the second mode. For that, before going to the coding part, let us take a look at the training function defined earlier. Which variables do you think we need to preserve in the proposed auxiliary storage that will allow the agent to implement the replay?\n",
    "\n",
    "The procedure for retrieving the past experience is as follows: for each gradient step in the new mode, there is going to be one gradient step from a remembered experience from the previous mode.\n",
    "\n",
    "In this exercise, you need to complete the `ReplayBuffer` class, which will help you remember information about the training experience. Observe that `train_agent` is redefined and slightly modified so it accepts `ReplayBuffer` instance as input."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "execution": {}
   },
   "source": [
    "```python\n",
    "class ReplayBuffer():\n",
    "    def __init__(self, max_experience = 250, num_trials = 100):\n",
    "        \"\"\"Initialize replay buffer.\n",
    "        Notice that when replay buffer is full of experience and new one should be remembered, it replaces existing ones, starting\n",
    "        from the oldest.\n",
    "\n",
    "        Inputs:\n",
    "        - max_experience (int, default = 250): the maximum number of experience (gradient steps) which can be stored.\n",
    "        - num_trials (int, default = 100): number of times the agent is exposed to the environment per gradient step to be trained.\n",
    "        \"\"\"\n",
    "        self.max_experience = max_experience\n",
    "\n",
    "        #variable which fully describe experience\n",
    "        self.losses = [0 for _ in range(self.max_experience)]\n",
    "\n",
    "        #number of memory cell to point to (write or overwrite experience)\n",
    "        self.writing_pointer = 0\n",
    "        self.reading_pointer = 0\n",
    "\n",
    "        #to keep track how many experience there were\n",
    "        self.num_experience = 0\n",
    "\n",
    "    def write_experience(self, loss):\n",
    "        \"\"\"Write new experience.\"\"\"\n",
    "        ###################################################################\n",
    "        ## Fill out the following then remove\n",
    "        raise NotImplementedError(\"Student exercise: complete retrieval and storing procedure for replay buffer.\")\n",
    "        ###################################################################\n",
    "        self.losses[...] = ...\n",
    "\n",
    "        #so that pointer is in range of max_experience and will point to the older experience while full\n",
    "        self.writing_pointer = (self.writing_pointer + 1) % self.max_experience\n",
    "        self.num_experience += 1\n",
    "\n",
    "    def read_experience(self):\n",
    "        \"\"\"Read existing experience.\"\"\"\n",
    "        loss = self.losses[...]\n",
    "\n",
    "        #so that pointer is in range of self.max_experience and will point to the older experience while full\n",
    "        self.reading_pointer = (self.reading_pointer + 1) % min(self.max_experience, self.num_experience)\n",
    "        return loss\n",
    "\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {}
   },
   "outputs": [],
   "source": [
    "# to_remove solution\n",
    "\n",
    "class ReplayBuffer():\n",
    "    def __init__(self, max_experience = 250, num_trials = 100):\n",
    "        \"\"\"Initialize replay buffer.\n",
    "        Notice that when replay buffer is full of experience and new one should be remembered, it replaces existing ones, starting\n",
    "        from the oldest.\n",
    "\n",
    "        Inputs:\n",
    "        - max_experience (int, default = 250): the maximum number of experience (gradient steps) which can be stored.\n",
    "        - num_trials (int, default = 100): number of times the agent is exposed to the environment per gradient step to be trained.\n",
    "        \"\"\"\n",
    "        self.max_experience = max_experience\n",
    "\n",
    "        #variable which fully describe experience\n",
    "        self.losses = [0 for _ in range(self.max_experience)]\n",
    "\n",
    "        #number of memory cell to point to (write or overwrite experience)\n",
    "        self.writing_pointer = 0\n",
    "        self.reading_pointer = 0\n",
    "\n",
    "        #to keep track how many experience there were\n",
    "        self.num_experience = 0\n",
    "\n",
    "    def write_experience(self, loss):\n",
    "        \"\"\"Write new experience.\"\"\"\n",
    "        self.losses[self.writing_pointer] = loss\n",
    "\n",
    "        #so that pointer is in range of max_experience and will point to the older experience while full\n",
    "        self.writing_pointer = (self.writing_pointer + 1) % self.max_experience\n",
    "        self.num_experience += 1\n",
    "\n",
    "    def read_experience(self):\n",
    "        \"\"\"Read existing experience.\"\"\"\n",
    "        loss = self.losses[self.reading_pointer]\n",
    "\n",
    "        #so that pointer is in range of self.max_experience and will point to the older experience while full\n",
    "        self.reading_pointer = (self.reading_pointer + 1) % min(self.max_experience, self.num_experience)\n",
    "        return loss"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "###  Test your implementation of ReplayBuffer!\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "cellView": "form",
    "execution": {},
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "# @title Test your implementation of ReplayBuffer!\n",
    "\n",
    "replay = ReplayBuffer()\n",
    "loss = 5\n",
    "replay.write_experience(loss)\n",
    "if (replay.read_experience() - loss < 1e-2):\n",
    "    print(\"Your implementation is correct!\")\n",
    "else:\n",
    "    print(\"Something went wrong, please try again!\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {}
   },
   "outputs": [],
   "source": [
    "def train_agent_with_replay(env, agent, optimizer_func, replay, mode=1, training_mode=\"write\", num_gradient_steps=1000, num_trials=100):\n",
    "    \"\"\"Training for agent in changing colorful environment.\n",
    "    Observe that training happens for one particular mode.\n",
    "\n",
    "    Inputs:\n",
    "    - env (ChangingEnv): environment.\n",
    "    - agent (ActorCritic): particular instance of Actor Critic agent to train.\n",
    "    - optimizer_func (torch.optim.Optimizer): optimizer to use for training.\n",
    "    - replay (ReplayBuffer): replay buffer which is used during training.\n",
    "    - mode (int, default = 1): mode of the environment.\n",
    "    - training_mode (str, default = \"write\"): training mode with replay buffer (\"write\", \"read\").\n",
    "    - num_gradient_steps (int, default = 1000): number of gradient steps to perform.\n",
    "    - num_trials (int, default = 100): number of times the agent is exposed to the environment per gradient step to be trained.\n",
    "    \"\"\"\n",
    "    # Reset environment\n",
    "    state = env.reset(mode=mode)\n",
    "\n",
    "    # Define optimizer\n",
    "    optimizer = optimizer_func(agent.parameters(), agent.learning_rate, eps=1e-5)\n",
    "\n",
    "    # Initialize TQDM progress bar\n",
    "    with tqdm(total=num_gradient_steps) as pbar:\n",
    "        for index in range(num_gradient_steps):\n",
    "            # For storing variables for training\n",
    "            log_probs = []\n",
    "            values = []\n",
    "            rewards = []\n",
    "            entropy_term = torch.tensor(0.)\n",
    "\n",
    "            # Start conditions\n",
    "            preceding_reward = torch.Tensor([0])\n",
    "            preceding_action = torch.Tensor([0, 0])\n",
    "\n",
    "            for trial in range(num_trials):\n",
    "                # State + reward + one-hot encoding of action; notice that we normalize state before pass to agent!\n",
    "                full_state = torch.cat((torch.from_numpy(state.flatten() / 255).float(), preceding_reward, preceding_action), dim=0)\n",
    "                value, policy_logits = agent(full_state)\n",
    "                value = value.squeeze(0)\n",
    "\n",
    "                # Sample action from policy\n",
    "                dist = torch.distributions.Categorical(logits=policy_logits.squeeze(0))\n",
    "                action = dist.sample()\n",
    "\n",
    "                # Perform action to get reward and new state\n",
    "                new_state, reward, _ = env.step(action)\n",
    "\n",
    "                # We normalize reward too\n",
    "                reward /= 4\n",
    "\n",
    "                # Update preceding variables\n",
    "                preceding_reward = torch.Tensor([reward])\n",
    "                preceding_action = F.one_hot(action, num_classes=2).float()\n",
    "                state = new_state\n",
    "\n",
    "                # For training\n",
    "                log_prob = dist.log_prob(action)\n",
    "                entropy = dist.entropy()\n",
    "                rewards.append(reward)\n",
    "                values.append(value)\n",
    "                log_probs.append(log_prob)\n",
    "                entropy_term += entropy\n",
    "\n",
    "            # Calculating loss\n",
    "            Qval = 0\n",
    "            Qvals = torch.zeros(len(rewards))\n",
    "            for t in reversed(range(len(rewards))):\n",
    "                Qval = rewards[t] + agent.discount_factor * Qval\n",
    "                Qvals[t] = Qval\n",
    "            values = torch.stack(values)\n",
    "            log_probs = torch.stack(log_probs)\n",
    "            advantage = Qvals - values\n",
    "            actor_loss = (-log_probs * advantage.detach()).mean()\n",
    "            critic_loss = advantage.pow(2).mean()\n",
    "            entropy_term = entropy_term / num_trials\n",
    "\n",
    "            # Loss incorporates actor/critic terms + entropy\n",
    "            loss = actor_loss + agent.state_value_estimate_cost * critic_loss - agent.entropy_cost * entropy_term\n",
    "\n",
    "            optimizer.zero_grad()\n",
    "            loss.backward(retain_graph=True)\n",
    "            optimizer.step()\n",
    "\n",
    "            # Write this training example into memory\n",
    "            if training_mode == \"write\":\n",
    "                replay.write_experience(loss)\n",
    "\n",
    "            # Retrieve previous experience\n",
    "            if training_mode == \"read\":\n",
    "                replay_loss = replay.read_experience()\n",
    "                optimizer.zero_grad()\n",
    "                replay_loss.backward(retain_graph=True)\n",
    "                optimizer.step()\n",
    "\n",
    "            # Update progress bar\n",
    "            pbar.update(1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "execution": {}
   },
   "source": [
    "First, we are going to train the new agent in the first mode using the writing mode of the replay buffer. Then, during the training in the second mode, we will incorporate reading from this replay buffer and observe whether it impacts the agent's performance.\n",
    "\n",
    "The training time will take around 3 minutes."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "###  Make sure you execute this cell to observe the plot!\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "cellView": "form",
    "execution": {},
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "# @title Make sure you execute this cell to observe the plot!\n",
    "\n",
    "set_seed(42)\n",
    "\n",
    "#define environment\n",
    "env = ChangingEnv()\n",
    "replay = ReplayBufferSolution()\n",
    "\n",
    "#define agent and optimizer\n",
    "agent = ActorCritic(hidden_size = 100)\n",
    "optimizer_func = optim.RMSprop\n",
    "\n",
    "#train agent\n",
    "train_agent_with_replay(env, agent, optimizer_func, replay)\n",
    "\n",
    "rewards, max_rewards = evaluate_agent(env, agent, num_evaluation_trials = 5000)\n",
    "plot_confusion_matrix(rewards, max_rewards)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "execution": {}
   },
   "source": [
    "Great! We've trained the agent in the first mode and saved the experience in the replay buffer. Now, let us change the mode to \"read\" and train the agent in the second mode while replaying the saved experience with each gradient step of the new one. The observed plot is the confusion matrix for the second mode."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {}
   },
   "outputs": [],
   "source": [
    "set_seed(42)\n",
    "\n",
    "train_agent_with_replay(env, agent, optimizer_func, replay, mode = 2, training_mode = \"read\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "###  Make sure you execute this cell to observe the plot!\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "cellView": "form",
    "execution": {},
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "# @title Make sure you execute this cell to observe the plot!\n",
    "set_seed(42)\n",
    "\n",
    "rewards, max_rewards = evaluate_agent(env, agent, mode = 2, num_evaluation_trials = 5000)\n",
    "plot_confusion_matrix(rewards, max_rewards, mode = 2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "###  Make sure you execute this cell to observe the plot!\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "cellView": "form",
    "execution": {},
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "# @title Make sure you execute this cell to observe the plot!\n",
    "\n",
    "set_seed(42)\n",
    "\n",
    "rewards, max_rewards = evaluate_agent(env, agent, num_evaluation_trials = 5000)\n",
    "plot_confusion_matrix(rewards, max_rewards)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "execution": {}
   },
   "source": [
    "Perfect match!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "###  Submit your feedback\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "cellView": "form",
    "execution": {},
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "# @title Submit your feedback\n",
    "content_review(f\"{feedback_prefix}_experience_again\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "execution": {}
   },
   "source": [
    "---\n",
    "# Summary\n",
    "\n",
    "*Estimated timing of tutorial: 40 minutes*\n",
    "\n",
    "Here we have learned:\n",
    "\n",
    "- Reinforcement learning also suffers from forgetting after learning a new distribution.\n",
    "- Replay is a biologically-inspired way to learn from memories of past actions and rewards, thus preventing forgetting."
   ]
  }
 ],
 "metadata": {
  "colab": {
   "collapsed_sections": [],
   "include_colab_link": true,
   "name": "W2D4_Tutorial5",
   "provenance": [],
   "toc_visible": true
  },
  "kernel": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.19"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}