\documentclass{article}
\usepackage{fullpage}
\usepackage{color}
\usepackage{amsmath}
\usepackage{url}
\usepackage{verbatim}
\usepackage{graphicx}
\usepackage{parskip}
\usepackage{amssymb}
\usepackage{nicefrac}
\usepackage{listings} % For displaying code
\usepackage{algorithm2e} % pseudo-code
% Answers
\def\ans#1{\par\gre{Answer: #1}}
% Colors
\definecolor{blu}{rgb}{0,0,1}
\def\blu#1{{\color{blu}#1}}
\definecolor{gre}{rgb}{0,.5,0}
\def\gre#1{{\color{gre}#1}}
\definecolor{red}{rgb}{1,0,0}
\def\red#1{{\color{red}#1}}
\def\norm#1{\|#1\|}
% Math
\def\R{\mathbb{R}}
\def\argmax{\mathop{\rm arg\,max}}
\newcommand{\argmin}[1]{\mathop{\hbox{argmin}}_{#1}}
\newcommand{\mat}[1]{\begin{bmatrix}#1\end{bmatrix}}
\newcommand{\alignStar}[1]{\begin{align*}#1\end{align*}}
\def\half{\frac 1 2}
\def\cond{\; | \;}
% LaTeX
\newcommand{\fig}[2]{\includegraphics[width=#1\textwidth]{a2f/#2}}
\newcommand{\centerfig}[2]{\begin{center}\includegraphics[width=#1\textwidth]{a2f/#2}\end{center}}
\def\items#1{\begin{itemize}#1\end{itemize}}
\def\enum#1{\begin{enumerate}#1\end{enumerate}}
\begin{document}
\title{CPSC 540 Assignment 2 (due February 7 at midnight)}
\author{}
\date{}
\maketitle
\vspace{-4em}
%\red{We are providing solutions because supervised learning is easier than unsupervised learning, and so we think having solutions available can help you learn. However, the solution file is meant for you alone and we do not give permission to share these solution files with anyone. Both distributing solution files to other people or using solution files provided to you by other people are considered academic misconduct. Please see UBC's policy on this topic if you are not familiar with it:\\
%\url{http://www.calendar.ubc.ca/vancouver/index.cfm?tree=3,54,111,959}\\
%\url{http://www.calendar.ubc.ca/vancouver/index.cfm?tree=3,54,111,960}}
The assignment instructions are the same as for the previous assignment, but for this assignment you can work in groups of 1-3. However, please only hand in one assignment for the group.
\blu{\enum{
\item Name(s):
\item Student ID(s):
}}
\section{Convexity}
\subsection{From Definitions}
\blu{Show that the following functions are convex, by only using one of the definitions of convexity. In other words, don't use the ``operations that preserve convexity" or other convexity results stated in class,, but instead show that the function follows from one of the definitions of convexity we gave in class}:\footnote{That $C^0$ convex functions are below their chords, that $C^1$ convex functions are above their tangents, or that $C^2$ convex functions have a positive semidefinite Hessian.}
\enum{
\item Maximum function: $f(w) = \max_{j \in \{1,2,\dots,d\}}w_j$.
\item Gaussian NLL as a function of the mean: $f(\mu) = \half \sum_{i=1}^n(x^i - \mu)^\top \Sigma^{-1}(x^i - \mu) + \frac n 2 \log|\Sigma|$ (for $\Sigma \succ 0$).
\item Probit regression: $f(w) = \sum_{i=1}^n -\log p(y_i | w, x_i)$ (where the probability is defined as in the last assignment).
}
Hint: Max and absolute value are not differentiable in general, so you cannot use the Hessian for functions involving those operations. For the probit question, you can assume that $PDF(z) \geq - z\cdot \text{CDF}(z)$ (where PDF$(z)$ is the PDF of a standard normal and CDF$(z)$ is the CDF).
\subsection{Using Operations that Preserve Convexity}
\blu{Show that the following functions are convex (you can use results from class and operations that preserve convexity if they help)}:
\enum{
\setcounter{enumi}{3}
\item Poisson regression with $p$-norm regularization: $f(w) = \sum_{i=1}^n\left[ -y^iw^\top x^i + \exp(w^\top x^i)\right] + \lambda\norm{w}_p$.
\item Support vector regression: $f(w) = \sum_{i=1}^N\max\{0, |w^\top x_i - y_i| - \epsilon\} + \frac{\lambda}{2}\norm{w}_2^2$.
\item 3 largest-magnitude elements: $f(w) = \max_{\{ijk | i \neq j, i \neq k, j \neq k\}}\{|w_i| + |w_j| + |w_k|\}$.
}
\subsection{Strict and Strong Convexity}
\blu{Which of the functions 1-6 in the previous subsections are strongly convex?}
\section{Discrete and Gaussian Variables}
\subsection{MLE for General Discrete Distribution}
Consider a density estimation task, where we have two variables ($d=2$) that can each take one of $k$ discrete values. For example, we could have
\[
X = \mat{1 & 3\\4 & 2\\$k$ & 3\\1 & $k-1$}.
\]
The likelihood for example $x^i$ under a general discrete distribution would be
\[
p(x^i_1, x^i_2 \cond \Theta) = \theta_{x_1^i,x_2^i},
\]
where $\theta_{c_1,c_2}$ gives the probability of $x_1$ being in state $c_1$ and $x_2$ being in state $c_2$, for all the $k^2$ combinations of the two variables. In order for this to define a valid probability, we need all elements $\theta_{c_1,c_2}$ to be non-negative and they must sum to one, $\sum_{c_1=1}^k\sum_{c_2=1}^k \theta_{c_1,c_2} = 1$.
\enum{
\item Given $n$ training examples, \blu{derive the MLE for the $k^2$ elements of $\Theta$}.
\item If we had separate parameter $\theta_{c_1}$ and $\theta_{c_2}$ for each variables, a reasonable choice of a prior would be a product of Dirichlet distributions,
\[
p(\theta_{c_1},\theta_{c_2}) \propto \theta_{c_1}^{\alpha_{c_1} - 1}\theta_{c_2}^{\alpha_{c_2} - 1}.
\]
For the general discrete distribution, a prior encoding the same assumptions would be
\[
p(\theta_{c_1,c_2}) \propto \theta_{c_1,c_2}^{\alpha_{c_1} + \alpha_{c_2} - 2}.
\]
\blu{Derive the MAP estimate under this prior} (assuming we use $k^2$ variables to parameterize $\Theta$)
\item We often use discrete distributions as parts of more-complicated distributions (like mixture models and graphical models). In these cases we often need to fit a weighted NLL for the form
\[
f(\Theta) = -\sum_{i=1}^n v_i \log p(x_1^i, x_2^i \cond \Theta),
\]
with $n$ non-negative weights. \blu{What is the MLE for the $k^2$ elements of $\Theta$ under this weighting.}
}
Hint: it may be convenient to write the discrete likelihood for an example $i$ in the form
\[
p(x^i \cond \Theta) = \prod_{c \in [k]^2}\theta_c^{\mathcal{I}[x^i = c]},
\]
where $c$ is a vector containing $(c_1,c_2)$, $[x^i = c]$ evaluates to 1 if all elements are equal, and $[k]^2$ is all ordered pairs $(c_1,c_2)$. You can use a Lagrangian to enforce the sum-to-1 constraint on the log-likelihood, and you may find it convenient to define $n_c = \sum_{i=1}^n \mathcal{I}[x^i = c]$.
\subsection{Gaussian Self-Conjugacy and Posterior Convergence Rate}
Consider $n$ IID samples $x^i$ distributed according to a Gaussian with mean $\mu$ and covariance $\sigma^2 I$,
\[
x^i \sim \mathcal{N}(\mu, \sigma^2 I).
\]
Assume that $\mu$ itself is distributed according to a Gaussian
\[
\mu \sim \mathcal{N}(\mu_0,\Sigma_0),
\]
with mean $\mu_0$ and (positive-definite) covariance $\Sigma_0$. In this setting, the posterior for $\mu$ also follows a Gaussian distribution.\footnote{Because of the prior and posterior coming from the same family, we say that the Gaussian distribution is the ``conjugate prior'' for the Gaussian mean parameter (we'll formally discuss conjugate priors later in the course). Another reason the Gaussian distribution is important is that is the only (non-trivial) continuous distribution that has this ``self-conjugacy'' property.}
\blu{Derive the form of the posterior distribution, $p(\mu\cond X, \sigma^2, \mu_0, \Sigma_0)$.}\\
Hint: the posterior is a product of Gaussian densities.
\subsection{Generative Classifiers with Gaussian Assumption}
Consider the 3-class classification dataset in this image:
\centerfig{.4}{sample}
In this dataset, we have 2 features and each colour represents one of the classes. Note that the classes are highly-structured: the colours each roughly follow a Gausian distribution plus some noisy samples.
Since we have an idea of what the features look like for each class, we might consider classifying inputs $x$ using a \emph{generative classifier}. In particular, we are going to use Bayes rule to write
\[
p(y^i=c\cond x^i,\Theta) = \frac{p(x^i\cond y^i=c, \Theta) \cdot p(y^i=c\cond\Theta)}{p(x^i\cond\Theta)},
\]
where $\Theta$ represents the parameters of our model. To classify a new example $\tilde{x}^i$, generative classifiers would use
\[
\hat{y}^i = \argmax_{y \in \{1,2,\dots,k\}} p(\tilde{x}^i\cond y^i=c,\Theta)p(y^i=c\cond\Theta),
\]
where in our case the total number of classes $k$ is $3$.\footnote{The denominator $p(\tilde{x}^i\cond\Theta)$ is irrelevant to the classification since it is the same for all $y$.}
% and $\theta_c$ is the set of parameters associated class $c$.
Modeling $p(y^i=c\cond\Theta)$ is easy: we can just use a $k$-state categorical distribution,
\[
p(y^i = c \cond \Theta) = \theta_c,
\]
where $\theta_c$ is a single parameter for class $c$. The maximum likelihood estimate of $\theta_c$ is given by $n_c/n$, the number of times we have $y^i = c$ (which we've called $n_c$) divided by the total number of data points $n$.
Modeling $p(x^i \cond y^i =c, \Theta)$ is the hard part: we need to know the \emph{probability of seeing the feature vector $x^i$ given that we are in class $c$}. This corresponds to solving a density estimation problem for each of the $k$ possible classes.
To make the density estimation problem tractable, we'll assume that the distribution of $x^i$ given that $y^i=c$ is given by a $\mathcal{N}(\mu_c,\Sigma_c)$ Gaussian distribution for a class-specific $\mu_c$ and $\Sigma_c$,
\[
p(x^i \cond y^i=c, \Theta) = \frac{1}{(2\pi)^{\frac{d}{2}}|\Sigma_c|^{\half}}\exp\left(-\half (x^i-\mu_c)^T\Sigma_c^{-1}(x^i-\mu_c)\right).
\]
Since we are distinguishing between the probability under $k$ different Gaussians to make our classification, this is called \emph{Gaussian discriminant analysis} (GDA). In the special case where we have a constant $\Sigma_c = \Sigma$ across all classes it is known as \emph{linear discriminant analysis} (LDA) since it leads to a linear classifier between any two classes (while the region of space assigned to each class forms a convex polyhedron as in $k$-means clustering and softmax classification). Another common restriction on the $\Sigma_c$ is that they are diagonal matrices, since this only requires $O(d)$ parameters instead of $O(d^2)$ (corresponding to assuming that the features are independent univariate Gaussians given the class label).
Given a dataset $\mathcal{D}=\{(x^i, y^i)\}_{i=1}^n$, where $x^i\in\R^d$ and $y^i\in\{1,\ldots,k\}$, the maximum likelihood estimate (MLE) for the $\mu_c$ and $\Sigma_c$ in the GDA model is the solution to
\[
\argmax_{\mu_1,\mu_2,\dots,\mu_k,\Sigma_1,\Sigma_2,\dots,\Sigma_k} \prod_{i=1}^n p(x^i \cond y^i, \mu_{y^i},\Sigma_{y^i}).
\]
This means that the negative log-likelihood will be equal to
\alignStar{
- \log p(X\cond y,\Theta) & = -\sum_{i=1}^n \log p(x^i , y^i \cond \mu_{y^i},\Sigma_{y^i})\\
& = \sum_{i=1}^n \frac{1}{2}(x^i - \mu_{y^i})^T\Sigma_{y^i}^{-1}(x^i - \mu_{y^i}) + \half\sum_{i=1}^n \log|\Sigma_{y^i}| + \text{const.}
}
In class we derived the MLE for this model under the assumption that we use full covariance matrices and that each class has its own covariance.
\enum{
\item \blu{Derive the MLE for the GDA model under the assumption of \emph{common diagonal covariance} matrices}, $\Sigma_c = D$ ($d$ parameters). (Each class will have its own mean $\mu_c$.)
\item \blu{Derive the MLE for the GDA model under the assumption of \emph{individual scaled-identity} matrices}, $\Sigma_c = \sigma_c^2 I$ ($k$ parameters).
\item When you run \emph{example\_generative} it loads a variant of the dataset in the figure that has 12 features and 10 classes. This data has been split up into a training and test set, and the code fits a $k$-nearest neighbour classifier to the training set then reports the accuracy on the test data (around $\sim 63\%$ test error). The $k$-nearest neighbour model does poorly here since it doesn't take into account the Gaussian-like structure in feature space for each class label. Write a function \emph{gda} that fits a GDA model to this dataset (using individual full covariance matrices). \blu{Hand in the function and report the test set accuracy}.
\item In this question we would like to replace the Gaussian distribution of the previous problem with the more robust multivariate-t distribution so that it isn't influenced as much by the noisy data.
Unlike the previous case, we don't have a closed-form solution for the parameters. However, if you run \emph{example\_student} it generates random noisy data and fits a multivariate-t model. By using the \emph{studentT} model, write a new function \emph{tda} that implements a generative model that is based on the multivariate-t distribution instead of the Gaussian distribution. \blu{Report the test accuracy with this model.}
}
Hints: you may be able to substantially simplify the notation in the MLE derivations if you use the notation $\sum_{i \in y_c}$ to mean the sum over all values $i$ where $y^i = c$. Similarly, you can use $n_c$ to denote the number of cases where $y_i = c$, so that we have $\sum_{i \in y_c}1 = n_c$. Note that the determinant of a diagonal matrix is the product of the diagonal entries, and the inverse of a diagonal matrix is a diagonal matrix with the reciprocals of the original matrix along the diagonal.
For the implementation you can use the result from class regarding the MLE of a general multivariate Gaussian. At test time for GDA, you may find it more numerically reasonable to compare log probabilities rather than probabilities of different classes, and you may find it helpful to use the \emph{logdet} function to compute the log-determinant in a more numerically-stable way than taking the log of the determinant. (Also, don't forget to center at training and test time.)
For the last question, you may find it helpful to define an empty array that can be filled with $k$ \emph{DensityModel} objects using:
\begin{verbatim}
subModel = Array{DensityModel}(undef,k)
\end{verbatim}
\section{Mixture Models and Expectation Maximization}
\subsection{Gaussian Mixtures Models}
The script \emph{example\_gaussian} fits a Gaussian distribution to a dataset that is not uni-modal, and giving a bad fit. Implement the EM for fitting a mixture of Gaussians to this data. \blu{Hand in your code and the updated plot when using a mixture of 3 Gaussians}.
Hint: you may want to start by implementing the PDF (``predict'') function; you can then use this with the monotonicity of EM to debug your implementation.
It is possible that $\Sigma_k^{t+1}$ may not be positive-definite, if you encounter this the standard fixes are to remove such clusters or use a MAP estimate for $\Sigma$ where you add a small multiple of the identity matrix to the estimate.
\subsection{Poisson Mixture Model}
Consider a density estimation with examples $x^i \in \{\mathbb{Z}_{\geq 0}\}^d$ representing \emph{counts} of $d$ variables. In this setting, a natural way to model an individual variable $x^i_j$ would be with a Poisson distribution,\footnote{Note that the Poisson distribution makes certain assumptions about count data, which may not always be true, so in practice you may want to think about other possible likelihoods.}
\[
p(x^i_j = \alpha \; | \; \lambda_j) = \frac{\lambda_j^\alpha e^{-\lambda_j}}{\alpha!}.
\]
However, if we assume this structure for each variable then the variables would be independent. One way to model dependent count variables would be with a mixture of independent Poisson distributions,
\[
p(x^i| \Theta) = \sum_{c = 1}^k \theta_c \prod_{j=1}^d p(x_j^i \; | \; \lambda_{jc}),
\]
where $\Theta$ contains all the $\theta_c$ and $\lambda_{jc}$ values.
\blu{Derive the EM update for this}.
Hint: most of the work has been done for you in the EM notes on the course webpage.
\section{Very-Short Answer Questions}
Give a short and concise 1-sentence answer to the below questions.
\enum{
\item What is a situation where we can violate the golden rule on our test set, and still have it be a reasonable approximation of test error?
\item Suppose we have a way to generate new IID samples for our problem. If we are fitting a model with SGD, what would be the key advantage of using new IID samples as opposed to sampling from a fixed training set?
\item Is the minimum of a convex function achieved by a unique input value? What about a strictly-convex function or a strongly-convex function?
\item How can we use density estimation for supervised learning?
\item How many parameters does the general discrete distribution have when each feature is a categorical variable that can take up to $k$ values.
\item What is the relationship between the covariance matrix $\Sigma$ and the precision matrix $\Theta$ of a multivariate Gaussian?
\item Suppose we run the graphical LASSO method and it returns a tri-diagonal precision matrix. What would the graph look like?
\item Suppose we have a lot of extreme outliers in our dataset. Why is this less of a problem for a mixture of Gaussians than if we use a single Gaussian?
\item Why do the $\pi_c$ need to be a convex combination in mixture models?
\item What is the relationship between the supervised naive Bayes model and the unsupervised mixture of Bernoullis model?
\item What is the difference between ``missing at random'' and ``missing completely at random''?
\item What is an advantage of the Epanechnikov kernel over the Gaussian kernel.
\item How does parameter tieing allow us to model training examples that have different sizes?
}
\section{Project Proposal}
For the final part of this assignment, you must a \blu{submit a project proposal} for your course project. The proposal should be a maximum of 2 pages (and 1 page or half of a page is ok if you can describe your plan concisely). The proposal should be written for me and the TAs, so you don't need to introduce any ML background but you will need to introduce non-ML topics. The projects must be done in groups of 2-3, and will be submitted separately from the assignment.
There is quite a bit of flexibility in terms of the type of project you do, as I believe there are many ways that people can make valuable contributions to research. However, note that ultimately the project will have three parts:
\enum{
\item A very short paper review summarizing the pros and cons of a particular paper on the topic (due with Assignment 3).
\item A short literature review summarizing at least 10 papers on a particular topic (due with Assignment 4).
\item A final report containing at most 6 pages of text (the actual document can be longer due to figures, tables, references, and proofs) that emphasizes a particular ``contribution" (i.e., what doing the project has added to the world).
}
The reason for this, even though it's strange for some possible projects, is that this is the standard way that results are communicated to the research community.
\blu{The three mains ingredients of the project proposal are:
\begin{enumerate}
\item What problem you are focusing on.
\item What you plan to do.
\item What will be the ``contribution".
\end{enumerate}
}
Also, note that for the course project that negative results (i.e., we tried something that we thought we would work in a particular setting but it didn't work) are acceptable (and often unavoidable).
Here are some standard project ``templates" that you might want to follow:
\items{
\item \textbf{Application bake-off}: you pick a specific application (from your research, personal interests, or maybe from Kaggle) or a small number of related applications, and try out a bunch of techniques (e.g., random forests vs. logistic regression vs. generative models). In this case, the contribution would be showing that some methods work better than others for this specific application (or your contribution could be that everything works equally well/badly).
\item \textbf{New application}: you pick an application where where people aren't using ML, and you test out whether ML methods are effective for the task. In this case, the contribution would be knowing whether ML is suitable for the task.
\item \textbf{Scaling up}: you pick a specific machine learning technique, and you try to figure out how to make it run faster or on larger datasets (for example, how do we apply kernel methods when $n$ is very large). In this case, the contribution would be the new technique and an evaluation of its performance, or could be a comparison of different ways to address the problem.
\item \textbf{Improving performance}: you pick a specific machine learning technique, and try to extend it in some way to improve its performance (for example, how can we efficiently use non-linearity within graphical models). In this case, the contribution would be the new technique and an evaluation of its performance.
\item \textbf{Generalization to new setting}: you pick a specific machine learning technique, and try to extend it to a new setting (for example, making a graphical-model version of random forests). In this case, the contribution would be the new technique and an evaluation of its performance, or could be a comparison of different ways to address the problem.
\item \textbf{Perspective paper}: you pick a specific topic in ML, read a larger number of papers on the topic, then write a report summarizing what has been done on the topic and what are the most promising directions of future work. In this case, the contribution would be your summary of the relationships between the existing works, and your insights about where the field is going.
\item \textbf{Coding project}: you pick a specific method or set of methods (like independent component analysis), and build an implementation of them. In this case, the contribution could be the implementation itself or a comparison of different ways to solve the problem.
\item \textbf{Theory}: you pick a theoretical topic (like the variance of cross-validation or the convergence of proximal stochastic gradient in the non-convex setting), read what has been done about it, and try to prove a new result (usually by relaxing existing assumptions or adding new assumptions). The contribution could be a new analysis of an existing method, or why some approaches to analyzing the method will not work.
}
The above are just suggestions, and many projects will mix several of these templates together, but if you are having trouble getting going then it's best to stick with one of the above templates. Also note that the above includes topics not covered in the course (like random forests), so there is flexibility in the topic, but the topic should be closely-related to ML.
\blu{This question is mandatory but will not be formally marked: it's just a sanity check that you have at least one project idea that fits within the scope of 540 course project, and it's an excuse for you to allocate some time to thinking about the project.} Also, there is flexibility in the choice of project topics even after the proposal: if you want to explore different topics you can ultimately choose to do a project that is unrelated to the one in your proposal/paper-review/literature-review, although it will likely be easier to do all 4 parts on the same topic.
\end{document}