vsoch

Jul 03

Principal Component Analysis (PCA)

What is Principal Component Analysis?

**Principal Components Analysis (PCA) **is an algorithm most commonly used for dimensionality reduction that finds a one dimensional subspace that best approximates a dataset. When you have data with many (possibly correlated) features, PCA finds the “principal component” that gets at the direction (think of a vector pointing in some direction) that the data approximately lies. This is the “strongest” direction of the data, meaning that the vector is the major axis of variation, and when we project the data onto this vector, we maximize the variance of our data. What does it mean to project data onto a vector? It means that we would draw our vector, and map each data point to the closest spot on the vector where the two meet at a 90 degree angle (they are orthogonal), as illustrated in the graphic below:

Maximizing variance means that, when we look at the new points (the black dots), they have maximal spread. You can imagine if we had drawn a line perpendicular to the one in the picture, the original points (they gray X’s) would map onto it and be incredibly close together. This wouldn’t capture the true direction of the data as well as our original line, above. We want to maintain as much variance as we can because we are likely using PCA as a pre-processing step to some machine learning algorithm, and if we were to take away the nice spread between points, we would be less successful to find differences between classes to build a classifier.

How does it work?

Let’s say that we have a dataset with m observations and n features to describe clowns. Each observation, m, is one clown, and each features, n, is some personality metric, a phenotype, or a performance variable. The first step is to make sure that all of our features are within the same space so that we can compare them. For example, if the mean of a behavioral metric is 5 with variance .4, and the mean of a phenotype variable is 230 with variance 23, if we were to plot these variables in the same space, they would be very far apart. So the first step is to normalize the means and variances all of our features by doing the following:

Step 1: Normalize your Data

In step 1, we calculate the mean by summing the features and dividing by the number of features. In step 2, we subtract this mean from each feature so that we essentially have mean 0. In steps 3 and 4, we re-scale each observation to have unit variance. Now if we were to plot a phenotype variable and a behavioral metric in the same space, they would be comparable.

Step 2: Maximize the Variance of the Projections

If you look at the picture above, and if you have taken linear algebra (I haven’t, but I picked this up somewhere along the way), you will notice that the perpendicular line from the point to the projection is represented by xTu (x transpose u). This means that, for any point (X) in the picture above, its projection onto the vector is xTu distance from the origin. So in this problem, the unit vector (lets call this u) is the unknown that we want to find. To maximize the variance for all projections, we would want to maximize the following:

The first term is summing the distance from the origin of each point, and we are squaring each distance because we don’t want negative distances to cancel out positive ones. This can be re-written as the second term by by multiplying xTu by xTu. We then can factor the unit vectors out of the sum, and we are left with the principal eigenvector of the data, which is also just the covariance matrix (the stuff in the parentheses). This is really neat, because it means that we can find this “maximum direction” just by solving for this principal eigenvector. I think that can be done with Lagrange multiplier, but as long as you understand it, you can use something like the eig function in Matlab to find any of the k top eigenvectors.

Once you have solved for these eigenvectors (u), you would then want to represent your data in this space, meaning projected onto these eigenvectors. This y(i) would be the “dimensionally reduced” representation of your data, in k dimensions instead of the higher, original dimensionality:

And it is notable that these new k vectors are what we refer to as the first k principal components.

Summary

With PCA, we can represent our dataset of clowns (originally described by many features) in a lower dimensional space. You can think of this as simple dimensionality reduction, or noise reduction, or even compression. Although we have a hard time ascribing a label to what the components represent or mean, in many cases this processing step leads to better performance of a classifier because it better represents some intrinsic “clowniness” that was hidden in the noise. It is also desirable because the dimensionality reduction lets us better plot our dataset in a 2D space. As humans that like to see things to understand them, this is a win :O)
·
Jul 02

Meyer Watershed Segmentation

Imagine that the pixel intensities of an image form a landscape, with lower values (closer to zero, corresponding to black) forming valleys, and higher values (closer to 1, white) forming mountains. Our image isn’t an image, in fact, it is a beautiful landscape! Meyer Watershed segmentation is a segmentation algorithm that treats our image like a landscape, and segments it by finding the very bottom of the valleys (the basins, or the watersheds - has anyone actually ever seen a real watershed?) and filling them up. When we fill up these basins without allowing for any overlap, each unique basin becomes an image region, and so we have segmented the image.

How does the algorithm work?

Here we have our image, which remember, we are thinking about like a landscape. This image is a good choice because it looks like an actual landscape:

Most images that have color aren’t going to work well off the bat, because there are three color channels. What we want is to have the image in grayscale (with one color channel) and then we we would want to look at the gradient, or the change in the values. The gradient (derivative) is going to tell us the “steepness” of our landscape. We will get something that is high value along the edges (white), and low values where it is black. For example, here is the gradient to describe the image above:

As stated previously, we are interested in finding the local minimum values of this image, which correspond to the bottom of the basins. The darker values (closer to 0) correspond to the minima, and so from looking at this image, we would guess that there is a local minima within each white circle, as well as one or two around the outside. We are then going to fill each basin until we reach the top of the basin. Specifically, here is the algorithm:

1. Choose starting markers (bin bottoms) 2. Insert neighboring pixels of each marked area into a priority queue 3. The pixel with the highest priority level is extracted from the priority queue. If the neighbors of the extracted pixel that have already been labeled all have the same label, then the pixel is labeled with their label. All non-marked neighbors that are not yet in the priority queue are put into the priority queue. 4. Redo step 3 until the priority queue is empty

Non labeled pixels are water shed lines.

Generally, watershed tends to over-segment things. However, it’s a good starting point for segmentation, because you can over-segment an image to create “superpixels” and then use another algorithm to intelligently group the superpixels into larger regions.

As a fun example. here is an example of a Meyer Watershed segmentation from a class section that I was TA for. Each colored blob corresponds to a different region. I didn’t work very hard to optimize anything, however you will generally observe that images with more well defined edges (the basin edges corresponding to values closer to 1 in the gradient image) are well defined (such as the stripes on a shirt, or glasses):

Matlab has a watershed function, although for the images above I used a different script.
·
Jul 02

Median Filtering

Median filtering is another filtering technique similar to convolution in that we define a small window ( an nxn square smaller than our image) to move across it, and based on the pixels that fall within that window, we define a new pixel value in the filtered image. For example, here we have a window of size 3X3 that we will move across our image:

In median filtering, it’s a super simple algorithm. We set the new image pixel to be the median value of the pixels that fall within our yellow window. What does this mean for what the filter does? It means that if we have outlier pixels (extremely high or low values that will make the image look speckled, often called salt and pepper noise), we get rid of those nicely. For example. here is a before and after shot of an image with salt and pepper noise that has been median filtered:

This is a huge improvement! Hooray for simple algorithms that work well.

·
Jul 02

Convolution

I remember in my first image processing course, the instructors threw around the term convolution like flying latkes in this video. It definitely happens a lot when you are new to a field that people (professors) make assumptions about the jargon of the students. To counter that phenomena, I am going to tell you about convolution in this post.

What is Convolution?

Convolution is really just a fancy term for filtering an image. The basic idea is that we choose some filter (we call this our kernel), a matrix of numbers smaller than the actual image, and we move through each pixel in the image and “apply” that filter. As we apply the filter to each pixel, we output new pixel values that encompass an entirely new (filtered) image. What does “applying” the filter mean? To help illustrate this, I’ve stolen a few slides from one of my course instructors.

Here we have our kernel (the red box) and our input image (left) and output image (right)

We place the kernel over the image, with the center of the kernel on the first pixel. To calculate the value for our new image, we multiply each pixel in the kernel by the value underneath it, and sum all of those values. The summed value goes in the same box of our new image (green).

We then move the kernel across the image, and calculate our new values one at a time until we have gone across the entire image. This process is what we call convolution.

As you can see from the illustration above, when we are at the top, left, right, and bottom edges, our kernel is going to overlap into empty space. For this reason, you will commonly need to pad your image with zeros. You can do this manually in Matlab, or use the function padarray.

As you might guess, the choice of your kernel (the filter, or matrix of numbers you move across the image) has implications for the kind of filter that results. If you apply a Gaussian kernel (meaning that the matrix of numbers follows a Gaussian distribution) this will work to blur the image. Matlab, in all of its awesomeness, has a function called fspecial that will let you create customized 2D filters, including Gaussian.

What is the algorithm for convolution?

It is pretty simple. If you walk through the pseudo-code below, it is basically the process that I just explained:

·
Jul 02

Elastic Net: Flexible Regularization for Linear Regression

As a reminder, a regularization technique applied to linear regression helps us to select the most relevant features, x, to predict an outcome y. For now, see my post about LASSO for more details about regularization. Both LASSO and elastic net, broadly, are good for cases when you have lots of features, and you want to set a lot of their coefficients to zero when building the model. How do they compare? I’m not the best to ask about this, but I read that when you have lots of correlated features, elastic net can outperform LASSO.

What is elastic net?

Elastic net is what I am calling a “flexible” regularization technique because it lets you play around with a second fudge parameter, alpha, to fluctuate between ridge regression and linear regression with LASSO that uses the L1 norm. If ridge regression and lasso regularization smooshed together and had a baby, it would be elastic net. How does this work? Here is the optimization problem for elastic net:

where:

As before, you will recognize the first term in the first equation as the least squares optimization technique, and the second is an additional penalty term that is defined in the second equation. This penalty term aims to make the resulting “chosen” features sparse, meaning that we set as many of the coefficients (betas) to zero as possible. The cool thing, however, is that we have this new term, alpha, that let’s the elastic net fluctuate between LASSO (when alpha = 1) and ridge regression (when alpha = 0). And yes, alpha is defined in the range of [0,1].

How does elastic net relate to LASSO and Ridge Regression?

Take a look at the equation above, and try plugging in 1 for alpha. The second penalty term gets a multiplier of 0, and so it largely goes away, leaving the exact same equation with the L1 norm that is, by definition, the LASSO regularization technique:

How cool is that?! Now try setting alpha = 0. We get a standard ridge regression, which (I believe) is exactly the same thing, but using the L2 instead of the L1 norm. In general, Lasso outperforms ridge regression when there are a good set of sizable effect (over many small effects). The good news is that you largely don’t need to decide - just optimize elastic net.

How do I use elastic net?

Now that we have two fudge factors / parameters to set (Lambda AND alpha), you might be panicking about the best way to go about figuring out what to do. There are likely more elegant methods, but in the past what I’ve done is a grid search for alpha within a cross validation to find Lambda. This means that, for each value of Lambda that I am testing, I calculate the error of the model for alpha from 0 to 1 in increments of .01. I then choose the value of alpha with the lowest error for that particular Lambda. Generally what happens is that you will get a sense of a neighborhood of alpha that works best for your data (in my experience with my data the values tend to be smaller, erring more toward ridge regression), and then you don’t have to run something so computationally intensive every time.

In terms of packages that you can use for elastic net, again you would want the glmnet package in R, and actually, elastic net comes via the lasso function in Matlab as well, you just specify the alpha parameter to not be equal to 1.
·
Jul 02

LASSO: Regularization for Linear Regression

From the mind of the master, we can define lasso as follows:

“The Lasso is a shrinkage and selection method for linear regression. It minimizes the usual sum of squared errors, with a bound on the sum of the absolute values of the coefficients. It has connections to soft-thresholding of wavelet coefficients, forward stagewise regression, and boosting methods.”

Let’s talk about what this actually means.

What is linear regression, and why would I use it?

**Linear regression in the most basic sense is finding a linear equation (a line) that best goes through a set of points. When I say “best goes through a set of points,” I mean that the line is positioned in the way that it is closest to every point. We find this line by minimizing the sum of squared distances to it. For a detailed review of linear regression, see this earlier post.

What problems might I face with standard linear regression?

Linear regression aims to minimize this sum of squared distances for every feature x to predict our outcome y. What if we have features that aren’t relevant to what we are trying to predict? This is a case when we need the help of a regularization technique, which does the following:
- reduces the number of predictors in the model
- reduces redundant predictors
- identifies the most important predictors
Do these things sound familiar? This is essentially **feature selection, **however unlike the standard approaches that I discussed previously, these methods are built into the model. After we use a regularization technique, we are likely to result with a better model, meaning that our estimates for the parameters yield lower errors than if using least squares alone.

How does LASSO work?

**Like with ridge regression, we are going to introduce a penalty term to our optimization function that will constrain the size of the coefficients. We are going to set as many of the coefficients to zero as we can, and this is why it is called “the lasso” (think of an old Western movie with a guy on a horse with a lasso trying to capture the best chickens out of a bunch. Same thing right?). In the equation below, you will recognize the first chunk as the least squares minimization algorithm. The second chunk is the penalty term:

What do all these variables mean?
- N is the number of observations (training data)
- yi is our response variable for observation i
- xi is the feature vector (of length p) at observation i.
- λ is a nonnegative regularization parameter, “a fudge factor,” also known as a value of Lambda (more on this later)
- The β’s are the coefficients we aim to optimize, all scalars. You will notice that we are taking the absolute value of these betas. This is called the L1 norm, or the Manhattan distance. Whenever you hear someone talk about L1/L2/L-something norms, they are just talking about different ways of calculating distances. From looking at the equation above, we know that the LASSO uses the L1 norm.
Lambda is a fudge factor… what?

Lambda is a variable that you must choose that generally controls the number of features to “zero out.” As λ increases, the number of nonzero components of β decreases, so choosing a larger value of Lambda will result in fewer features because we are penalizing more. It’s common to choose an optimal value of Lambda by way of cross validation - meaning that you run your model with several different values, and choose the one that works best.

How do I use LASSO?

Thanks to the amazing statistics department at Stanford, LASSO is available in both Matlab and R, specifically for R in a package called glmnet that also does elastic net and logistic regression. These packages are nice because they will allow you to easily create cross validated models. I have great appreciation for this, because when I used the original glmnet package for Matlab a few years back, I had to do my own cross validation.

**Summary

In a nutshell, with lasso we can do both variable selection and shrinkage. Using the L1 norms is pretty computationally efficient because the solution is convex, and it brings to light something that I read about called “The Bet on Sparsity Principle*:”

“Assume that the underlying truth is sparse and use an ℓ1 penalty to try to recover it. If you’re right, you will do well. If you’re wrong— the underlying truth is not sparse—, then no method can do well.”

[Bickel, Buhlmann, Candes, Donoho, Johnstone,Yu …*Elements of Statistical Learning]

In other words, it’s probably more likely that a small set of variables best explain the outcome. And if that isn’t the case, you would be out of luck with developing a good method to predict your outcome from the features anyway!
·
Jul 01

Expectation Maximization (EM) Algorithm

Let’s talk about jelly beans. Specifically, imagine that you took a bag of every single brand of jelly bean in the world (meaning different colors, sizes, and ingredients) and dumped them into a bin. Your bin now has some k classes of jelly bean, and being very proud of your work, you seal the bin in a space capsule along with the k empty bags and launch it into orbit.

Two hundred years later, an alien species discovers your space capsule, and postulates that the beans originated in the k bags. Being an inquisitive species, they want to re-organize the beans back into the k bags, however there is one problem. The aliens are unable to taste, and so they cannot say anything about the flavors of the beans, and they are color blind, so the beans appear to them to be different shades of grey.

Here we have an unsupervised machine learning problem (due to not knowing any labels) for which we have observed features, x (size, texture, and anything the aliens can measure), and unobserved features, z (color and taste). We call these unobserved features, z, latent or hidden variables. If we could observe this data, it would be an easy problem to solve, because we would use maximum likelihood to solve for θ that maximize the log probability logP(x,z;θ). However, since we don’t know z, we can’t do this. We assume that our unobserved features, z, can be modeled with multinomial distributions, and we want to build a probabilistic model that can still do parameter selection in the presence of this missing data. This might sound like Gaussian discriminant analysis, however the main difference is that we are using a different covariance matrix for each Gaussian, and we are modeling not with Bernoulli distributions, but with multinomial. More on this later. To build this probabilistic model we need expectation maximization.

What is Expectation Maximization?

Expectation Maximization is a two step unsupervised machine learning algorithm that iteratively uses the data to makes guesses about the latent variables, z, and then uses these guesses to estimate the parameters. We jump back and forth between guessing z and re-estimating the parameters until convergence. Specifically, we:

Repeat until convergence {
1. compute probabilities for each possible completion of the missing data using the current parameters, θ. This is the E-step
2. We use the probabilities above to create a weighted training set consisting of all possible completions of the data.
3. We then use maximum likelihood estimation to get new parameter estimates. This is the M-step.
If you think about it, this is like maximum likelihood] or MAP estimation in the case of missing information.

How does it work?

To explain the above in math, let’s pretend for a second that we can observe the missing data, z. We would model the log likelihood of our data given the class:

We would want to find a formula for each of the optimal parameters (the parameters, theta, mean, and covariance matrix) by taking the derivative with respect to each one. As a reminder, the derivative of a function gets at the rate of change, so where the derivative is equal to zero, this is where we have a maximum. This would be equivalent to the **M-step:

But do you see why we cannot do this? We don’t know anything about the class distribution, z. Was it a jelly belly distribution, Jolly Rancher, or a Brachs? This is why we have to guess the distribution of z, and the pretend that we are correct to update the parameters. Let’s re-write the above in context of the E and M step:

Until convergence… {

**E-step: **for each i,j, set

this is basically evaluating the density of a Gaussian with a particular mean and covariance at x(i), and the resulting weights that we get represent our current guesses for the values z(i).

**M-step: **update the parameters with the new weights

}

Note that the main difference between these parameters and what we had before (when we knew z) is that instead of an indicator variable, we have the weights. The indicator variables before told us exactly which Gaussian (which jelly bean bag) each data point (jelly bean) originated from. Now, we aren’t so sure, so we have weights that reflect our belief for each jelly bean.

Isn’t this like K-Means Clustering?

A little bit, because the weights represent “soft” cluster assignments as opposed to hard assignments found with k-means. And similarly to k-means, we are subject to local optima, so it’s best to run it a few times.
·
Jun 30

Region Growing

Region Growing is an algorithm for segmentation of regions in an image based on the idea that regions share some property that can be computationally measured. It’s one of those magical algorithms that is ridiculously simply, and powerful because the user can customize it so easily. In what way? We have complete freedom to choose the metric to distinguish our region(s) of interest, when we stop and start growing a region, and how we initialize the algorithm with some seed point. For example, the most simple metric that could be used is the image intensity (the pixel value). If we were to read the image below into Matlab:

We could look at the matrix of gray values, and see that the pixels making up Totoro have a value close to 0 (meaning total absence) and his eyes are close to 1 (white). Note that this is assuming that we are reading the image in **greyscale, **which gives images values in the range [0,1]. From this observation alone, we could use pixel intensity to separate Tototo from his background. How do we do that?How does region growing work?We will start by choosing a criteria that defines our region, such as image intensity. We could “hard code” a particular intensity in, (perhaps between 0 and .15 would work well?), however it is more reasonable to write a script that allows a user to select a pixel of interest from the image, and use that intensity to start region growing. In a simple image like this one, you could specify the pixel intensity to be equal to that value. In an image that isn’t so clean cut, you could specify within some range of that value. The important thing is that you have some boolean statement that represents your criteria. We then start at our selected pixel of interest, and look at all of his neighbors to determine if they meet the criteria. If they do, great! We add them to our region, and then we look at their neighbors. We keep looking at neighbors until we run out, and everyone that passed the criteria is included in the final image. More specifically, here is the general pseudo-code for region growing:

The criteria can be anything that you can dream up to measure in the image, and you can add additional constraints about the size or even shape of the region. Applied to medical imaging, region growing works pretty well for large vessel segmentation, and even lung segmentation. For example:

This is the classic picture that comes to my mind when I think of region growing, and it’s actually from a Matlab region growing script on the File Exchange. You can see that the algorithm nicely segmented the right lung in this CT, however there are some holes that should probably be filled, and we haven’t made any sort of delineation of the finer detail that you can see on the left side.
·
Jun 30

Probabilistic Graphical Models (PGM)

Flashcards - PGMs
·
Jun 30

Bayesian MAP Estimate

The **MAP **estimate (maximum a posteriori) estimate is another way to estimate the optimal parameters for a supervised prediction problem. As a reminder, when we had some training set with features x and labels, y, were using maximum likelihood to find the optimal value of θ:

What we didn’t talk about is how this model makes assumptions about our parameters, θ. We were making a frequentist **assumption that these parameters were just constant numbers that we didn’t know. There is actually another field of thought from **Bayesian statistics (these guys love probability) that would look at the problem a different way. With this exact same data, a Bayesian view would think of θ as being a random variable pulled from a distribution, p(θ), and the value(s) that were pulled are unknown. We could model this posterior distribution by taking an integral over θ, but this would be incredibly hard, and it usually can’t be done in closed form. Instead, we will use what is called the MAP estimate to approximate the posterior distribution with a single point estimate. This estimate for θ is given by:

It’s actually exactly the same as maximum likelihood except we have p(θ) (our prior beliefs about the distribution of θ) added to the end.

·
Jun 29

Feature Selection

Science these days is defined by big data. In economics, medicine, politics, just about any field, we are constantly collecting information about things and trying to build models to understand and predict our world. In this light, you can imagine that a lot of this big data is… complete junk. Let’s say that we are collecting phenotypic data about cats with the goal of predicting how ugly they are (yes, I do not like cats so much). We collect everything that we can think of, including hair and eye color, nail and hair length, height, width, mass, tail curl, etc. it’s very likely that there will be some relationship between hair length of ugliness, because hairless cats are terrifying, and longer haired cats are considered fancier (according to the Fancy Feast commercials). However, it’s unlikely that toenail length will be of any use in our model. In fact, including it will make our model very bad. This is where **feature selection comes in! Given that we are overwhelmed with data, we need methods to pick out the features that we want to use in our models.

Feature selection **is a method of model selection that helps to select the most relevant features to the learning task. If we have n features, if we were to try every possible subset of features, this would give us 2^n subsets. For only a handful of features, it might be feasible to do this exhaustive test. For many features, it is a very bad idea. The basic kinds of algorithms for feature selection are **wrapper approaches and filter approaches. Wrapper approaches generally create a loop that ‘wraps” around your learning algorithm and runs it many times to evaluate different subsets, while filter approaches use a heuristic, and calculate some goodness score S(i) for each feature, and then just choose the best. We will start by reviewing the simplest wrapper approaches: forward and backward search

Forward Search

The idea of forward search is that we start with an empty set, and repeatedly iterate through our set of features, each time evaluating the performance of our model with each feature, and at the end choosing to add the feature that has the lowest generalization error. And of course we stop adding features if we aren’t improving our ability to classify the data. More simply put:

Create an empty set, let’s call it S.

Repeat {
1. Train a classifier with S + each feature that is not currently in S, and estimate its generalization error
2. Choose the feature from above that does best, add to S.
}

Stop when error stops improving, or we run out of features, and here we output the best feature subset.

So for our problem of selecting features that are informative about cat ugliness, we would likely start with an empty subset, and then build and evaluate a classifier using each of the features alone. We would find that hair length performs best to predict ugliness, so we would add that feature to our set, and then start round two. In round two we would again add each feature (all features not currently in our subset) to our current set, S, and build and evaluate a classifier. We would then add the feature to S that improves the model. For example. we might find that hair length + eye color do better to predict ugliness than hair length alone. We would continue in this manner of adding features until we hit a desired number of features, a desired performance, or if performance stops improving. And how do we figure out when to stop adding features? It is common to use the F-ratio:

You can see that this looks at the difference between the residual sum of squares (between the old and new model), and we divide by our estimate of the variance. If the F ratio is a big number, (I think) this means that our old error was much larger than the new, and we should keep going. I the F ratio is negative, then the new model had a higher error than the old, and we should not continue. If the F ratio is zero, then we are indifferent.

Backward Search

Backward search is similar to forward search, except we start with ALL the features of our data in our subset, and remove the feature that leaves a subset that does better than previously. We of course stop if our performance stops improving, or if we reach an empty set. More simply:

Start with our set S = all features

Repeat {
1. Train a classifier with S - (minus) each single feature that is still in S, and estimate its generalization error
2. Remove the feature from S whose removal improves performance most
}

When we get to here, we hopefully have a subset of features, and haven’t removed them all.

Of course, these filter methods can be very computationally expensive because we have to run our model so many times. In the case of very large data, it can be easier to use a filter method.

Filter Feature Selection

A good example of a heuristic for filter feature selection is calculating the (absolute value of) the correlation between each feature x and the outcome variable y. This is pretty simple. More commonly, we use something called mutual information. Mutual information gets at how much we can learn about one variable from another, and the resulting calculation is between 0 and 1. A mutual information score of 1 means that knowing X tells us exactly what Y is, and it’s a good feature to have. A mutual information score of 0 says that X tells us nothing about Y, and we can throw it away. The calculation for the mutual information between X and Y is as follows:

If x and y are completely independent, meaning that X tells us nothing about Y, then p(x,y) = p(x)p(y), and log(1) goes to zero, and the resulting MI value also approaches (or is) zero. However, if x and y are completely dependent, this means that knowing X gives us a lot of information about Y, and our resulting MI value gets larger. We can therefore calculate these MI scores for each feature and Y, and use something like cross validation to decide how many k features to include in our model.

Again, for our problem of predicting cat ugliness from phenotype, we would likely find high MI values for features like hair length and eye color, and values close to zero for something like toe nail length. We would then use cross validation with different numbers of features (k=1:5, for example) to decide on the optimal value of k corresponding to the number of features to include in our model.

A Few Tips About Feature Selection
- Adding redundant variables (i.e., transforming your data) can achieve better noise reduction and better class separation. For example, if we have a feature x, by transforming it to make a new feature x_2 we aren’t adding any new information, but this new feature transformation might be better at distinguishing class labels. This is likely why kernels are so important for support vector machines - we are essentially transforming our features to infinite dimensional spaces with the goal of making them linearly separable.
- Perfect correlation or anti-correlation doesn’t say anything about variable utility for classification. Two variables could be perfectly correlated, but still very separable (for example, think of some line like Y=2x, and then shift it up a few units, Y=2X+3.
- Unfortunately, selecting a useful subset of features is not the same thing as selecting features that perform well individually. Some features are useless on their own, but when taken with others, can result in significant improvement of classification performance.
·
Jun 29

Support Vector Machines (SVMs)

Support Vector Machines (SVMs) **are supervised machine learning classifiers based on the idea that if we plot our data in some n dimensional space, we can draw a **separating hyperplane between the classes. Based on this description alone, you will guess correctly that SVMs are appropriate for distinguishing two binary classes. You might also guess that a metric that we will use to construct our model is calculating the distance of the points on either side of the hyperplane to the hyperplane itself. For any set of datapoints with appropriate labels {-1,1}, our “best” hyperplane drawn to separate the points will maximize this distance.

Parameters to Describe a SVM

We define our classifier with parameters w, and b:

If the stuff inside the parentheses (g(stuff)) is greater than or equal to zero, then we spit out a class of 1, and say that the class is -1 otherwise.

The Functional Margin… gives us confidence in our SVM?

To define the functional margin of our classifier, we would look at all of the distances between each point and our separating hyperplane, and choose the smallest one. For any one point (x,y), we define the functional margin as:

If our class label is 1, to be correct in our prediction we want this value to be a large positive number. If our class label is -1, a correct prediction coincides with a large negative number. However, do you see a problem with using the functional margin as a measure of confidence in how well our classifier predicts a class label?

The Geometric Margin Gives Us Confidence in our SVM!

The problem with using the functional margin for evaluation is that if we were to change the scaling of our parameters w or b, we would also make the output value much larger, or much smaller, but there is fact would be no change to the classifier itself. What we need to do is somehow normalize this calculation, and this is where the geometric margin comes in. The geometric margin is exactly the same as the functional, but we divide by the norm of the vectors w and b, so any change in scaling doesn’t influence the resulting value. And equivalently, the smallest geometric margin across all points is our final value.

How do we build an optimal margin classifier?

Given that we have a metric that represents the goodness of a decision boundary for our classifier, you might have guessed correctly that we will try to maximize this boundary to build our classifier. The problem is now set up as:

We are now faced with the problem that the functional margin divided by the norm of the vector w (the first line) is a non convex problem. This means that it doesn’t look like a nice parabola that we could crawl along and eventually end up at a nice maximum value. To deal with this, we are going to take advantage of the fact that we can change the scaling of both w and b without influencing the final calculating of the geometric margin. We are going to make it required that the functional margin (y hat) is equal to 1, and so since maximizing the above is equivalent to minimizing it’s inverse, the new problem becomes:

the solution being… the optimal margin classifier! **This means that, for**any particular training example (m), we are subject to the following constraint (I just subtracted the 1 from the right side in the second line above)

Remember that our separating hyperplane is going to be defined by the closest points on either side to it. This is where the name “support vector machine” comes from - these points are called the support vectors, because their functional margin is exactly equal to 1, or for the equation above, g(w) = 0. These “support vectors” are illustrated in the image below. Can you see that they are the only points to actually define the separating hyperplane? Technically, we don’t care so much about the rest of the points.

](http://www.vbmis.com/learn/wp-content/uploads/2013/06/eq112.png)The “support vectors” are where the functional margin is equal to 1, or g(w) = 0. These are the points that ultimately determine our optimal separating hyperplane.

In our optimization problem, we are only going to care about these three points. Here is where it gets a little tricky, and I’ll do my best to explain. Honestly, I never understood this very well, and so I’ll state it point blank, and if you want proofs, you can find them on your own. When we have some optimization problem subject to constraints (as we do here), the Lagrange Dual Function gives lower bounds on the optimal value of the original function (insert hand waving here). So to solve for our SVM, we construct a Lagrangian as follows:

and then we find the dual form of the problem by minimizing the above with respect to w and b, and remembering that alpha > 0 (Insert a ridiculous amount of hand waving here). We now have the dual optimization problem:

and the goal is to find the alphas that maximize the equation. The x’s in the bent parentheses represent a kernel, which is **a function that maps your data to a higher dimensional space, the idea being if the data aren’t linearly separable, when we map them to an infinite dimensional space, they can be.

An example algorithm to solve the dual problem is called Sequential Minimal Optimization (SMO). Broadly, this algorithm iterates through each training example (1 through m), and holds all parameters alpha except for the current training example constant, and the equation W alpha is optimized with respect to that training example parameter. We keep going until convergence. This method is called coordinate ascent. For the SVM, we would do the following:

Going until Convergence {
1. Pick an alpha_x and alpha_y with some heuristic
2. Reoptimize W alpha with respect to those two, and all other parameters are held constant.
}

That is probably the limit of my understanding of SVMs. I recommend reading Matlab documentation if you want a more satisfactory explanation. If you are using SVMs, it’s pretty trivial to train and test with Matlab or R functions, and you would want to try different kernels to get a sense for which works best with your data.
·
Jun 27

Gaussian Discriminant Analysis (GDA)

Gaussian Discriminant Analysis (GDA) is another generative learning algorithm that models p(x y) as a multivariate normal distribution. This means that:

The cursive N symbol is used to represent this particular distribution, which represents a nasty looking density equation that is parameterized by:
- μ0 and μ1: mean vectors
- Σ: the covariance matrix
- φ: what we vary to get different distributions
Covariance

This is a two by two matrix, and for a standard normal distribution with zero mean, we have the identity matrix. As the covariance gets larger (e.g., if we multiply it by a factor > 1), it spreads out and squashes down. As covariance gets smaller (multiply by something less than 1), the distribution gets taller and thinner. If we increase the off-diagonal entry in the covariance matrix, we skew the distribution along the line y=x. If we decrease the off-diagonal entry, we skew the distribution in the opposite direction.

Mean

In contrast, varying the mean actually translates (moves) the entire distribution. Again, with the mean as the identity matrix, we have the mean at the origin, and a change to that corresponds to moving that number of units in each direction, if you can imagine sliding around the distribution in the image below:

Writing Out the Distributions

Ok, brace yourself, here comes the ugly probability distributions! Keep in mind that the symbols are just numbers, please don’t be scared. You will recognize the first as Bernoulli, and the second and third are the probability density functions for the multivariate Gaussian:

And as we have been doing, we now want to choose the parameters with maximum likelihood estimation. In the case of multiple parameters in our function, we want to maximize the likelihood with respect to each of the parameters:

You can do different kinds of discriminant analysis in Matlab and also in R. Note that Linear Discriminant Analysis (LDA) assumes a shared covariance matrix, while Quadratic Discriminant Analysis(QDA) does not.

When to use GDA?
- if p(x y) is multivariate Gaussian (with shared Σ), then p(y x) follows a logistic function, however the converse is not true!
- GDA makes stronger modeling assumptions than logistic regression, so we would expect it to do better if our modeling assumptions are correct.
- Logistic regression makes weaker assumptions about our data, which means that it is more robust, so if we are wrong about our data being Gaussian, it will do better.
·
Jun 27

Image Registration

When we talk about registration, we are talking about a transform of one image into another image’s space by way of transforms. What does this mean exactly? If each image is a matrix of numbers, the “transform” is another matrix that we can multiply our image-1 numbers by in order for it to line up nicely with image-2. There are two kinds of registration: linear and non-linear, and the best way to explain the difference is with a putty metaphor. Let’s say that someone hands me a small box, and asks me to fit a series of Tetris pieces nicely into the bottom. I create my Tetris pieces out of clay, bake them in the oven, and try to fit them into the box. I can flip them, move them up, down, left, right, or rotate them. This is akin to a linear or a rigid registration. I can make a lot of movements, but I’m limited to translation, rotation, and flipping.

Frustrated that I can’t get it to work, I decide to be sneaky. I recreate my Tetris pieces out of clay, but I don’t bake them this time. I put them in the box in a formation that almost fits, and then I proceed to stretch one piece to double it’s width, fold a straight piece into an L, and squeeze one piece to take up less space. Ta-da! Task accomplished! This is an example of a non-linear registration, and if I were to write down the exact manipulations that I did, that would be equivalent to saving a translation matrix.** **It’s useful to save this matrix in the case that you want to apply your registration to other images.

When do I want linear vs non-linear registration?

Generally, we only want non-linear registration if we need to warp the shape or size of our data. For linear, we basically want to define an affine transformation matrix that maps one image to the other. I wrote up this handout to explain this matrix:

Unable to display PDF [Click here to download](http://www.vbmis.com/learn/wp-content/uploads/2013/06/Affine_Registration.pdf)

What is not explained in this handout is exactly how we solve for the matrix, A. We use the idea of mutual information, or maximizing the probability of both images. We want to use mutual information instead of something like least squares because with different modalities, the intensities may not be in the same scale. A very good description of the method is included in this paper (bottom of page 4):

Unable to display PDF [Click here to download](http://www.vbmis.com/learn/wp-content/uploads/2013/06/Medical-image-registration-using-mutual-information.pdf)

·

Generalized Linear Models (GLM)

If you think of regression (e.g., linear with gaussian distributions) and classification (logisitic with bernoulli) as cousins, they belong under the larger family of General Linear Models. A GLM is a broad family of models that can all be simplified to a particular equation that defines something in the “exponential family.” Basically, if you can simplify your distribution into this form, it belongs to this family. This is the equation that defines something in the exponential family:

If we choose a particular value of T, a, and b, then we define a family of distributions parameterized by η. Then when we vary the η we get different distributions within that family. And all of these families are, broadly, general linear models. So, what are each of these parameters?

η: The natural, or canonical parameter
T(y): The sufficient statistic
a(η): The log partition function
canonical response function: is the function (g) that states the mean as a function of η. So g(η) = E[T(y); η]). The inverse is the canonical link function.

###

What assumptions do we make?

We usually start with some data, including features (x) and some random variable y, and we want to predict p(y

x). To construct a GLM, we make the following three assumptions:

p(y|x) parameterized by θ is a member of the exponential family, so it can be re-written in the exponential family form above parameterized by η.
Our predicted value, our hypothesis h(x), should output the expected value of y given x: E[y x;θ]
The natural parameter η and the inputs x are linearly related

I’m not going to go through the derivation of any of these models, because it’s largely hard statistics that give me a headache, and I think that it’s more important to understand the broad concept. If you can start with a bernoulli, gaussian, poisson, or some other distribution, and get it in this form, then you know that it is a generalized linear model. And more importantly is the application. If you have a dataset of interest, you would want to use appropriate functions in your scripting language of choice (e.g., Matlab and R) to try fitting your data with different models, and evaluating how well each fits. If you are interested, here are the parameters for some common distributions, but I’d say that you will be fine just starting from the application standpoint.

Bernoulli

Bernoulli(φ) (meaning with mean φ) specifies a distribution over y ∈ {0, 1}, such that:

p(y = 1; φ) = φ
p(y = 0; φ) = 1 − φ

The parameters, defining it as a member of the exponential family and a GLM, are as follows:

Gaussian Distribution

Multinomial Distribution

Laplace Smoothing

Why do we need Laplace Smoothing?

Let’s return to the problem of using Naive Bayes to classify a recipe as “ice cream”(1) or “sorbet” (2) based on the ingredients (x). When we are calculating our posterior probabilities, or the probability of each feature x given the class, we look to our training data to tell us how many recipes of a certain class have a particular ingredient p(xi

y=1) or p(xi

y=0). Let’s say that we are classifying a new recipe, and we come across a rather exotic ingredient, “thai curry.” When we go to our training data to calculate the posterior probabilities, we don’t have any recipes with this ingredient, and so the probability is zero. When we introduce a zero into our equation calculating the p(y=1

x), we run into a problem:

The capital pi symbol means that we are multiplying, so if it’s the case that just one of those values is zero… we wind up with 0 or NaN! Oh no! Abstractly, we can’t say that just because we have not observed an event (a feature, ingredient in this case) that it is completely impossible. This is why we need Laplace Smoothing.

Why is this method called Laplace Smoothing?

You can probably guess that I’m about to tell you about a method that somehow deals with the very not ideal situation of having a posterior probability of zero. Why is it called Laplace Smoothing? This is the story, according to me, based on someone else telling me some time ago. There was this guy named Laplace who was pretty spiffy with mathematics and astronomy, and he thought about some pretty abstract stuff such as “What is the probability that the sun does not rise?” Now, if we were using Naive Bayes, we would of course calculate the posterior probabilities as follows:

p(x sun rises) = 1

p(x ~sun rises) = 0

Right? It has never been the case that the sun has not risen. We have no evidence of it! Laplace, however, realized that even if we have never observed this event, that doesn’t mean that it’s probability is truly zero. It might be something extremely tiny, and we should model it as such. How do we do that?

How do I “Laplace Smooth” my Data?

It’s very simple, so simple that I’ll just state it in words. To ensure that our posterior probabilities are never zero, we add 1 (+1) to the numerator, and we add k (+k) to the denominator, where k = the number of classes. So, in the case that we don’t have a particular ingredient in our training set, the posterior probability comes out to 1 / m + k instead of zero. Plugging this value into the product doesn’t kill our ability to make a prediction as plugging in a zero does. Applied to Naive Bayes with two classes (ice cream vs sorbet), our prediction equation (with k=2) now looks like this:

Thanks, Laplace… you da man!

Naive Bayes

Naive Bayes is a supervised, probabilistic classification method for discrete classes that makes some strong assumptions about the data, but works pretty well. If you think back to regression, regression, we were trying to predict p(y

x), and this is called a discriminative learning model. We were trying to predict the probability of a class given a feature set. What if we turn this on its head, and instead model p(x

y), or in other words, given a particular feature set, the probability of each class? This is called a generative learning model. Abstractly, we create models of each of our classes based on the features (x), and then for a new set of features that we want to classify, we ascribe the label of the model that is most likely given those features. As usual, to explain it I will set up a toy problem.

Let’s say that I really like ice cream and sorbet. I have a huge catalog of recipes for the two, and they are nicely organized in two folders. I then proceed to drop the folders, and at the same time a huge wind picks up and scatters the recipe papers everywhere! I now have no idea which recipes are for sorbet, and which are ice cream. I have thousands of them, and so manual curation is not feasible, however I’m willing to look through and label a subset of the recipes. We will call this subset my trainingset.

I then decide to get clever. I know that I can scan my recipes and have the text read automatically, so I decide to write a little script that can classify the recipe based on the ingredients. I know that ice cream and sorbet have different ingredients, namely ice cream is typically cream or milk based, and sorbet is not. I could do a really simple algorithm that slaps a label of ice cream on anything with a milk based ingredient, but I am hesitant to do that because I know that some of my sorbet recipes have mix-ins and toppings that use milk products, and my ice cream recipes most definitely use sugar and fruit. Instead, I decide that naive bayes is the way to go!

The Basic Steps

Create my training set: I take a sample of my recipes, and label them as “ice cream” or “sorbet.”
Define my features: My features are the ice cream ingredients (e.g., cream, sugar, milk, strawberry, vanilla, etc.). I contact this crazy company to get a comprehensive list of all ingredients that can be used for ice cream or sorbet. I then put them in alphabetical order (this list is called my vocabulary), and write a little script that works with my text reader. As I scan each recipe, my script creates a vector of 0’s and 1’s that correspond to if the ingredient is in the recipe (1) or not (0) (Note that using 0s and 1s is a version of Naive Bayes called a Bernoulli Naive Bayes model). I do this for each recipe in my training set.

**Model p(x

y):** I now want to model the probability of a particular combination of ingredients given that I have an ice cream or sorbet recipe. For example, if my ingredients are “cream, vanilla, sugar, salt, and eggs” the p(ingredients

ice cream) would be very high, while the p(ingredients

sorbet) would be close to zero. You can see why I wouldn’t want to do some sort of gaussian discriminant analysis… there are way too many features! This is where we make drumroll

The Naive Bayes Assumption

We assume that each of our x features is conditionally independent given y. This means that if I tell you a particular recipe has the ingredient “strawberry,” this tells you nothing about whether or not it has “cream.” Now, for particular pairs of ingredients this might be true, however we can see right away that all of our variables are not conditionally independent. If you tell me that a recipe has “eggs,” that increases my belief that it has some milk-based ingredient, because it’s likely an ice cream. However, Naive Bayes is (unfortunately?) still applied to many problem spaces for which conditional independence is not the case. Let’s continue the example. We model the p(x

y) as:

We again would want to write an equation that represents the likelihood of our data, and then maximize it. This equation should is parameterized by the following:

φi y=1 = p(xi = 1 y = 1) The fraction of our ice cream recipes (y=1) that have an ingredient i
φi y=0 = p(xi =1 y= 0 ) The fraction of our sorbet recipes (y=0) that have an ingredient i
φy = p(y = 1) The fraction of all recipes that are ice cream (y=1)

More explicitly, we define these as:

However, I find it much easier and more intuitive to define them based on the textual descriptions above. The notation says that we add 1 to the sum if an ingredient j is present for a recipe i AND (^) the class of the recipe is ice cream (y=1). Basically, this is counting!

So if I pick up a recipe without a label, I would define my features for the recipe (the vector of 1’s and 0’s that corresponds to the absence and presence of each ingredient), and then I would use the entire labeled data to calculate each of the above (this is like the “training” step). I would then want to know, is my unlabeled recipe more likely to be ice cream or sorbet? To answer this question I need to know:

p(y = 1 x) The probability of ice cream given my features
p(y = 0 x) The probability of sorbet given my features

And now drumroll we use Bayes Rule:

This is a lot more simple than the equation suggests, because at the end of the day we are just calculating two numbers (each a probability of a class), and assigning the one that is larger. It would be arduous to do by hand, but is relatively feasible to write a script to do for you. To put this in the context of “training” and “testing” - training would encompass calculating our posterior probabilities from some training subset, and testing would mean coming up with a set of predictions for a test set, and then evaluating those predictions with standard ROC curve metrics.

How do we deal with more than two classes?

For the above example, we model our problem with features that fit a Bernoulli distribution (0 or 1), the Bernoulli Naive Bayes **model. If you wanted to use features with more than two values (e.g., k={1,2,3,…n} you would use a **Multinomial Naive Bayes model, and simply model them as multinomial. The parameterization for each word is a multinomial distribution and not a Bernoulli, and we instead are modeling a distribution for words in each position of the document. We are making the assumption that the word in position 1 is independent of the word in position 2.

In the case of continuous values, you could simply discretize them, or define bins based on ranges of numbers, each bin corresponding to an integer.

Summary

Overall, even if the independence assumption isn’t completely true, Naive Bayes is incredibly efficient and easy to put together, and works very well with problems that have many (weakly) relevant features. However, be careful if you have many highly correlated features, in which case the model will not work as well.

Jun 22

Diffusion Tensor Imaging (DTI)

Diffusion Tensor Imaging looks at white matter integrity. It is a MRI based method that uses diffusion properties of water molecules to construct white matter fibers (anatomical connections) between brain regions. There are many metrics of white matter tracts you can calculate from the data, but the two most commonly done things are tractography, or seeing how things are connected and how strongly, or Fractional Anisotrophy, or calculating the level of perfusion at each voxel.

This is currently the only non-invasive method for mapping anatomical connections in the human brain, and the disadvantages are that you can’t get a sense of the direction of the connection, crossing fibers are challenging to pick up on, and smaller fibers are harder to detect. We are getting better with higher resolution scanners, but it isn’t perfect, of course.

How does the scan work?
1. Standard 90 deg excite
2. Gradient pair (G,-G) has no net effect on stationary spin
3. Spins that have moved experience a mean field that differs from stationary. This dephases them and causes T2 to decay
4. The amount of the decay depends on the distance moved during time DT (diffusion time)
What software should I use to process data?

The two suites I have experience with are FSL (via DTI-fit and BEDPOST-X for tractography) and the Diffusion II toolbox in SPM, now available under an umbrella of packages called SPMTools. There are many more GUI based solutions available, and a good overview can be seen here.
·
Jun 22

K-Means Clustering

K-Means clustering is a bread and butter method of unsupervised clustering in machine learning, and I love it because it’s so intuiitive, and possibly because I learned about it in one of my favorite classes at Stanford, BMI 214 with Russ Altman my first year. How does it work?

When would I want to use K-Means?

We use K-means when we want to uncover classes in a large, unlabeled dataset defined by a set of features. For example, let’s say that a lemur invades your pantry, and mixes all of your cereals into one giant mess. While some people like their cereal as a pot-potpourri, you are horrified, and want nothing more than a bowl with one kind of morsel floating around. Let’s also say that you have a machine that is able to scan a piece of cereal, and extract features that describe the size, shape, ingredients, etc. You realize that if you can extract features for a sample of your cereal and define groups based on those features, you can then use your classifier with the machine to automatically extract features and classify the rest of the cereal. Thank goodness, because having to manually go through each piece would be terribly time consuming. This is, of course, an unsupervised problem because we have no idea about the class labels, and non-parametric because the model is based on the data itself. We have to keep our sample features around to classify a new piece of cereal, unlike a method like linear regression where we can optimize our parameters and then throw the training data away.

What is the basic algorithm?

**We start with a matrix of features for our data. Let’s say we have a sample of m cereal pieces, each with n features, and so we have an n X m matrix. We want to:
1. Randomly initialize k centroids, which should be in our N-dimensional feature space. We can also randomly select k training samples
2. For each training sample, calculate the distance to each centroid, and assign it to the closest one
3. After each iteration through the training samples, re-calculate the centroid based on averaging the training points assigned to it
4. Repeat 2 and 3 until convergence - meaning that assignments stop changing, a certain number of iterations goes by, or some other criteria
You probably have many questions. How do we decide on a value of K? What about the distance metric? Are we going to find an optimal solution? You should definitely run K-means a few times, because the outcome can somewhat be determined by the initial centroids, and so you could converge to a local optima.

How do we choose our parameters?

First, you have to define your value of K, or how many clusters you believe there to be. This can be feasible if you have some domain knowledge about your problem (for example, I may know that I would typically keep 7-9 cereals in my cabinet, so I would want to try K = {7,8,9]) or it can be more challenging if you haven’t a clue. What is most commonly done is cross validation to find an optimal K, or basically creating a model for a set of Ks, and then choosing the one that works best. ”What works best” is defined by one of several common evaluation metrics for k-means clustering that get at the “goodness” of your clusters. For your distance metric, the most obvious distance metric to use is the Euclidian Distance, however you could use any metric that you like. Here are some common distance metrics, by courtesy of Matlab.

How do I evaluate my clusters?

Generally, there are two broad buckets of cluster evaluation: internal and external evaluation. Internal evaluation methods are specific to one clustering, while external evaluation methods try to compare across clusterings. A good clustering means that members of the same cluster are more similar to one another than to some member(s) of another cluster. We can assess this by comparing the distance of each point to its cluster centroid versus other cluster centroids. A good “tight” clustering has a small within cluster sum of squares, and a large between-cluster sum of squares. There is a nice visualization called a Silhouette Plot that calculates a score between -1 and 1 for the clustering, which could help you choose a good value of K. Lastly, Tibshirani’s Gap Statistic provides another nice visualization to help evaluate a set of clusterings.

What are some derivations?

If you are doing an analysis for which “soft” cluster assignment is appropriate (think of the margins as being fuzzy and allowing for a point to belong to more than one class) then you will want to read about Fuzzy C-Means clustering. If you restrict your centroids to actual points this is called k-medoids, or to medians (k-medians). There are also smarter ways to choose your initial centroids, such as the K-means++ algorithm.

What are some drawbacks to this approach?

We are making a big assumption about our clusters - that they are spherical blobs, and they are the same size. Other interesting patterns in the data that are not of this shape or equivalent size won’t be captured by this algorithm. At the end of the day, each cluster is represented by a single mean vector (of our features). If we want to classify a new point into a cluster, we calculate the distance of that point to all of our mean vectors, and assign it to the closest one.
·
Jun 20

Logistic Regression

Logistic Regression is a supervised learning method for predicting a discrete outcome (y) based on one or more features (x). If you remember in linear regression we were trying to predict a continuous variable, and in logistic regression we are trying to predict a discrete variable. The problem moves from one of prediction to one of classification.

Let’s start with a simple case where our y can only be 0 or 1, i.e., y = {0,1}. If I had a dataset with labels of 0 and 1 and I was looking to build a classifier, I would want my classifier to output a value between 0 and 1 that I could then threshold. A value of .5, for example, would be an obvious cutoff to indicate indifference between the two classes, however if I were building an ROC curve, I would vary my threshold between 0 and 1 to produce a nice curve that spans the tradeoff between calling everything 0, and calling everything 1. With this goal, logistic regression chooses to model the hypothesis with a distribution that is nicely, monotonically increasing between 0 and 1. This distribution is the sigmoid function!

Guess what? This is also called the logistic function, and it is defined as follows:

We can plug in the vector of coefficients multiplied by our x(i) values (this result represents our “prediction”) into this function to define our new hypothesis:

As before, our goal is to now find the parameters that maximize the likelihood of our data, and the likelihood is the probability of the outcomes (y) given the features (X), parameterized by theta. We re-write that expression in terms of each parameter, specifically we model the overall likelihood as the product of all density functions, the pdfs, for each x(i). We need to maximize this thing!

Well, crap. What in the world is that? In this case, we need to remember that we have a binary outcome, and so memories of the binomial distribution might be flying into our heads. Since there are only two outcomes in our sample space, the probability of one outcome is simply 1 - (the other one). In more concise terms:

Above we are defining the probability of y=1 given x as our hypothesis, and then the probability of 0 as… one minus that! We can write this more compactly as:

Try plugging in y = 0 or y = 1 into that equation - you get exactly what we defined for the probabilities of y = 1, and y = 0 in the previous step. Nice! Now let’s plug this into our Likelihood function:

and we need to maximize the above, which again is much easier if we take the log. It simplifies to this (remember that when taking the log of an exponent, you can move it in front of the log):

Again, we want to maximize this log of the likelihood. Let’s take the derivative for just one training example to get our “update rule” if we were doing stochastic gradient descent:

:O It’s exactly the same as the one from linear regression (the LMS update rule), but rest assured this is not the same algorithm, because our hypothesis is the sigmoid function… definitely not linear! So again, if we plug this into a loop and continue until convergence, that is how we find our optimal parameters.

Newton’s Method

Fig Newtons?

No, sorry… this is the person, Newton. He came up with a method for finding the minimum of a function that basically finds tangents along a curve, and jumps to where the tangent = 0 to calculate the next tangent. If you want more (graphical) information, please consult our friend, wikipedia. There is a nice animation there. Our update rule becomes:

If you want a closed form you will need the Hessian, or a matrix of second derivatives. But let’s be real, if you want to use this method, just code it up! Things make more sense in code than scripty letters.

The Perceptron Learning Algorithm

I should probably write a new post on this, but I want to note that if we change our g(z) definition from the sigmoid function to:

and use the same update rule, we get the Perceptron Learning Algorithm. I think that people used to think that this algorithm modeled how the brain works. We obviously know that it’s not quite that simple.
·