"In the last video, we talked about the process of evaluating an anomaly detection algorithm and there we started to use some labelled data, with examples that we knew were either anomalous or not anomalous, with y equals 1 or y equals 0." "So the question then arises, if we have this labeled data, we have some examples that are known to be anomalies and some that are known not to be not anomalies, why don't we just use a supervised learning algorithm," "so why don't we just use logistic regression or a neural network to try to learn directly from our labeled data, to predict whether y equals one or y equals zero." "In this video, I'll try to share with you some of the thinking and some guidelines for when you should probably use an anomaly detection algorithm and when it might be more fruitful to consider using a supervised learning algorithm." "This slide shows, what are the settings under which you should maybe use anomaly detection versus when supervised learning might be more fruitful." "If you have a problem with a very small number of positive examples, and remember examples of y equals one are the anomalous examples, then you might consider using an anomaly detection algorithm inset." "So having 0 to 20, maybe up to 50 positive examples, might be pretty typical, and usually, we have such a small set of positive examples, we are going to save the positive examples just for the cross" "validation sets and test sets." "In contrast, in a typical normal anomaly detection setting, we will often have a relatively large number of negative examples, of these normal examples of normal aircraft engines." "And we can then use this very large number of negative examples," "with which to fit the model p of x." "And so, there is this idea in many anomaly detection applications, you have very few positive examples, and lots of negative examples, and when we are doing the process of estimating p of x, of fitting all those Gaussian parameters," "we need only negative examples to do that." "So if you have a lot of negative data, we can still fit to p of x pretty well." "In contrast, for supervised learning, more typically we would have a reasonably large number of both positive and negative examples." "And so this is one way to look at your problem and decide if you should use an anomaly detection algorithm or a supervised learning algorithm." "Here is another way people often think about anomaly detection algorithms." "So, for anomaly detection applications often there are many different types of anomalies." "So think about aircraft engines." "You know there are so many different ways for aircraft engines to go wrong." "Right?" "There are so many things that could go wrong that could break an aircraft engine." "And so, if that's the case and you have a pretty small set of positive examples, then it can be difficult for an algorithm to learn from your small set of positive examples what the anomalies look like." "And in particular, you know, future anomalies may look nothing like the ones you've seen so far." "So maybe in your set of positive examples, maybe you had seen 5 or 10, or 20 different ways that an aircraft engine could go wrong." "But maybe tomorrow, you need to detect a totally new set, a totally new type of anomaly, a totally new way for an aircraft engine to be broken that you have just never seen before, and if that is the case," "then it might be more promising to just model the negative examples, with a sort of a Gaussian model" "P of X. 
In contrast, in some other problems you have enough positive examples for an algorithm to get a sense of what the positive examples are like. In particular, if you think that future positive examples are likely to be similar to ones in the training set, then in that setting it might be more reasonable to have a supervised learning algorithm that looks at a lot of the positive examples and a lot of the negative examples, and uses that to try to distinguish between positives and negatives.

So hopefully this gives you a sense of whether, for a specific problem, you should use an anomaly detection algorithm or a supervised learning algorithm. The key difference really is that in anomaly detection we have such a small number of positive examples that it is not possible for a learning algorithm to learn that much from them. What we do instead is take a large set of negative examples and learn p of x from just the negative examples, the normal aircraft engines, say, and we reserve the small number of positive examples for evaluating our algorithm, for use in either the cross validation set or the test set. And just as a side comment about these many different types of anomalies: in some earlier videos we talked about the email spam examples. In those examples, there are actually many different types of spam email: spam trying to sell you things, spam trying to steal your passwords (these are called phishing emails), and many other types. But for the spam problem we usually have enough examples of spam email to see most of these different types, because we have a large set of examples of spam, and that's why we usually think of spam as a supervised learning setting, even though there may be many different types of spam.

If we look at some applications of anomaly detection versus supervised learning, we'll find that in fraud detection, if there are many different ways for people to try to commit fraud and you have a relatively small number of fraudulent users on your website, then I would use an anomaly detection algorithm. I should say, if you are a very major online retailer and you have actually had a lot of people try to commit fraud on your website, so that you actually have a lot of examples where y equals 1, then sometimes fraud detection could shift over to the supervised learning column. But if you haven't seen that many examples of users doing strange things on your website, then more frequently fraud detection is treated as an anomaly detection problem rather than as a supervised learning problem. Other examples: we talked about manufacturing already, where hopefully you'll see many normal examples and not that many anomalies. But then again, for some manufacturing processes, if you're manufacturing in very large volumes and you've seen a lot of bad examples, maybe manufacturing could shift to the supervised learning column as well; if you haven't seen that many bad examples of the product, I would use anomaly detection. For monitoring machines in a data center, similar sorts of arguments apply.
"Whereas, email SPAM classification, weather prediction, and classifying cancers, if you have equal numbers of positive and negative examples, a lot of you have many examples of your positive and your negative examples, then, we would tend to" "treat all of these as supervised learning problems." "So, hopefully, that gives you a sense of what are the properties of a learning problem that would cause you to treat it as an anomaly detention problem verses a supervised learning" "problem." "And for many of the problems that are faced by various technology companies and so on, we actually are in these settings where we have very few or sometimes zero positive training examples, maybe there are so many different types of anomalies that we've never" "seen them before, and for those sorts of problems, very often, the algorithm that is used is an anomaly detection algorithm." "By now you've seen all of the main pieces of the recommender system algorithm or the collaborative filtering algorithm." "In this video I want to just share one last implementational detail," "namely mean normalization, which can sometimes just make the algorithm work a little bit better." "To motivate the idea of mean normalization, let's consider an example of where there's a user that has not rated any movies." "So, in addition to our four users, Alice, Bob, Carol, and Dave, I've added a fifth user, Eve, who hasn't rated any movies." "Let's see what our collaborative filtering algorithm will do on this user." "Let's say that n is equal to 2 and so we're going to learn two features and we are going to have to learn a parameter vector theta" "5, which is going to be in R2, remember this is now vectors in Rn not Rn+1," "we'll learn the parameter vector theta 5 for our user number 5, Eve." "So if we look in the first term in this optimization objective, well the user Eve hasn't rated any movies, so there are no movies for which Rij is equal to one for the user Eve and so this first term plays no" "role at all in determining theta 5 because there are no movies that Eve has rated." "And so the only term that effects theta 5 is this term." "And so we're saying that we want to choose vector theta 5 so that the last regularization term is as small as possible." "In other words we want to minimize this lambda over 2 theta 5 subscript 1 squared" "plus theta 5 subscript 2 squared so that's the component of the regularization term that corresponds to user 5, and of course if your goal is to minimize this term, then what you're going to end up" "with is just theta 5 equals 0 0." "Because a regularization term is encouraging us to set parameters close to 0 and if there is no data to try to pull the parameters away from" "0, because this first term doesn't effect theta 5, we just end up with theta 5 equals the vector of all zeros." "And so when we go to predict how user 5 would rate any movie, we have that theta 5 transpose xi," "for any i, that's just going to be equal to zero." "Because theta 5 is 0 for any value of x, this inner product is going to be equal to 0." "And what we're going to have therefore, is that we're going to predict that Eve is going to rate every single movie with zero stars." "But this doesn't seem very useful does it?" "I mean if you look at the different movies," "Love at Last, this first movie, a couple people rated it 5 stars." "And for even the Swords vs. Karate, someone rated it 5 stars." "So some people do like some movies." "It seems not useful to just predict that Eve is going to rate everything 0 stars." 
"And in fact if we're predicting that eve is going to rate everything 0 stars, we also don't have any good way of recommending any movies to her, because you know all of these movies are getting exactly the" "same predicted rating for Eve so there's no one movie with a higher predicted rating that we could recommend to her, so, that's not very good." "The idea of mean normalization will let us fix this problem." "So here's how it works." "As before let me group all of my movie ratings into this matrix" "Y, so just take all of these ratings and group them into matrix Y. And this column over here of all question marks corresponds to" "Eve's not having rated any movies." "Now to perform mean normalization what I'm going to do is compute the average rating that each movie obtained." "And I'm going to store that in a vector that we'll call mu." "So the first movie got two 5-star and two 0-star ratings, so the average of that is a 2.5-star rating." "The second movie had an average of 2.5-stars and so on." "And the final movie that has 0, 0, 5, 0." "And the average of 0, 0, 5, 0, that averages out to an average of 1.25 rating." "And what I'm going to do is look at all the movie ratings and I'm going to subtract off the mean rating." "So this first element 5 I'm going to subtract off 2.5 and that gives me 2.5." "And the second element 5 subtract off of 2.5, get a 2.5." "And then the 0, 0, subtract off 2.5 and you get -2.5, -2.5." "In other words, what" "I'm going to do is take my matrix of movie ratings, take this wide matrix, and subtract form each row the average rating for that movie." "So, what I'm doing is just normalizing each movie to have an average rating of zero." "And so just one last example." "If you look at this last row, 0 0 5 0." "We're going to subtract 1.25, and so I end up with" "these values over here." "So now and of course the question marks stay a question" "mark." "So each movie in this new matrix Y has an average rating of 0." "What I'm going to do then, is take this set of ratings and use it with my collaborative filtering algorithm." "So I'm going to pretend that this was the data that I had gotten from my users, or pretend that these are the actual ratings I had gotten from the users, and I'm going to use this as my" "data set with which to learn my parameters theta" "J and my features XI" " from these mean normalized movie ratings." "When I want to make predictions of movie ratings, what I'm going to do is the following: for user J on movie" "I, I'm gonna predict theta" "J transpose XI, where" "X and theta are the parameters that I've learned from this mean normalized data set." "But, because on the data set, I had subtracted off the means in order to make a prediction on movie i," "I'm going to need to add back in the mean, and so i'm going to add back in mu i." "And so that's going to be my prediction where in my training data subtracted off all the means and so when we make predictions and we need to add back in these means mu i for movie i." "And so specifically if you user 5 which is Eve, the same argument as the previous slide still applies in the sense that Eve had not rated any movies and so the learned parameter for user 5 is still going to" "be equal to 0, 0." "And so what we're going to get then is that on a particular movie i we're going to predict for Eve theta" "5, transpose xi plus add back in mu i and so this first component is going to be equal to zero, if theta five is equal to zero." "And so on movie i, we are going to end a predicting mu i." 
"And, this actually makes sense." "It means that on movie 1 we're going to predict Eve rates it 2.5." "On movie 2 we're gonna predict Eve rates it 2.5." "On movie 3 we're gonna predict Eve rates it at 2 and so on." "This actually makes sense, because it says that if Eve hasn't rated any movies and we just don't know anything about this new user Eve, what we're going to do is just predict for each of the movies, what are" "the average rating that those movies got." "Finally, as an aside, in this video we talked about mean normalization, where we normalized each row of the matrix y," "to have mean 0." "In case you have some movies with no ratings, so it is analogous to a user who hasn't rated anything, but in case you have some movies with no ratings, you can also play with versions of the algorithm, where you" "normalize the different columns to have means zero, instead of normalizing the rows to have mean zero, although that's maybe less important, because if you really have a movie with no rating, maybe you just shouldn't recommend that movie to anyone, anyway." "And so, taking care of the case of a user who hasn't rated anything might be more important than taking care of the case of a movie that hasn't gotten a single rating." "So to summarize, that's how you can do mean normalization as a sort of pre-processing step for collaborative filtering." "Depending on your data set, this might some times make your implementation work just a little bit better." "In this video, I'd like to talk about how to initialize" "K-means and more importantly, this will lead into a discussion of how to make K-means avoid local optima as well." "Here's the K-means clustering algorithm that we talked about earlier." "One step that we never really talked much about was this step of how you randomly initialize the cluster centroids." "There are few different ways that one can imagine using to randomly initialize the cluster centroids." "But, it turns out that there is one method that is much more recommended than most of the other options one might think about." "So, let me tell you about that option since it's what often seems to work best." "Here's how I usually initialize my cluster centroids." "When running K-means, you should have the number of cluster centroids, K, set to be less than the number of training examples M. It would be really weird to run" "K-means with a number of cluster centroids that's, you know, equal or greater than the number of examples you have, right?" "So the way I usually initialize K-means is," "I would randomly pick k training examples." "So, and, what" "I do is then set Mu1 of MuK equal to these k examples." "Let me show you a concrete example." "Lets say that k is equal to 2 and so on this example on the right let's say I want to find two clusters." "So, what I'm going to do in order to initialize my cluster centroids is, I'm going to randomly pick a couple examples." "And let's say, I pick this one and I pick that one." "And the way I'm going to initialize my cluster centroids is, I'm just going to initialize" "my cluster centroids to be right on top of those examples." "So that's my first cluster centroid and that's my second cluster centroid, and that's one random initialization of K-means." "The one I drew looks like a particularly good one." "And sometimes I might get less lucky and maybe I'll end up picking that as my first random initial example, and that as my second one." "And here I'm picking two examples because k equals 2." 
"Some we have randomly picked two training examples and if" "I chose those two then I'll end up with, may be this as my first cluster centroid and that as my second initial location of the cluster centroid." "So, that's how you can randomly initialize the cluster centroids." "And so at initialization, your first cluster centroid Mu1 will be equal to x(i) for some randomly value of i and" "Mu2 will be equal to x(j) for some different randomly chosen value of j and so on, if you have more clusters and more cluster centroid." "And sort of the side common." "I should say that in the earlier video where I first illustrated K-means with the animation." "In that set of slides." "Only for the purpose of illustration." "I actually used a different method of initialization for my cluster centroids." "But the method described on this slide, this is really the recommended way." "And the way that you should probably use, when you implement K-means." "So, as they suggested perhaps by these two illustrations on the right." "You might really guess that K-means can end up converging to different solutions depending on exactly how the clusters were initialized, and so, depending on the random initialization." "K-means can end up at different solutions." "And, in particular, K-means can actually end up at local optima." "If you're given the data sale like this." "Well, it looks like, you know, there are three clusters, and so, if you run K-means and if it ends up at a good local optima this might be really the global optima, you might end up with that cluster ring." "But if you had a particularly unlucky, random initialization, K-means can also get stuck at different local optima." "So, in this example on the left it looks like this blue cluster has captured a lot of points of the left and then the they were on the green clusters each is captioned on the relatively small number of points." "And so, this corresponds to a bad local optima because it has basically taken these two clusters and used them into" "1 and furthermore, has split the second cluster into two separate sub-clusters like so, and it has also taken the second cluster and split it into two separate sub-clusters like so, and so, both of these examples on the lower" "right correspond to different local optima of K-means and in fact, in this example here, the cluster, the red cluster has captured only a single optima example." "And the term local optima, by the way, refers to local optima of this distortion function J, and what these solutions on the lower left, what these local optima correspond to is really solutions where K-means has gotten stuck to the local" "optima and it's not doing a very good job minimizing this distortion function J. So, if you're worried about K-means getting stuck in local optima, if you want to increase the odds of K-means finding the best" "possible clustering, like that shown on top here, what we can do, is try multiple, random initializations." "So, instead of just initializing K-means once and hopping that that works, what we can do is, initialize K-means lots of times and run K-means lots of times, and use that to try to make sure we get" "as good a solution, as good a local or global optima as possible." "Concretely, here's how you could go about doing that." "Let's say, I decide to run" "K-meanss a hundred times so I'll execute this loop a hundred times and it's fairly typical a number of times when came to will be something from 50 up to may be 1000." 
"So, let's say you decide to say K-means one hundred times." "So what that means is that we would randomnly initialize K-means." "And for each of these one hundred random intializations we would run K-means and that would give us a set of clusteringings, and a set of cluster centroids, and then we would then compute the distortion J, that is compute this cause function on" "the set of cluster assignments and cluster centroids that we got." "Finally, having done this whole procedure a hundred times." "You will have a hundred different ways of clustering the data and then finally what you do is all of these hundred ways you have found of clustering the data, just pick one, that gives us the lowest cost." "That gives us the lowest distortion." "And it turns out that if you are running K-means with a fairly small number of clusters , so you know if the number of clusters is anywhere from two up to maybe 10 - then doing multiple random initializations can often, can sometimes make" "sure that you find a better local optima." "Make sure you find the better clustering data." "But if K is very large, so, if" "K is much greater than 10, certainly if K were, you know, if you were trying to find hundreds of clusters, then," "having multiple random initializations is less likely to make a huge difference and there is a much higher chance that your first random initialization will give you a pretty decent solution already" "and doing, doing multiple random initializations will probably give you a slightly better solution but, but maybe not that much." "But it's really in the regime of where you have a relatively small number of clusters, especially if you have, maybe 2 or 3 or 4 clusters that random initialization could make a huge difference in terms of" "making sure you do a good job minimizing the distortion function and giving you a good clustering." "So, that's K-means with random initialization." "If you're trying to learn a clustering with a relatively small number of clusters, 2, 3," "4, 5, maybe, 6, 7, using multiple random initializations can sometimes, help you find much better clustering of the data." "But, even if you are learning a large number of clusters, the initialization, the random initialization method that I describe here." "That should give K-means a reasonable starting point to start from for finding a good set of clusters." "In this video I'd like to talk about one last detail of K-means clustering which is how to choose the number of clusters, or how to choose the value of the parameter" "capsule K. To be honest, there actually isn't a great way of answering this or doing this automatically and by far the most common way of choosing the number of clusters, is still choosing it manually by looking at visualizations or by" "looking at the output of the clustering algorithm or something else." "But I do get asked this question quite a lot of how do you choose the number of clusters, and so I just want to tell you know what are peoples' current thinking on it although, the most" "common thing is actually to choose the number of clusters by hand." "A large part of y, it might not always be easy to choose the number of clusters is that it is often generally ambiguous how many clusters there are in the data." "Looking at this data set some of you may see four clusters and that would suggest using K equals 4." "Or some of you may see two clusters and that will suggest K equals" "2 and now this may see three clusters." 
"And so, looking at the data set like this, the true number of clusters, it actually seems genuinely ambiguous to me, and I don't think there is one right answer." "And this is part of our supervised learning." "We are aren't given labels, and so there isn't always a clear cut answer." "And this is one of the things that makes it more difficult to say, have an automatic algorithm for choosing how many clusters to have." "When people talk about ways of choosing the number of clusters, one method that people sometimes talk about is something called the Elbow Method." "Let me just tell you a little bit about that, and then mention some of its advantages but also shortcomings." "So the Elbow Method, what we're going to do is vary" "K, which is the total number of clusters." "So, we're going to run K-means with one cluster, that means really, everything gets grouped into a single cluster and compute the cost function or compute the distortion" "J and plot that here." "And then we're going to run K means with two clusters, maybe with multiple random initial agents, maybe not." "But then, you know, with two clusters we should get, hopefully, a smaller distortion," "and so plot that there." "And then run K-means with three clusters, hopefully, you get even smaller distortion and plot that there." "I'm gonna run K-means with four, five and so on." "And so we end up with a curve showing how the distortion, you know, goes down as we increase the number of clusters." "And so we get a curve that maybe looks like this." "And if you look at this curve, what the Elbow Method does it says "Well, let's look at this plot." "Looks like there's a clear elbow there"." "Right, this is, would be by analogy to the human arm where, you know, if you imagine that you reach out your arm, then, this is your shoulder joint, this is your elbow joint and I guess, your hand is at the end over here." "And so this is the Elbow Method." "Then you find this sort of pattern where the distortion goes down rapidly from 1 to 2, and 2 to" "3, and then you reach an elbow at 3, and then the distortion goes down very slowly after that." "And then it looks like, you know what, maybe using three clusters is the right number of clusters, because that's the elbow of this curve, right?" "That it goes down, distortion goes down rapidly until K equals" "3, really goes down very slowly after that." "So let's pick K equals 3." "If you apply the Elbow Method, and if you get a plot that actually looks like this, then, that's pretty good, and this would be a reasonable way of choosing the number of clusters." "It turns out the Elbow Method isn't used that often, and one reason is that, if you actually use this on a clustering problem, it turns out that fairly often, you know, you end up with a curve" "that looks much more ambiguous, maybe something like this." "And if you look at this," "I don't know, maybe there's no clear elbow, but it looks like distortion continuously goes down, maybe 3 is a good number, maybe 4 is a good number, maybe 5 is also not bad." "And so, if you actually do this in a practice, you know, if your plot looks like the one on the left and that's great." "It gives you a clear answer, but just as often, you end up with a plot that looks like the one on the right and is not clear where the ready location of the elbow is." "It makes it harder to choose a number of clusters using this method." 
"So maybe the quick summary of the Elbow Method is that is worth the shot but I wouldn't necessarily," "you know, have a very high expectation of it working for any particular problem." "Finally, here's one other way of how, thinking about how you choose the value of K, very often people are running" "K-means in order you get clusters for some later purpose, or for some sort of downstream purpose." "Maybe you want to use K-means in order to do market segmentation, like in the T-shirt sizing example that we talked about." "Maybe you want K-means to organize a computer cluster better, or maybe a learning cluster for some different purpose, and so, if that later, downstream purpose, such as market segmentation." "If that gives you an evaluation metric, then often, a better way to determine the number of clusters, is to see how well different numbers of clusters serve that later downstream purpose." "Let me step through a specific example." "Let me go through the T-shirt size example again, and I'm trying to decide, do I want three T-shirt sizes?" "So, I choose K equals 3, then" "I might have small, medium and large T-shirts." "Or maybe, I want to choose" "K equals 5, and then I might have, you know, extra small, small, medium, large and extra large T-shirt sizes." "So, you can have like 3 T-shirt sizes or four or five T-shirt sizes." "We could also have four T-shirt sizes, but I'm just showing three and five here, just to simplify this slide for now." "So, if I run K-means with" "K equals 3, maybe I end up with, that's my small" "and that's my medium and that's my large." "Whereas, if I run K-means with 5 clusters, maybe I end up with, those are my extra small T-shirts, these are my small, these are my medium, these are my large and these are my extra large." "And the nice thing about this example is that, this then maybe gives us another way to choose whether we want" "3 or 4 or 5 clusters, and in particular, what you can do is, you know, think about this from the perspective of the T-shirt business and ask: "Well if I have five segments, then how well" "will my T-shirts fit my customers and so, how many T-shirts can I sell?" "How happy will my customers be?"" "What really makes sense, from the perspective of the T-shirt business, in terms of whether, I want to have Goer T-shirt sizes so that my T-shirts fit my customers better." "Or do I want to have fewer" "T-shirt sizes so that" "I make fewer sizes of T-shirts." "And I can sell them to the customers more cheaply." "And so, the t-shirt selling business, that might give you a way to decide, between three clusters versus five clusters." "So, that gives you an example of how a later downstream purpose like the problem of deciding what" "T-shirts to manufacture, how that can give you an evaluation metric for choosing the number of clusters." "For those of you that are doing the program exercises, if you look at this week's program exercise associative K-means, that's an example there of using K-means for image compression." "And so if you were trying to choose how many clusters to use for that problem, you could also, again use the evaluation metric of image compression to choose the number of clusters, K?" "So, how good do you want the image to look versus, how much do you want to compress the file size of the image, and, you know, if you do the programming exercise, what I've just" "said will make more sense at that time." 
"So, just summarize, for the most part, the number of customers K is still chosen by hand by human input or human insight." "One way to try to do so is to use the Elbow Method, but I wouldn't always expect that to work well, but I think the better way to think about how to choose the number of clusters is to ask, for" "what purpose are you running K-means?" "And then to think, what is the number of clusters K that serves that, you know, whatever later purpose that you actually run the K-means for." "In this video, I'd like to start talking about a second type of unsupervised learning problem called dimensionality reduction." "There are a couple of different reasons why one might want to do dimensionality reduction." "One is data compression, and as we'll see later, a few videos later, data compression not only allows us to compress the data and have it therefore use up less computer memory or disk space, but it will" "also allow us to speed up our learning algorithms." "But first, let's start by talking about what is dimensionality reduction." "As a motivating example, let's say that we've collected a data set with many, many, many features, and I've plotted just two of them here." "And let's say that unknown to us two of the features were actually the length of something in centimeters, and a different feature, x2, is the length of the same thing in inches." "So, this gives us a highly redundant representation and maybe instead of having two separate features x1 then x2, both of which basically measure the length, maybe what we want to do is reduce the data to one-dimensional and" "just have one number measuring this length." "In case this example seems a bit contrived, this centimeter and inches example is actually not that unrealistic, and not that different from things that I see happening in industry." "If you have hundreds or thousands of features, it is often this easy to lose track of exactly what features you have." "And sometimes may have a few different engineering teams, maybe one engineering team gives you two hundred features, a second engineering team gives you another three hundred features, and a third engineering team gives you five hundred features so you have" "a thousand features all together, and it actually becomes hard to keep track of you know, exactly which features you got from which team, and it's actually not that want to have highly redundant features like these." "And so if the length in centimeters were rounded off to the nearest centimeter and lengthened inches was rounded off to the nearest inch." "Then, that's why these examples don't lie perfectly on a straight line, because of, you know, round-off error to the nearest centimeter or the nearest inch." "And if we can reduce the data to one dimension instead of two dimensions, that reduces the redundancy." "For a different example, again maybe when there seems fairly less contrives." "For may years I've been working with autonomous helicopter pilots." "Or I've been working with pilots that fly helicopters." "And so." "If you were to measure--if you were to, you know, do a survey or do a test of these different pilots--you might have one feature, x1, which is maybe the skill of these helicopter pilots, and maybe" ""x2" could be the pilot enjoyment." "That is, you know, how much they enjoy flying, and maybe these two features will be highly correlated." "And what you really care about might be this sort of" "this sort of, this direction, a different feature that really measures pilot aptitude." 
"And I'm making up the name aptitude of course, but again, if you highly correlated features, maybe you really want to reduce the dimension." "So, let me say a little bit more about what it really means to reduce the dimension of the data from" "2 dimensions down from 2D to 1 dimensional or to 1D." "Let me color in these examples by using different" "colors." "And in this case by reducing the dimension what" "I mean is that I would like to find maybe this line, this, you know, direction on which most of the data seems to lie and project all the data onto that line which is true, and by doing" "so, what I can do is just measure the position of each of the examples on that line." "And what I can do is come up with a new feature, z1," "and to specify the position on the line I need only one number, so it says z1 is a new feature that specifies the location of each of those points on this green line." "And what this means, is that where as previously if i had an example x1, maybe this was my first example, x1." "So in order to represent x1 originally x1." "I needed a two dimensional number, or a two dimensional feature vector." "Instead now I can represent" "z1." "I could use just z1 to represent my first example, and that's going to be a real number." "And similarly x2 you know, if x2 is my second example there," "then previously, whereas this required two numbers to represent if I instead compute the projection of that black cross onto the line." "And now I only need one real number which is z2 to represent the location of this point z2 on the line." "And so on through my M examples." "So, just to summarize, if we allow ourselves to approximate" "the original data set by projecting all of my original examples onto this green line over here, then I need only one number, I need only real number to specify the position of a point on the line," "and so what I can do is therefore use just one number to represent the location of each of my training examples after they've been projected onto that green line." "So this is an approximation to the original training self because" "I have projected all of my training examples onto a line." "But now, I need to keep around only one number for each of my examples." "And so this halves the memory requirement, or a space requirement, or what have you, for how to store my data." "And perhaps more interestingly, more importantly, what we'll see later, in the later video as well is that this will allow us to make our learning algorithms run more quickly as well." "And that is actually, perhaps, even the more interesting application of this data compression rather than reducing the memory or disk space requirement for storing the data." "On the previous slide we showed an example of reducing data from 2D to 1D." "On this slide, I'm going to show another example of reducing data from three dimensional 3D to two dimensional 2D." "By the way, in the more typical example of dimensionality reduction we might have a thousand dimensional data or 1000D data that we might want to reduce to let's say a hundred dimensional or" "100D, but because of the limitations of what I can plot on the slide." "I'm going to use examples of 3D to 2D, or 2D to 1D." "So, let's have a data set like that shown here." "And so, I would have a set of examples x(i) which are points in r3." "So, I have three dimension examples." "I know it might be a little bit hard to see this on the slide, but I'll show a 3D point cloud in a little bit." 
"And it might be hard to see here, but all of this data maybe lies roughly on the plane, like so." "And so what we can do with dimensionality reduction, is take all of this data and project the data down onto a two dimensional plane." "So, here what I've done is," "I've taken all the data and I've projected all of the data, so that it all lies on the plane." "Now, finally, in order to specify the location of a point within a plane, we need two numbers, right?" "We need to, maybe, specify the location of a point along this axis, and then also specify it's location along that axis." "So, we need two numbers, maybe called z1 and z2 to specify the location of a point within a plane." "And so, what that means, is that we can now represent each example, each training example, using two numbers that" "I've drawn here, z1, and z2." "So, our data can be represented using vector z which are in r2." "And these subscript, z subscript 1, z subscript 2, what" "I just mean by that is that my vectors here, z, you know, are two dimensional vectors, z1, z2." "And so if I have some particular examples, z(i), or that's the two dimensional vector, z(i)1, z(i)2." "And on the previous slide when" "I was reducing data to one dimensional data then I had only z1, right?" "And that is what a z1 subscript 1 on the previous slide was, but here I have two dimensional data, so I have z1 and z2 as the two components of the data." "Now, let me just make sure that these figures make sense." "So let me just reshow these exact three figures again but with 3D plots." "So the process we went through was that shown in the lab is the optimal data set, in the middle the data set projects on the 2D, and on the right the 2D data sets with z1 and z2 as the axis." "Let's look at them a little bit further." "Here's my original data set, shown on the left, and so I had started off with a 3D point cloud like so, where the axis are labeled x1, x2, x3, and so there's a 3D" "point but most of the data, maybe roughly lies on some, you know, not too far from some 2D plain." "So, what we can do is take this data and here's my middle figure." "I'm going to project it onto 2D." "So, I've projected this data so that all of it now lies on this 2D surface." "As you can see all the data lies on a plane, 'cause we've projected everything onto a plane, and so what this means is that now I need only two numbers, z1 and z2, to represent" "the location of point on the plane." "And so that's the process that we can go through to reduce our data from three dimensional to two dimensional." "So that's dimensionality reduction and how we can use it to compress our data." "And as we'll see later this will allow us to make some of our learning algorithms run much later as well, but we'll get to that only in a later video." "In the last video, we talked about dimensionality reduction for the purpose of compressing the data." "In this video, I'd like to tell you about a second application of dimensionality reduction and that is to visualize the data." "For a lot of machine learning applications, it really helps us to develop effective learning algorithms, if we can understand our data better." "If there is some way of visualizing the data better, and so, dimensionality reduction offers us, often, another useful tool to do so." "Let's start with an example." "Let's say we've collected a large data set of many statistics and facts about different countries around the world." 
"So, maybe the first feature, X1 is the country's GDP, or the" "Gross Domestic Product, and" "X2 is a per capita, meaning the per person GDP, X3 human development index, life expectancy, X5, X6 and so on." "And we may have a huge data set like this, where, you know, maybe 50 features for every country, and we have a huge set of countries." "So is there something we can do to try to understand our data better?" "I've given this huge table of numbers." "How do you visualize this data?" "If you have 50 features, it's very difficult to plot 50-dimensional" "data." "What is a good way to examine this data?" "Using dimensionality reduction, what we can do is, instead of having each country represented by this featured vector, xi, which is 50-dimensional, so instead of, say, having a country like Canada, instead of" "having 50 numbers to represent the features of Canada, let's say we can come up with a different feature representation that is these z vectors, that is in R2." "If that's the case, if we can have just a pair of numbers, z1 and z2 that somehow, summarizes my 50 numbers, maybe what we can do [xx] is to plot these countries in R2 and" "use that to try to understand the space in [xx] of features of different countries [xx] the better and so, here, what you can do is reduce the data from 50" "D, from 50 dimensions to 2D, so you can plot this as a 2 dimensional plot, and, when you do that, it turns out that, if you look at the output of the Dimensionality Reduction algorithms," "It usually doesn't astride a physical meaning to these new features you want [xx] to." "It's often up to us to figure out you know, roughly what these features means." "But, And if you plot those features, here is what you might find." "So, here, every country is represented by a point" "ZI, which is an R2 and so each of those." "Dots, and this figure represents a country, and so, here's Z1 and here's" "Z2, and [xx] [xx] of these." "So, you might find, for example, That the horizontial axis the Z1 axis" "corresponds roughly to the overall country size, or the overall economic activity of a country." "So the overall GDP, overall economic size of a country." "Whereas the vertical axis in our data might correspond to the per person GDP." "Or the per person well being, or the per person economic activity, and, you might find that, given these" "50 features, you know, these are really the 2 main dimensions of the deviation, and so, out here you may have a country like the U.S.A., which is a relatively large GDP, you know, is a very" "large GDP and a relatively high per-person GDP as well." "Whereas here you might have a country like Singapore, which actually has a very high per person GDP as well, but because Singapore is a much smaller country the overall" "economy size of Singapore is much smaller than the US." "And, over here, you would have countries where individuals" "are unfortunately some are less well off, maybe shorter life expectancy," "less health care, less economic maturity that's why smaller countries, whereas a point like this will correspond to a country that has a fair, has a substantial amount of economic activity, but where individuals tend to be somewhat less well off." "So you might find that the axes Z1 and Z2 can help you to most succinctly capture really what are the two main dimensions of the variations" "amongst different countries." 
"Such as the overall economic activity of the country projected by the size of the country's overall economy as well as the per-person individual well-being, measured by per-person" "GDP, per-person healthcare, and things like that." "So that's how you can use dimensionality reduction, in order to reduce data from" "50 dimensions or whatever, down to two dimensions, or maybe down to three dimensions, so that you can plot it and understand your data better." "In the next video, we'll start to develop a specific algorithm, called PCA, or Principal Component" "Analysis, which will allow us to do this and also" "do the earlier application I talked about of compressing the data." "For the problem of dimensionality reduction, by far the most popular, by far the most commonly used algorithm is something called principal components analysis or PCA." "It In this video, I would like to start talking about the problem formulation for PCA, in other words let us try to formulate precisely exactly what we would like PCA to do." "Let's say we have a dataset like this, so this is a dataset of example X in R2, and let's say I want to reduce the dimension of the data from two dimensional to one dimensional." "In other words I would like to find a line onto which to project the data." "So what seems like a good line onto which to project the data?" "A line like this might be a pretty good choice." "And the reason you think this might be a good choice is that if you look at where the projected versions of the points goes." "I'm gonna take this point and project it down here and get that." "This point gets projected here, to here, to here, to here what we find is that the distance between each point and the projected version is pretty small." "That is, these blue line segments are pretty short." "So what PCA does, formally, is it tries to find a lower-dimensional surface really a line in this case, onto which to project the data, so that the sum of squares of these little blue line segments is minimized." "The length of those blue line segments, that's sometimes also called the projection error, and so what PCA does is it tries to find the surface onto which to project the data so as to minimize that." "As an a side, before applying PCA it's standard practice to first perform mean normalization and feature scaling so that the features, X1 and X2, should have zero mean and should have comparable ranges of values." "I've already done this for this example, but I'll come back to this later and talk more about feature scaling and mean normalization in the context of PCA later." "Coming back to this example, in contrast to the red lines that I just drew here's a different line onto which I could project my data." "This magenta line." "And as you can see, you know, this magenta line is a much worse direction onto which to project my data, right?" "So if I were to project my data onto the magenta line, like the other set of points like that." "And the projection errors, that is these blue line segments would be huge." "So these points have to move a huge distance in order to get onto the, in order to get projected onto the magenta line." "And so that's why PCA principle component analysis would choose something like the red line rather than like the magenta line down here." "Let's write out the PCA problem a little more formally." 
"The goal of PCA if we want to reduce data from two- dimensional to one-dimensional is we're going to try to find a vector, that is a vector Ui which is going to be in Rn, so that would be in R2 in this case," "going to find direction onto which to project the data so as to minimize the projection error." "So in this example I'm hoping that PCA will find this vector, which I'm going to call U1, so that when I project the data onto" "the line that I defined by extending out this vector," "I end up with pretty small reconstruction errors and reference data looks like this." "And by the way," "I should mention that whether PCA gives me U1 or negative U1, it doesn't matter." "So if it gives me a positive vector in this direction that's fine, if it gives me, sort of the opposite vector facing in the opposite direction, so that would be" "U1, draw that in blue instead, whether it gives me positive" "U1 negative U1, it doesn't matter, because each of these vectors defines the same red line onto which" "I'm projecting my data." "So this is a case of reducing data from 2 dimensional to 1 dimensional." "In the more general case we have N dimensional data and we want to reduce it K dimensions." "In that case, we want to find not just a single vector onto which to project the data but we want to find K dimensions onto which to project the data." "So as to minimize this projection error." "So here's an example." "If I have a 3D point cloud like this then maybe what I want to do is find" "vectors, sorry find a pair of vectors, and I'm gonna call these vectors lets draw these in red, I'm going to find a pair of vectors, extending for the origin here's" "U1, and here's my second vector U2, and together these two vectors define a plane, or they define a 2D surface," "kind of like this, sort of, 2D surface, onto which I'm going to project my data." "For those of you that are familiar with linear algebra, for those of you that are really experts in linear algebra, the formal definition of this is that we're going to find a set of vectors, U1 U2 maybe up" "to Uk, and what we're going to do is project the data onto the linear subspace spanned by this set of k vectors." "But if you are not familiar with linear algebra, just think of it as finding k directions instead of just one direction, onto which to project the data." "So, finding a k-dimensional surface, really finding a 2D plane in this case, shown in this figure, where we can" "define the position of the points in the plane using k directions." "That's why for PCA, we want to find k vectors onto which to project the data." "And so, more formally, in" "PCA what we want to do is find this way to project the data so as to minimize the, sort of, projection distance, which is distance between points and projections." "In this so 3D example, too, given a point we would take the point and project it onto this 2D surface." "When you're done with that, and so the projection error would be, you know, the distance between the point and where it gets projected down." "to my 2D surface, and so what PCA does is it'll try to find a line or plane or whatever onto which to project the data, to try to minimize that squared projection, that 90 degree, or that orthogonal projection error." "Finally, one question that I sometimes get asked is how does PCA relate to linear regression, because when explaining" "PCA I sometimes end up drawing diagrams like these and that looks a little bit like linear regression." 
"It turns out PCA is not linear regression, and despite some cosmetic similarity these are actually totally different algorithms." "If we were doing linear regression what we would do would be, on the left we would be trying to predict the value of some variable y given some input features x." "And so linear regression, what we're doing is we're fitting a straight line" "so as to minimize the squared error between a point and the straight line." "And so what we'd be minimizing would be the squared magnitude of these blue lines." "And notice I'm drawing these blue lines vertically, that they are the vertical distance between a point and the value predicted by the hypothesis, whereas in contrast, in PCA, what it does is it tries to" "minimize the magnitude of these blue lines, which are drawn at an angle, these are really the shortest orthogonal distances, the shortest distance between the point X and this red line, and this gives very different effects, depending" "on the data set." "And more generally generally and more generally when you're doing linear regression there is this distinguished variable y that we're trying to predict, all that linear regression is about is taking all the values of X and try to use" "that to predict Y. Whereas in PCA, there is no distinguished or there is no special variable Y that we're trying to predict, and instead we have a list of features x1, x2, and so on" "up to xN, and all of these features are treated equally." "So, no one of them is special." "As one last example, if I have three-dimensional data, and" "I want to reduce data from 3D to 2D, so maybe" "I want to find two directions, you know, u1, and u2, onto which to project my data, then what I have is I have three features, x1, x2, x3, and all of these are treated alike." "All of these are treated symmetrically and there is no special variable y that I'm trying to predict." "And so PCA is not linear regression, and even though at some cosmetic level they might look related, these are actually very different algorithms." "So, hopefully you now understand what" "PCA is doing: it's trying to find a lower dimensional surface onto which to project the data so as to minimize this squared projection error, to minimize the squared distance between each point and the location of where it gets projected." "In the next video we'll start to talk about how to actually find this lower dimensional surface onto which to project the data." "In this video I'd like to tell you about the principle components analysis algorithm." "And by the end of this video you know to implement PCA for yourself." "And use it reduce the dimension of your data." "Before applying PCA, there is a data pre-processing step which you should always do." "Given the trading sets of the examples is important to always perform mean normalization," "and then depending on your data, maybe perform feature scaling as well." "this is very similar to the mean normalization and feature scaling process that we have for supervised learning." "In fact it's exactly the same procedure except that we're doing it now to our unlabeled" "data, X1 through Xm." "So for mean normalization we first compute the mean of each feature and then we replace each feature, X, with X minus its mean, and so this makes each feature now have exactly zero mean" "The different features have very different scales." 
"So for example, if x1 is the size of a house, and x2 is the number of bedrooms, to use our earlier example, we then also scale each feature to have a comparable range of values." "And so, similar to what we had with supervised learning, we would take x, i substitute j, that's the j feature" "and so we would subtract of the mean, now that's what we have on top, and then divide by sj." "Here, sj is some measure of the beta values of feature j." "So, it could be the max minus min value, or more commonly, it is the standard deviation of feature j." "Having done this sort of data pre-processing, here's what the PCA algorithm does." "We saw from the previous video that what PCA does is, it tries to find a lower dimensional sub-space onto which to project the data, so as to minimize the squared projection errors, sum of the squared projection errors, as the" "square of the length of those blue lines that and so what we wanted to do specifically is find a vector, u1, which specifies that direction or in the 2D case we want to find two vectors, u1 and" "u2, to define this surface onto which to project the data." "So, just as a quick reminder of what reducing the dimension of the data means, for this example on the left we were given the examples xI, which are in r2." "And what we like to do is find a set of numbers zI in r push to represent our data." "So that's what from reduction from 2D to 1D means." "So specifically by projecting data onto this red line there." "We need only one number to specify the position of the points on the line." "So i'm going to call that number z or z1." "Z here [xx] real number, so that's like a one dimensional vector." "So z1 just refers to the first component of this, you know, one by one matrix, or this one dimensional vector." "And so we need only one number to specify the position of a point." "So if this example here was my example" "X1, then maybe that gets mapped here." "And if this example was X2 maybe that example gets mapped" "And so this point here will be Z1 and this point here will be" "Z2, and similarly we would have those other points for These, maybe X3," "X4, X5 get mapped to Z1, Z2, Z3." "So What PCA has to do is we need to come up with a way to compute two things." "One is to compute these vectors, u1, and in this case u1 and u2." "And the other is how do we compute these numbers," "Z. So on the example on the left we're reducing the data from 2D to 1D." "In the example on the right, we would be reducing data from" "3 dimensional as in r3, to zi, which is now two dimensional." "So these z vectors would now be two dimensional." "So it would be z1 z2 like so, and so we need to give away to compute these new representations, the z1 and z2 of the data as well." "So how do you compute all of these quantities?" "It turns out that a mathematical derivation, also the mathematical proof, for what is the right value U1, U2, Z1," "Z2, and so on." "That mathematical proof is very complicated and beyond the scope of the course." "But once you've done [xx] it turns out that the procedure to actually find the value of u1 that you want is not that hard, even though so that the mathematical proof that this value is the correct" "value is someone more involved and more than i want to get into." "But let me just describe the specific procedure that you have to implement in order to compute all of these things, the vectors, u1, u2," "the vector z." "Here's the procedure." 
"Let's say we want to reduce the data to n dimensions to k dimension What we're going to do is first compute something called the covariance matrix, and the covariance matrix is commonly denoted by this Greek alphabet which is" "the capital Greek alphabet sigma." "It's a bit unfortunate that the" "Greek alphabet sigma looks exactly like the summation symbols." "So this is the" "Greek alphabet Sigma is used to denote a matrix and this here is a summation symbol." "So hopefully in these slides there won't be ambiguity about which is Sigma Matrix, the matrix, which is a summation symbol, and hopefully it will be clear from context when" "I'm using each one." "How do you compute this matrix let's say we want to store it in an octave variable called sigma." "What we need to do is compute something called the eigenvectors of the matrix sigma." "And an octave, the way you do that is you use this command, u s v equals s v d of sigma." "SVD, by the way, stands for singular value decomposition." "This is a Much more advanced single value composition." "It is much more advanced linear algebra than you actually need to know but now It turns out that when sigma is equal to matrix there is a few ways to compute these are high in vectors and If you are an expert in linear algebra" "and if you've heard of high in vectors before you may know that there is another octet function called I, which can also be used to compute the same thing." "and It turns out that the" "SVD function and the" "I function it will give you the same vectors, although SVD is a little more numerically stable." "So I tend to use SVD, although" "I have a few friends that use the I function to do this as wellbut when you apply this to a covariance matrix sigma it gives you the same thing." "This is because the covariance matrix always satisfies a mathematical" "Property called symmetric positive definite" "You really don't need to know what that means, but the SVD" "and I-functions are different functions but when they are applied to a covariance matrix which can be proved to always satisfy this" "mathematical property; they'll always give you the same thing." "Okay, that was probably much more linear algebra than you needed to know." "In case none of that made sense, don't worry about it." "All you need to know is that this system command you should implement in Octave." "And if you're implementing this in a different language than Octave or MATLAB, what you should do is find the numerical linear algebra library that can compute the SVD or singular value decomposition, and there are many such libraries for" "probably all of the major programming languages." "People can use that to compute the matrices u, s, and d of the covariance matrix sigma." "So just to fill in some more details, this covariance matrix sigma will be an n by n matrix." "And one way to see that is if you look at the definition" "this is an n by 1 vector and this here I transpose is" "1 by N so the product of these two things is going to be an N by N matrix." "1xN transfers, 1xN, so there's an NxN matrix and when we add up all of these you still have an NxN matrix." "And what the SVD outputs three matrices, u, s, and v. The thing you really need out of the SVD is the u matrix." "The u matrix will also be a NxN matrix." "And if we look at the columns of the U matrix it turns out that the columns" "of the U matrix will be exactly those vectors, u1, u2 and so on." "So u, will be matrix." 
"And if we want to reduce the data from n dimensions down to k dimensions, then what we need to do is take the first k vectors." "that gives us u1 up to uk which gives us the K direction onto which we want to project the data." "the rest of the procedure from this SVD numerical linear algebra routine we get this matrix u." "We'll call these columns u1-uN." "So, just to wrap up the description of the rest of the procedure, from the SVD numerical linear algebra routine we get these matrices u, s, and d. we're going to use the first K columns" "of this matrix to get u1-uK." "Now the other thing we need to is take my original data set, X which is an RN And find a lower dimensional representation Z, which is a R K for this data." "So the way we're going to do that is take the first K Columns of the U matrix." "Construct this matrix." "Stack up U1, U2 and so on up to U K in columns." "It's really basically taking, you know, this part of the matrix, the first K columns of this matrix." "And so this is going to be an N by K matrix." "I'm going to give this matrix a name." "I'm going to call this matrix" "U, subscript "reduce," sort of a reduced version of the U matrix maybe." "I'm going to use it to reduce the dimension of my data." "And the way I'm going to compute Z is going to let Z be equal to this" "U reduce matrix transpose times" "X. Or alternatively, you know, to write down what this transpose means." "When I take this transpose of this U matrix, what I'm going to end up with is these vectors now in rows." "I have U1 transpose down to UK transpose." "Then take that times X, and that's how I get my vector Z. Just to make sure that these dimensions make sense," "this matrix here is going to be k by n and x here is going to be n by 1 and so the product here will be k by 1." "And so z is k dimensional, is a k dimensional vector, which is exactly what we wanted." "And of course these x's here right, can be Examples in our training set can be examples in our cross validation set, can be examples in our test set, and for example if you know," "I wanted to take training example i," "I can write this as xi" "XI and that's what will give me ZI over there." "So, to summarize, here's the" "PCA algorithm on one slide." "After mean normalization, to ensure that every feature is zero mean and optional feature scaling whichYou really should do feature scaling if your features take on very different ranges of values." "After this pre-processing we compute the carrier matrix Sigma like so by the way if your data is given as a matrix like hits if you have your data Given in rows like this." "If you have a matrix X which is your time trading sets written in rows where x1 transpose down to x1 transpose," "this covariance matrix sigma actually has a nice vectorizing implementation." "You can implement in octave, you can even run sigma equals 1 over m, times x, which is this matrix up here, transpose times x and this simple expression, that's the vectorize implementation of how" "to compute the matrix sigma." "I'm not going to prove that today." "This is the correct vectorization whether you want, you can either numerically test this on yourself by trying out an octave and making sure that both this and this implementations give the same answers or you Can try to prove it yourself mathematically." "Either way but this is the correct vectorizing implementation, without compusingnext" "we can apply the SVD routine to get u, s, and d." 
"And then we grab the first k columns of the u matrix you reduce and finally this defines how we go from a feature vector x to this reduce dimension representation z." "And similar to k Means if you're apply PCA, they way you'd apply this is with vectors X and RN." "So, this is not done with X-0 1." "So that was the PCA algorithm." "One thing I didn't do is give a mathematical proof that this There it actually give the projection of the data onto the K dimensional subspace onto the" "K dimensional surface that actually minimizes the square projection error Proof of that is beyond the scope of this course." "Fortunately the PCA algorithm can be implemented in not too many lines of code." "and if you implement this in octave or algorithm, you actually get a very effective dimensionality reduction algorithm." "So, that was the PCA algorithm." "One thing I didn't do was give a mathematical proof that the U1 and U2 and so on and the Z and so on you get out of this procedure is really the choices that would minimize these squared projection error." "Right, remember we said What" "PCA tries to do is try to find a surface or line onto which to project the data so as to minimize to square projection error." "So I didn't prove that this that, and the mathematical proof of that is beyond the scope of this course." "But fortunately the PCA algorithm can be implemented in not too many lines of octave code." "And if you implement this, this is actually what will work, or this will work well, and if you implement this algorithm, you get a very effective dimensionality reduction algorithm." "That does do the right thing of minimizing this square projection error." "In the PCA algorithm we take" "N dimensional features and reduce them to some K dimensional feature representation." "This number K is a parameter of the PCA algorithm." "This number K is also called the number of principle components or the number of principle components that we've retained." "And in this video I'd like to give you some guidelines, tell you about how people tend to think about how to choose this parameter K for PCA." "In order to choose k, that is to choose the number of principal components, here are a couple of useful concepts." "What PCA tries to do is it tries to minimize" "the average squared projection error." "So it tries to minimize this quantity, which I'm writing down, which is the difference between the original data X and the projected version, X-approx-i, which was defined last video, so it tries to minimize the squared" "distance between x and it's projection onto that lower dimensional surface." "So that's the average square projection error." "Also let me define the total variation in the data to be the average length squared of these examples Xi so the total variation in the data is the average of my training sets of the length of each of my training examples." "And this one says, "On average, how far are my training examples from the vector, from just being all zeros?"" "How far is, how far on average are my training examples from the origin?" "When we're trying to choose k, a pretty common rule of thumb for choosing k is to choose the smaller values so that the ratio between these is less than 0.01." "So in other words, a pretty common way to think about how we choose k is we want the average squared projection error." "That is the average distance between x and it's projections" "divided by the total variation of the data." "That is how much the data varies." 
"We want this ratio to be less than, let's say, 0.01." "Or to be less than 1%, which is another way of thinking about it." "And the way most people think about choosing K is rather than choosing K directly the way most people talk about it is as what this number is, whether it is 0.01 or some other number." "And if it is 0.01, another way to say this to use the language of PCA is that 99% of the variance is retained." "I don't really want to, don't worry about what this phrase really means technically but this phrase "99% of variance is retained" just means that this quantity on the left is less than 0.01." "And so, if you are using PCA and if you want to tell someone, you know, how many principle components you've retained it would be more common to say well, I chose k so that 99% of the variance was retained." "And that's kind of a useful thing to know, it means that you know, the average squared projection error divided by the total variation that was at most 1%." "That's kind of an insightful thing to think about, whereas if you tell someone that, "Well I had to 100 principle components" or "k was equal to 100 in a thousand dimensional data" it's a little" "hard for people to interpret that." "So this number 0.01 is what people often use." "Other common values is 0.05, and so this would be 5%, and if you do that then you go and say well 95% of the variance is retained and, you know other numbers maybe 90% of the variance is" "retained, maybe as low as 85%." "So 90% would correspond to say" "0.10, kinda 10%." "And so range of values from, you know, 90, 95," "99, maybe as low as 85% of the variables contained would be a fairly typical range in values." "Maybe 95 to 99 is really the most common range of values that people use." "For many data sets you'd be surprised, in order to retain" "99% of the variance, you can often reduce the dimension of the data significantly and still retain most of the variance." "Because for most real life data says many features are just highly correlated, and so it turns out to be possible to compress the data a lot and still retain you know 99% of the variance or 95% of the variance." "So how do you implement this?" "Well, here's one algorithm that you might use." "You may start off, if you want to choose the value of k, we might start off with k equals 1." "And then we run through PCA." "You know, so we compute, you reduce, compute z1, z2, up to zm." "Compute all of those x1 approx and so on up to xm approx and then we check if 99% of the variance is retained." "Then we're good and we use k equals 1." "But if it isn't then what we'll do we'll next try K equals 2." "And then we'll again run through this entire procedure and check, you know is this expression satisfied." "Is this less than 0.01." "And if not then we do this again." "Let's try k equals 3, then try k equals 4, and so on until maybe we get up to k equals" "17 and we find 99% of the data have is retained and then" "we use k equals 17, right?" "That is one way to choose the smallest value of k, so that and 99% of the variance is retained." "But as you can imagine, this procedure seems horribly inefficient" "we're trying k equals one, k equals two, we're doing all these calculations." "Fortunately when you implement" "PCA it actually, in this step, it actually gives us a quantity that makes it much easier to compute these things as well." 
"Specifically when you're calling" "SVD to get these matrices u, s, and d, when you're calling usvd on the covariance matrix sigma, it also gives us back this matrix" "S and what" "S is, is going to be a square matrix an N by N matrix in fact, that is diagonal." "So is diagonal entries s one one, s two two, s three three down to s n n are going to be the only non-zero elements of this matrix, and everything off the diagonals is going to be zero." "Okay?" "So those big O's that I'm drawing, by that what I mean is that everything off the diagonal of this matrix all of those entries there are going to be zeros." "And so, what is possible to show, and I won't prove this here, and it turns out that for a given value of k, this quantity over here can be computed much more simply." "And that quantity can be computed as one minus sum from i equals 1 through" "K of Sii divided by sum from I equals 1 through N of Sii." "So just to say that it words, or just to take another view of how to explain that, if K equals 3 let's say." "What we're going to do to compute the numerator is sum from one" " I equals 1 through 3 of of Sii, so just compute the sum of these first three elements." "So that's the numerator." "And then for the denominator, well that's the sum of all of these diagonal entries." "And one minus the ratio of that, that gives me this quantity over here, that I've circled in blue." "And so, what we can do is just test if this is less than or equal to 0.01." "Or equivalently, we can test if the sum from i equals 1 through k, s-i-i divided by sum from i equals 1 through n, s-i-i if this is greater than or equal to 4.99, if you want to be sure that 99% of the variance is retained." "And so what you can do is just slowly increase k, set k equals one, set k equals two, set k equals three and so on, and just test this quantity" "to see what is the smallest value of k that ensures that 99% of the variance is retained." "And if you do this, then you need to call the SVD function only once." "Because that gives you the" "S matrix and once you have the S matrix, you can then just keep on doing this calculation by increasing the value of K in the numerator and so you don't need keep to calling SVD over and over again to test out the different values of K." "So this procedure is much more efficient, and this can allow you to select the value of K without needing to run PCA from scratch over and over." "You just run SVD once, this gives you all of these diagonal numbers, all of these numbers S11, S22 down to SNN, and then you can just you know, vary K in this expression to find" "the smallest value of K, so that 99% of the variance is retained." "So to summarize, the way that I often use, the way that I often choose K when I am using PCA for compression is I would call SVD once in the covariance matrix, and then" "I would use this formula and pick the smallest value of" "K for which this expression is satisfied." "And by the way, even if you were to pick some different value of K, even if you were to pick the value of K manually, you know maybe you have a thousand dimensional data and I just want to choose K equals one hundred." "Then, if you want to explain to others what you just did, a good way to explain the performance of your implementation of PCA to them, is actually to take this quantity and compute what this is, and that will" "tell you what was the percentage of variance retained." 
"And if you report that number, then, you know, people that are familiar with PCA, and people can use this to get a good understanding of how well your hundred dimensional representation is approximating your original data" "set, because there's 99% of variance retained." "That's really a measure of your square of construction error, that ratio being 0.01, just gives people a good intuitive sense of whether your implementation of PCA is finding a good approximation of your original data set." "So hopefully, that gives you an efficient procedure for choosing the number K. For choosing what dimension to reduce your data to, and if you apply PCA to very high dimensional data sets, you know, to like" "a thousand dimensional data, very often, just because data sets tend to have highly correlated features, this is just a property of most of the data sets you see," "you often find that PCA will be able to retain ninety nine per cent of the variance or say, ninety five ninety nine, some high fraction of the variance, even while compressing the data by a very large factor." "In some of the earlier videos," "I was talking about PCA as a compression algorithm where you may have say, a thousand dimensional data and compress it to a hundred dimensional feature back there, or have three dimensional data and compress it to a two dimensional representation." "So, if this is a compression algorithm, there should be a way to go back from this compressed representation, back to an approximation of your original high dimensional data." "So, given z(i), which maybe a hundred dimensional, how do you go back to your original representation x(i), which was maybe a thousand dimensional?" "In this video, I'd like to describe how to do that." "In the PCA algorithm, we may have an example like this." "So maybe that's my example x1 and maybe that's my example x2." "And what we do is, we take these examples and we project them onto this one dimensional surface." "And then now we need to use only a real number, say z1, to specify the location of these points after they've been projected onto this one dimensional surface." "So" ", given a point like this, given a point z1, how can we go back to this original two-dimensional space?" "And in particular, given the point z, which is an r, can we map this back to some approximate representation x and r2 of whatever the original value of the data was?" "So, whereas z equals 0 reduced transverse x, if you want to go in the opposite direction, the equation for that is, we're going to write x approx equals" "U reduce times z." "Again, just to check the dimensions, here U reduce is going to be an n by k dimensional vector, z is going to be a k by 1 dimensional vector." "So, we multiply these out and that's going to be n by one." "So x approx is going to be an n dimensional vector." "And so the intent of PCA, that is, the square projection error is not too big, is that this x approx will be close to whatever was the original value of x that you had used to derive z in the first place." "To show a picture of what this looks like, this is what it looks like." "What you get back of this procedure are points that lie on the projection of that onto the green line." "So to take our early example, if we started off with this value of x1, and got this z1, if you plug z1 through this formula to get x1 approx, then this point here, that will be" "x1 approx, which is going to be r2 and so." "And similarly, if you do the same procedure, this will be x2 approx." 
"And you know, that's a pretty decent approximation to the original data." "So, that's how you go back from your low dimensional representation z back to an uncompressed representation of the data we get back an the approxiamation to your original data x, and we also call this process reconstruction" "of the original data." "When we think of trying to reconstruct the original value of x from the compressed representation." "So, given an unlabeled data set, you now know how to apply PCA and take your high dimensional features x and map it to this lower dimensional representation z, and from this video, hopefully you now also know how to take" "these low representation z and map the backup to an approximation of your original high dimensional data." "Now that you know how to implement in applying PCA, what we will like to do next is to talk about some of the mechanics of how to actually use PCA well, and in particular, in the next" "video, I like to talk about how to choose K, which is, how to choose the dimension of this reduced representation vector z." "In an earlier video, I had said that PCA can be sometimes used to speed up the running time of a learning algorithm." "In this video, I'd like to explain how to actually do that, and also say some, just try to give some advice about how to apply PCA." "Here's how you can use PCA to speed up a learning algorithm, and this supervised learning algorithm speed up is actually the most common use that I personally make of PCA." "Let's say you have a supervised learning problem, note this is a supervised learning problem with inputs" "X and labels Y, and let's say that your examples xi are very high dimensional." "So, lets say that your examples, xi are 10,000 dimensional feature vectors." "One example of that, would be, if you were doing some computer vision problem, where you have a 100x100 images, and so if you have 100x100, that's 10000 pixels, and so if xi are," "you know, feature vectors that contain your 10000 pixel intensity values, then you have 10000 dimensional feature vectors." "So with very high-dimensional feature vectors like this, running a learning algorithm can be slow, right?" "Just, if you feed 10,000 dimensional feature vectors into logistic regression, or a new network, or support vector machine or what have you, just because that's a lot of data, that's 10,000 numbers," "it can make your learning algorithm run more slowly." "Fortunately with PCA we'll be able to reduce the dimension of this data and so make our algorithms run more efficiently." "Here's how you do that." "We are going first check our labeled training set and extract just the inputs, we're just going to extract the X's and temporarily put aside the Y's." "So this will now give us an unlabelled training set x1 through xm which are maybe there's a ten thousand dimensional data, ten thousand dimensional examples we have." "So just extract the input vectors x1 through xm." "Then we're going to apply PCA and this will give me a reduced dimension representation of the data, so instead of" "10,000 dimensional feature vectors I now have maybe one thousand dimensional feature vectors." "So that's like a 10x savings." "So this gives me, if you will, a new training set." "So whereas previously I might have had an example x1, y1, my first training input, is now represented by z1." "And so we'll have a new sort of training example," "which is Z1 paired with y1." "And similarly Z2, Y2, and so on, up to ZM, YM." 
"Because my training examples are now represented with this much lower dimensional representation Z1, Z2, up to ZM." "Finally, I can take this reduced dimension training set and feed it to a learning algorithm maybe a neural network, maybe logistic regression, and I can learn the hypothesis H, that takes this input, these low-dimensional" "representations Z and tries to make predictions." "So if I were using logistic regression for example, I would train a hypothesis that outputs, you know, one over one plus E to the negative-theta transpose" "Z, that takes this input to one these z vectors, and tries to make a prediction." "And finally, if you have a new example, maybe a new test example X. What you do is you would take your test example x," "map it through the same mapping that was found by PCA to get you your corresponding z." "And that z then gets fed to this hypothesis, and this hypothesis then makes a prediction on your input x." "One final note, what PCA does is it defines a mapping from x to z and this mapping from x to z should be defined by running" "PCA only on the training sets." "And in particular, this mapping that" "PCA is learning, right, this mapping, what that does is it computes the set of parameters." "That's the feature scaling and mean normalization." "And there's also computing this matrix U reduced." "But all of these things that" "U reduce, that's like a parameter that is learned by PCA and we should be fitting our parameters only to our training sets and not to our cross validation or test sets and so these things the U reduced" "so on, that should be obtained by running PCA only on your training set." "And then having found U reduced, or having found the parameters for feature scaling where the mean normalization and scaling the scale that you divide the features by to get them on to comparable scales." "Having found all those parameters on the training set, you can then apply the same mapping to other examples that may be" "In your cross-validation sets or in your test sets, OK?" "Just to summarize, when you're running PCA, run your" "PCA only on the training set portion of the data not the cross-validation set or the test set portion of your data." "And that defines the mapping from x to z and you can then apply that mapping to your cross-validation set and your test set and by the way in this example I talked about reducing the data from ten thousand dimensional to one" "thousand dimensional, this is actually not that unrealistic." "For many problems we actually reduce the dimensional data." "You know by 5x maybe by 10x and still retain most of the variance and we can do this barely effecting the performance," "in terms of classification accuracy, let's say, barely affecting the classification accuracy of the learning algorithm." "And by working with lower dimensional data our learning algorithm can often run much much faster." "To summarize, we've so far talked about the following applications of PCA." "First is the compression application where we might do so to reduce the memory or the disk space needed to store data and we just talked about how to use this to speed up a learning algorithm." "In these applications, in order to choose K, often we'll do so according to, figuring out what is the percentage of variance retained, and so for this learning algorithm, speed up application often will retain 99% of the variance." "That would be a very typical choice for how to choose k." "So that's how you choose k for these compression applications." 
"Whereas for visualization applications while usually we know how to plot only two dimensional data or three dimensional data," "and so for visualization applications, we'll usually choose k equals 2 or k equals 3, because we can plot only 2D and 3D data sets." "So that summarizes the main applications of PCA, as well as how to choose the value of k for these different applications." "I should mention that there is often one frequent misuse of PCA and you sometimes hear about others doing this hopefully not too often." "I just want to mention this so that you know not to do it." "And there is one bad use of" "PCA, which iss to try to use it to prevent over-fitting." "Here's the reasoning." "This is not a great way to use PCA, but here's the reasoning behind this method, which is,you know if we have Xi, then maybe we'll have n features, but if we compress the data, and" "use Zi instead and that reduces the number of features to k, which could be much lower dimensional." "And so if we have a much smaller number of features, if k is 1,000 and n is" "10,000, then if we have only 1,000 dimensional data, maybe we're less likely to over-fit than if we were using 10,000-dimensional" "data with like a thousand features." "So some people think of PCA as a way to prevent over-fitting." "But just to emphasize this is a bad application of PCA and I do not recommend doing this." "And it's not that this method works badly." "If you want to use this method to reduce the dimensional data, to try to prevent over-fitting, it might actually work OK." "But this just is not a good way to address over-fitting and instead, if you're worried about over-fitting, there is a much better way to address it, to use regularization instead of using PCA to reduce the dimension of the data." "And the reason is, if you think about how PCA works, it does not use the labels y." "You are just looking at your inputs xi, and you're using that to find a lower-dimensional approximation to your data." "So what PCA does, is it throws away some information." "It throws away or reduces the dimension of your data without knowing what the values of y is, so this is probably okay using PCA this way is probably okay if, say" "99 percent of the variance is retained, if you're keeping most of the variance, but it might also throw away some valuable information." "And it turns out that if you're retaining 99% of the variance or 95% of the variance or whatever, it turns out that just using regularization will often give you at least as good a method for preventing over-fitting" "and regularization will often just work better, because when you are applying linear regression or logistic regression or some other method with regularization, well, this minimization problem actually knows what the values of y are, and so is less likely to throw" "away some valuable information, whereas" "PCA doesn't make use of the labels and is more likely to throw away valuable information." "So, to summarize, it is a good use of PCA, if your main motivation to speed up your learning algorithm, but using" "PCA to prevent over-fitting, that is not a good use of" "PCA, and using regularization instead is really what many people would recommend doing instead." "Finally, one last misuse of PCA." "And so I should say PCA is a very useful algorithm," "I often use it for the compression on the visualization purposes." "But, what I sometimes see, is also people sometimes use PCA where it shouldn't be." 
"So, here's a pretty common thing that" "I see, which is if someone is designing a machine-learning system, they may write down the plan like this:" "let's design a learning system." "Get a training set and then, you know, what I'm going to do is run PCA, then train logistic regression and then test on my test data." "So often at the very start of a project, someone will just write out a project plan than says lets do these four steps with PCA inside." "Before writing down a project plan the incorporates PCA like this, one very good question to ask is, well, what if we were to just do the whole without using PCA." "And often people do not consider this step before coming up with a complicated project plan and implementing PCA and so on." "And sometime, and so specifically, what I often advise people is, before you implement" "PCA, I would first suggest that, you know, do whatever it is, take whatever it is you want to do and first consider doing it with your original raw data xi, and only if that doesn't do" "what you want, then implement PCA before using Zi." "So, before using PCA you know, instead of reducing the dimension of the data, I would consider well, let's ditch this PCA step, and I would consider, let's just train my learning algorithm" "on my original data." "Let's just use my original raw inputs xi, and I would recommend, instead of putting" "PCA into the algorithm, just try doing whatever it is you're doing with the xi first." "And only if you have a reason to believe that doesn't work, so that only if your learning algorithm ends up running too slowly, or only if the memory requirement or the disk space requirement is too large," "so you want to compress your representation, but if only using the xi doesn't work, only if you have evidence or strong reason to believe that using the xi won't work, then implement" "PCA and consider using the compressed representation." "Because what I do see, is sometimes people start off with a project plan that incorporates PCA inside, and sometimes they, whatever they're doing will work just fine, even with out using PCA instead." "So, just consider that as an alternative as well, before you go to spend a lot of time to get PCA in, figure out what k is and so on." "So, that's it for PCA." "Despite these last sets of comments, PCA is an incredibly useful algorithm, when you use it for the appropriate applications and I've actually used PCA pretty often and for me," "I use it mostly to speed up the running time of my learning algorithms." "But I think, just as common an application of PCA, is to use it to compress data, to reduce the memory or disk space requirements, or to use it to visualize data." "And PCA is one of the most commonly used and one of the most powerful unsupervised learning algorithms." "And with what you've learned in these videos, I think hopefully you'll be able to implement" "PCA and use them through all of these purposes as well." "In this next set of videos," "I'd like to tell you about a problem called Anomaly Detection." "This is a reasonably commonly use you type machine learning." "And one of the interesting aspects is that it's mainly for unsupervised problem, that there's some aspects of it that are also very similar to sort of the supervised learning problem." "So, what is anomaly detection?" "To explain it." 
"Let me use the motivating example of:" "Imagine that you're a manufacturer of aircraft engines, and let's say that as your aircraft engines roll off the assembly line, you're doing, you know, QA or quality assurance testing, and as part of that testing you" "measure features of your aircraft engine, like maybe, you measure the heat generated, things like the vibrations and so on." "I share some friends that worked on this problem a long time ago, and these were actually the sorts of features that they were collecting off actual aircraft engines so you now have a data set of" "X1 through Xm, if you have manufactured m aircraft engines, and if you plot your data, maybe it looks like this." "So, each point here, each cross here as one of your unlabeled examples." "So, the anomaly detection problem is the following." "Let's say that on, you know, the next day, you have a new aircraft engine that rolls off the assembly line and your new aircraft engine has some set of features x-test." "What the anomaly detection problem is, we want to know if this aircraft engine is anomalous in any way, in other words, we want to know if, maybe, this engine should undergo further testing" "because, or if it looks like an okay engine, and so it's okay to just ship it to a customer without further testing." "So, if your new aircraft engine looks like a point over there, well, you know, that looks a lot like the aircraft engines we've seen before, and so maybe we'll say that it looks okay." "Whereas, if your new aircraft engine, if x-test, you know, were a point that were out here, so that if X1 and" "X2 are the features of this new example." "If x-tests were all the way out there, then we would call that an anomaly." "and maybe send that aircraft engine for further testing before we ship it to a customer, since it looks very different than the rest of the aircraft engines we've seen before." "More formally in the anomaly detection problem, we're give some data sets, x1 through" "Xm of examples, and we usually assume that these end examples are normal or non-anomalous examples, and we want an algorithm to tell us if some new example x-test is anomalous." "The approach that we're going to take is that given this training set, given the unlabeled training set, we're going to build a model for p of x." "In other words, we're going to build a model for the probability of x, where x are these features of, say, aircraft engines." "And so, having built a model of the probability of x we're then going to say that for the new aircraft engine, if p of x-test is less than some epsilon then we flag this as an anomaly." "So we see a new engine that, you know, has very low probability under a model p of x that we estimate from the data, then we flag this anomaly, whereas if p of x-test is, say," "greater than or equal to some small threshold." "Then we say that, you know, okay, it looks okay." "And so, given the training set, like that plotted here, if you build a model, hopefully you will find that aircraft engines, or hopefully the model p of x will say that points that lie, you know, somewhere in the" "middle, that's pretty high probability, whereas points a little bit further out have lower probability." "Points that are even further out have somewhat lower probability, and the point that's way out here, the point that's way out there, would be an anomaly." 
"Whereas the point that's way in there, right in the middle, this would be okay because p of x right in the middle of that would be very high cause we've seen a lot of points in that region." "Here are some examples of applications of anomaly detection." "Perhaps the most common application of anomaly detection is actually for detection if you have many users, and if each of your users take different activities, you know maybe on your website or in the physical plant or something, you" "can compute features of the different users activities." "And what you can do is build a model to say, you know, what is the probability of different users behaving different ways." "What is the probability of a particular vector of features of a users behavior so you know examples of features of a users activity may be on the website it'd be things like," "maybe x1 is how often does this user log in, x2, you know, maybe the number of what pages visited, or the number of transactions, maybe x3 is, you know, the number of posts of the users on the" "forum, feature x4 could be what is the typing speed of the user and some websites can actually track that was the typing speed of this user in characters per second." "And so you can model p of x based on this sort of data." "And finally having your model p of x, you can try to identify users that are behaving very strangely on your website by checking which ones have probably effects less than epsilon and maybe send the profiles of those users for further review." "Or demand additional identification from those users, or some such to guard against you know, strange behavior or fraudulent behavior on your website." "This sort of technique will tend of flag the users that are behaving unusually, not just" "users that maybe behaving fraudulently." "So not just constantly having stolen or users that are trying to do funny things, or just find unusual users." "But this is actually the technique that is used by many online" "websites that sell things to try identify users behaving strangely that might be indicative of either fraudulent behavior or of computer accounts that have been stolen." "Another example of anomaly detection is manufacturing." "So, already talked about the aircraft engine thing where you can find unusual, say, aircraft engines and send those for further review." "A third application would be monitoring computers in a data center." "I actually have some friends who work on this too." "So if you have a lot of machines in a computer cluster or in a data center, we can do things like compute features at each machine." "So maybe some features capturing you know, how much memory used, number of disc accesses, CPU load." "As well as more complex features like what is the CPU load on this machine divided by the amount of network traffic on this machine?" "Then given the dataset of how your computers in your data center usually behave, you can model the probability of x, so you can model the probability of these machines having different amounts of memory use or probability of these machines having" "different numbers of disc accesses or different CPU loads and so on." "And if you ever have a machine whose probability of x, p of x, is very small then you know that machine is behaving unusually" "and maybe that machine is about to go down, and you can flag that for review by a system administrator." "And this is actually being used today by various data centers to watch out for unusual things happening on their machines." "So, that's anomaly detection." 
"In the next video, I'll talk a bit about the Gaussian distribution and review properties of the Gaussian probability distribution, and in videos after that, we will apply it to develop an anomaly detection algorithm." "In the next few videos, we'll talk about large scale machine learning." "That is algorithms but viewing with big data sets." "If you look back and a recent 5 or 10-year history of machine learning." "One of the reasons that learning albums works so much better now than even say, 5-years ago, is just the sheer amount of data that we have now and that we can trade our algorithms on." "In these next few videos, we'll talk about algorithms for dealing when we have such massive data sets." "So why do we want to use such large data sets?" "We've already seen that one of the best ways to get a high performance machine learning system, is if you take a low-bias learning algorithm, and train that on a lot of data." "And so, one early example already seen was this example of cos sin between confusable words." "So, for breakfast, I ate two eggs and we saw in this example, these sorts of results, where, you know, so long as you feed the algorithm a lot of data, it seems to do very well." "And so it's results like these that has led to the saying in machine learning that often it's not who has the best algorithm that wins." "It's who has the most data." "So you want to learn from large data sets, at least when we can get such large data sets." "But learning with large data sets comes with its own unique problems, specifically, computational problems." "Let's say your training set size" "is M equals 100,000,000." "And this is actually pretty realistic for many modern data sets." "If you look at the US Census data set, if there are, you know, 300 million people in the" "US, you can usually get hundreds of millions of records." "If you look at the amount of traffic that popular websites get, you easily get training sets that are much larger than hundreds of millions of examples." "And let's say you want to train a linear regression model or maybe a logistic regression model," "in which case this is the gradient descent rule." "And if you look at what you need to do to compute the gradient, which is this term over here, then when" "M is a hundred million, you need to carry out a summation over a hundred million terms," "in order to compute these derivatives terms and to perform a single step of decent." "Because of the computational expense of summing over a hundred million entries in order to compute just one step of gradient descent, in the next few videos we've spoken about techniques for either replacing this with something else or to" "find more efficient ways to compute this derivative." "By the end of this sequence of videos on large scale machine learning, you know how to fit models, linear regression, logistic regression, neural networks and so on even two data sets with, say, a hundred million examples." "Of course, before we put in the effort into training a model with a hundred million examples, We should also ask ourselves, well, why not use just a thousand examples." "Maybe we can randomly pick the subsets of a thousand examples out of a hundred million examples and train our algorithm on just a thousand examples." "So before investing the effort into actually developing and the software needed to train these massive models is often a good sanity check, if training on just a thousand examples might do just as well." 
"The way to sanity check of using a much smaller training set might do just as well, that is if using a much smaller n equals 1000 size training set, that might do just as well, it is the usual" "method of plotting the learning curves, so if you were to plot the learning curves and if your training objective were to look like this, that's J train theta." "And if your cross-validation set objective, Jcv of theta would look like this, then this looks like a high-variance learning algorithm, and we will be more confident that adding extra training examples would improve performance." "Whereas in contrast if you were to plot the learning curves, if your training objective were to look like this, and if your" "trans-validation objective were to look like that, then this looks like the classical high-bias learning algorithm." "And in the latter case, you know, if you were to plot this up to, say, m equals 1000 and so that is m equals 500 up to m equals 1000, then it seems unlikely that increasing m" "to a hundred million will do much better and then you'd be just fine sticking to n equals 1000, rather than investing a lot of effort to figure out how the scale of the algorithm." "Of course, if you were in the situation shown by the figure on the right, then one natural thing to do would be to add extra features, or add extra hidden units to your neural network and so on, so that you" "end up with a situation closer to that on the left, where maybe this is up to n equals 1000, and this then gives you more confidence that trying to add infrastructure to change the algorithm to use much more than a thousand examples" "that might actually be a good use of your time." "So in large-scale machine learning, we like to come up with computationally reasonable ways, or computationally efficient ways, to deal with very big data sets." "In the next few videos, we'll see two main ideas." "The first is called cost and gradient descent and the second is called Map Reduce, for viewing with very big data sets." "And after you've learned about these methods, hopefully that will allow you to scale up your learning algorithms to big data and allow you to get much better performance on many different applications." "In this video, I'd like to talk about the Gaussian distribution, which is also called the normal distribution." "In case you're already intimately familiar with the Gaussian distribution, it is probably okay to skip this video." "But if you're not sure or if it's been a while since you've worked with a Gaussian distribution or the normal distribution then please do watch this video all the way to the end." "And in the video after this, we'll start applying the Gaussian distribution to developing an anomaly detection algorithm." "Let's say x is a real value random variable, so x is a real number." "If the probability distribution of x is Gaussian, it would mean Mu and variant sigma squared, then we'll write this as x the random variable tilde." "That's this little tilde as distributed as and then to denote the Gaussian Distribution, sometimes you're going to write script n, parentheses" "Mu, sigma squared." "So, this script's end stands for normal, since Gaussian and normal distribution, they mean the same phase of synonymous and a" "Gussian distribution is parameterized by 2 parameters, by a mean parameter which we denote Mu, and a variance parameter, which we denote by sigma squared." 
"If we pluck the Gaussian distribution or Gaussian probability density, it will look like the bell shaped curve, which you may have seen before." "And so, this bell-shaped curve is parameterized by those 2 parameters Mu and sigma." "And the location of the center of this bell-shaped curve is the mean Mu, and the width of this bell-shaped curve," "so it's roughly that, is the, this parameter sigma, is also called one standard deviation." "And so, this specifies the probability of x taking on different values, so x taking on values, you know in the middle here is pretty high since the Gaussian density here is pretty high whereas x taking on values further and" "further away will be diminishing in probability." "Finally, just for completeness, let me write out the formula for the Gaussian distribution so the property of x and I'll sometimes write this instead of p of x, I'm going to write this as p of" "x semicolon Mu comma sigma squared." "And so this denotes that the probability of x is parametrized by the two parameters Mu and sigma squared." "And the formula for the" "Gaussian density is this, 1 over 2pi, sigma e to the negative x minus Mu squared over 2 sigma squared." "So there's no need to memorize this formula, you know, this is just the formula for the bell-shaped curve over here on the left." "There's no need to memorize it and if you ever need to use this, you can always look this up." "And so that figure on the left, that is what you get if you take a fixed value of Mu and a fixed value of sigma and you plot p of x." "So this curve here, this is really p of x plotted as a function of x, you know, for a fixed value of Mu and of sigma squared sigma squared, that's called the variance." "And sometimes it's easier to think in terms of sigma." "So sigma is called the standard deviation and it, so it specifies the width of this Gaussian probability density whereas the square of sigma, so sigma squared, is called the variance." "Let's look at some examples of what the Gaussian distribution looks like." "If Mu equals zero, sigma equals 1." "Then we have a Gaussian distribution that is centered around zero, because that's Mu." "And the width of this Gaussian, so that's one standard deviation is sigma over there." "Let's look at some examples of" "Gaussians." "If Mu is equal to zero it equals 1." "Then that corresponds to a" "Gaussian distribution that is centered at zero since Mu is zero." "And the width of this Gaussian is Gaussian thus controlled by sigma by that variance parameter sigma." "Here's another example." "Let's say Mu is equal to zero and sigma is equal to one-half." "So the standard deviation is one-half and the variance sigma squared would therefore be the square of 0.5 would be 0.25." "And in that case the Gaussian distribution, the Gaussian probability density looks like this, is also centered at zero." "But now the width of this is much smaller because the smaller variance, the width of this Gaussian density is roughly half as wide." "But because this is a probability distribution, the area under the curve, that is the shaded area there, that area must integrate to 1." "This is a property of probability distributions." "And so, you know, this is a much taller Gaussian density because it's half as wide, with half the standard deviation, but it's twice as tall." "Another example, if sigma is equal to 2, then you get a much fatter, or much wider Gaussian density." "And so here, the sigma parameter controls that this" "Gaussian density has a wider width." 
"And once again, the area under the curve, that is this shaded area, you know, it always integrates to 1." "That's a property of probability distributions, and because it's wider, it's also half as tall, in order to just integrate to the same thing." "And finally, one last example would be, if we now changed the Mu parameters as well, then instead of being centered at zero, we now we have a Gaussian distribution that is centered at three, because" "this shifts over the entire Gaussian distribution." "Next, lets take about the parameter estimation problem." "So what is the parameter estimation problem?" "Let's say we have a data set of m examples, so x1 through x(m), and let's say each of these examples is a real number." "Here in the figure, I've plotted an example of a data set, so the horizontal axis is the x axis and, you know, I have a range of examples of x and I've just plotted them" "on this figure here." "And the parameter estimation problem is, let's say I suspect that these examples came from a Gaussian distribution so let's say I suspect that each of my example x(i) was distributed." "That's what this tilde thing means." "Thus, I suspect that each of these examples was distributed according to a normal distribution or" "Gaussian distribution with some parameter Mu and some parameter sigma squared." "But I don't know what the values of these parameters are." "The problem with parameter estimation is, given my data set I want to figure out, I want to estimate, what are the values of Mu and sigma squared." "So, if you're given a data set like this, you know, it looks like maybe, if I estimate what Gaussian distribution the data came from, maybe that" "might be roughly the Gaussian distribution it came from, with Mu" "being the center of the distribution and sigma the standard deviation controlling the width of this Gaussian distribution." "It seems like a reasonable fit to the data, because, you know, it looks the data has a very high probability of being in the central region, low probability of" "being further out, low probability of being further out, and so on." "And so, maybe this is a reasonable estimate of" "Mu and of sigma squared, that is, if it corresponds to a Gaussian distribution, that then looks like this." "So, what I'm going to do is write out the formulas, the standard formulas for estimating the parameters from" "Mu and sigma squared." "The way we are going to estimate Mu is going to be just the average of my example." "So Mu is the mean parameter, so I'm going to take my training set, take my m examples and average them." "And that just gives me the center of this distribution." "How about sigma squared?" "Well the variance, I'll just write out the standard formula again," "I'm going to estimate as sum of 1 through m of x(i), minus Mu squared, and so this Mu here is actually the Mu that I compute over here using this formula," "and what the variance is, or one interpretation of the variance, is that, if you look at the this term, that's the square difference between the value" "I've got in my example minus the mean, minus the center, minus the mean of distribution." "And so, you know, the variance, I'm gonna estimate as just the average of the square differences, between my examples, minus the mean." 
"And as a side comment, only for those of you that are experts in statistics, if you're an expert in statistics and if you've heard of maximum likelihood estimation," "then these estimates are actually the maximum likelihood estimates of the parameters of Mu and sigma squared." "But if you haven't heard of that before, don't worry about it." "All you need to know is that these are the two standard formulas for how you try to figure out what our Mu and sigma squared given the dataset." "Finally one last side comment." "Again only for those of you that has maybe taken a statistics class before." "But if you have taken a statistics class before, some of you may have seen the formula here where, you know, this is m minus" "1, instead of m." "So this first term becomes 1 over m minus 1, instead of 1 over m." "In machine learning, people tend to use this 1 over m formula." "But in practice, whether it is 1 over m or 1 over m minus one, makes essentially no difference, assuming, you know, m is reasonably large, it's a large training set size." "So, just in case you've seen this other version before, in either version it works just equally well, but in machine learning most people tend to use 1 over m in this formula." "And the two versions have slightly different theoretical properties, slightly different mathematical properties, but in practice it really makes very little difference, if any." "So, hopefully, you now have a good sense of what the" "Gaussian distribution looks like, as well as how to estimate the parameters, mu and sigma squared, of the Gaussian distribution, and if you're given a training set, that is if you're given a set of data that you suspect" "comes from a Gaussian distribution with unknown parameters using sigma squared." "In the next video, we'll start to take this and apply it to develop the anomaly detection algorithm." "In the last video, we talked about the Gaussian distribution." "In this video lets apply that to develop an anomaly detection algorithm." "Let's say that we have an unlabeled training set of M examples, and each of these examples is going to be a feature in Rn so your training set could be," "feature vectors from the last" "M aircraft engines being manufactured." "Or it could be features from m users or something else." "The way we are going to address anomaly detection, is we are going to model p of x from the data sets." "We're going to try to figure out what are high probability features, what are lower probability types of features." "So, x is a vector and what we are going to do is model p of x, as probability of x1, that is of the first component of x, times the probability of x2, that is the probability" "of the second feature, times the probability of the third feature, and so on up to the probability of the final feature of Xn." "Now I'm leaving space here cause I'll fill in something in a minute." "So, how do we model each of these terms, p of X1, p of X2, and so on." "What we're going to do, is assume that the feature," "X1, is distributed according to a Gaussian distribution, with some mean, which you want to write as mu1 and some variance, which I'm going to write as sigma squared 1," "and so p of X1 is going to be a Gaussian probability distribution, with mean mu1 and variance sigma squared 1." 
"And similarly I'm going to assume that X2 is distributed, Gaussian, that's what this little tilda stands for, that means distributed Gaussian with mean mu2 and Sigma squared 2, so it's distributed according to a different Gaussian, which has" "a different set of parameters, mu2 sigma square 2." "And similarly, you know," "X3 is yet another" "Gaussian, so this can have a different mean and a different standard deviation than the other features, and so on, up to XN." "And so that's my model." "Just as a side comment for those of you that are experts in statistics, it turns out that this equation that I just wrote out actually corresponds to an independence assumption on the values of the features x1 through xn." "But in practice it turns out that the algorithm of this fragment, it works just fine, whether or not these features are anywhere close to independent and even if independence assumption doesn't hold true this algorithm works just fine." "But in case you don't know those terms I just used independence assumptions and so on, don't worry about it." "You'll be able to understand it and implement this algorithm just fine and that comment was really meant only for the experts in statistics." "Finally, in order to wrap this up, let me take this expression and write it a little bit more compactly." "So, we're going to write this is a product from J equals one through N, of P of XJ parameterized by mu j comma sigma squared" "j." "So this funny symbol here, there is capital Greek alphabet pi, that funny symbol there corresponds to taking the product of a set of values." "And so, you're familiar with the summation notation, so the sum from i equals one through n, of i." "This means 1 + 2 + 3 plus dot dot dot, up to n." "Where as this funny symbol here, this product symbol, right product from i equals 1 through n of i." "Then this means that, it's just like summation except that we're now multiplying." "This becomes 1 times 2 times 3 times up" "to N. And so using this product notation, this product from j equals 1 through n of this expression." "It's just more compact, it's just shorter way for writing out this product of of all of these terms up there." "Since we're are taking these p of x j given mu j comma sigma squared j terms and multiplying them together." "And, by the way the problem of estimating this distribution p of x, they're sometimes called" "the problem of density estimation." "Hence the title of the slide." "So putting everything together, here is our anomaly detection algorithm." "The first step is to choose features, or come up with features xi that we think might be indicative of anomalous examples." "So what I mean by that, is, try to come up with features, so that when there's an unusual user in your system that may be doing fraudulent things, or when the aircraft engine examples, you know" "there's something funny, something strange about one of the aircraft engines." "Choose features X I, that you think might take on unusually" "large values, or unusually small values, for what an anomalous example might look like." "But more generally, just try to choose features that describe general" "properties of the things that you're collecting data on." 
"Next, given a training set, of M, unlabled examples," "X1 through X M, we then fit the parameters, mu 1 through mu n, and sigma squared 1 through sigma squared n, and so these were the formulas similar to the formulas we have in the previous video, that we're" "going to use the estimate each of these parameters, and just to give some interpretation, mu J, that's my average value of the j feature." "Mu j goes in this term p of xj." "which is parametrized by mu J and sigma squared J. And so this says for the mu J just take the mean over my training set of the values of the j feature." "And, just to mention, that you do this, you compute these formulas for j equals one through n." "So use these formulas to estimate mu 1, to estimate mu" "2, and so on up to mu n, and similarly for sigma squared, and it's also possible to come up with vectorized versions of these." "So if you think of mu as a vector, so mu if is a vector there's mu 1, mu 2, down to mu n, then a vectorized version of that set of parameters can be written" "like so sum from 1 equals one through n xi." "So, this formula that I just wrote out estimates this xi as the feature vectors that estimates mu for all the values of n simultaneously." "And it's also possible to come up with a vectorized formula for estimating sigma squared j." "Finally, when you're given a new example, so when you have a new aircraft engine and you want to know is this aircraft engine anomalous." "What we need to do is then compute p of x, what's the probability of this new example?" "So, p of x is equal to this product, and what you implement, what you compute, is this formula and where over here, this thing here this is just the formula for the Gaussian probability, so you compute" "this thing, and finally if this probability is very small, then you flag this thing as an anomaly." "Here's an example of an application of this method." "Let's say we have this data set plotted on the upper left of this slide." "if you look at this, well, lets look the feature of x1." "If you look at this data set, it looks like on average, the features x1 has a mean of about 5" "and the standard deviation, if you only look at just the x1 values of this data set has the standard deviation of maybe 2." "So that sigma 1 and looks like x2 the values of the features as measured on the vertical axis, looks like it has an average value of about 3, and a standard deviation of about 1." "So if you take this data set and if you estimate mu1, mu2, sigma1, sigma2, this is what you get." "And again, I'm writing sigma here," "I'm think about standard deviations, but the formula on the previous 5 actually gave the estimates of the squares of theses things, so sigma squared 1 and sigma squared 2." "So, just be careful whether you are using sigma 1, sigma" "2, or sigma squared 1 or sigma squared 2." "So, sigma squared 1 of course would be equal to 4, for" "example, as the square of 2." "And in pictures what p of x1 parametrized by mu1 and sigma squared 1 and p of x2, parametrized by mu" "2 and sigma squared 2, that would look like these two distributions over here." "And, turns out that if were to plot of p of x, right, which is the product of these two things, you can actually get a surface plot that looks like this." "This is a plot of p of x, where the height above of this, where the height of this surface at a particular point, so given a particular x1 x2 values of x2 if x1 equals 2, x equal 2, that's this point." "And the height of this 3-D surface here, that's p" "of x." 
"So p of x, that is the height of this plot, is literally just p of x1" "parametrized by mu 1 sigma squared 1, times p of x2 parametrized by mu 2 sigma squared 2." "Now, so this is how we fit the parameters to this data." "Let's see if we have a couple of new examples." "Maybe I have a new example there." "Is this an anomaly or not?" "Or, maybe I have a different example, maybe I have a different second example over there." "So, is that an anomaly or not?" "They way we do that is, we would set some value for" "Epsilon, let's say I've chosen" "Epsilon equals 0.02." "I'll say later how we choose Epsilon." "But let's take this first example, let me call this example X1 test." "And let me call the second example" "X2 test." "What we do is, we then compute p of" "X1 test, so we use this formula to compute it and this looks like a pretty large value." "In particular, this is greater than, or greater than or equal to epsilon." "And so this is a pretty high probability at least bigger than epsilon, so we'll say that" "X1 test is not an anomaly." "Whereas, if you compute p of" "X2 test, well that is just a much smaller value." "So this is less than epsilon and so we'll say that that is indeed an anomaly, because it is much smaller than that epsilon that we then chose." "And in fact, I'd improve it here." "What this is really saying is that, you look through the 3d surface plot." "It's saying that all the values of x1 and x2 that have a high height above the surface, corresponds to an a non-anomalous example of an OK or normal example." "Whereas all the points far out here, all the points out here, all of those points have very low probability, so we are going to flag those points as anomalous, and so it's gonna define some region, that maybe looks" "like this, so that everything outside this, it flags as anomalous," "whereas the things inside this ellipse I just drew, if it considers okay, or non-anomalous, not anomalous examples." "And so this example x2 test lies outside that region, and so it has very small probability, and so we consider it an anomalous example." "In this video we talked about how to estimate p of x, the probability of x, for the purpose of developing an anomaly detection algorithm." "And in this video, we also stepped through an entire process of giving data set, we have, fitting the parameters, doing parameter estimations." "We get mu and sigma parameters, and then taking new examples and deciding if the new examples are anomalous or not." "In the next few videos we will delve deeper into this algorithm, and talk a bit more about how to actually get this to work well." "In the last video, we developed an anomaly detection algorithm." "In this video, I like to talk about the process of how to go about developing a specific application of anomaly detection to a problem and in particular this will focus on the problem of how to evaluate an anomaly detection algorithm." "In previous videos, we've already talked about the importance of real number evaluation and this captures the idea that when you're trying to develop a learning algorithm for a specific application, you need to often make a lot of choices" "like, you know, choosing what features to use and then so on." "And making decisions about all of these choices is often much easier, and if you have a way to evaluate your learning algorithm that just gives you back a number." "So if you're trying to decide, you know, I have an idea for one extra feature, do I include this feature or not." 
"If you can run the algorithm with the feature, and run the algorithm without the feature, and just get back a number that tells you, you know, did it improve or worsen performance to add this feature?" "Then it gives you a much better way, a much simpler way, with which to decide whether or not to include that feature." "So in order to be able to develop an anomaly detection system quickly, it would be a really helpful to have a way of evaluating an anomaly detection system." "In order to do this, in order to evaluate an anomaly detection system, we're actually going to assume have some labeled data." "So, so far, we'll be treating anomaly detection as an unsupervised learning problem, using unlabeled data." "But if you have some labeled data that specifies what are some anomalous examples, and what are some non-anomalous examples, then this is how we actually think of as the standard way of evaluating an anomaly detection algorithm." "So taking the aircraft engine example again." "Let's say that, you know, we have some label data of just a few anomalous examples of some aircraft engines that were manufactured in the past that turns out to be anomalous." "Turned out to be flawed or strange in some way." "Let's say we use we also have some non-anomalous examples, so some perfectly okay examples." "I'm going to use y equals 0 to denote the normal or the non-anomalous example and y equals 1 to denote the anomalous examples." "The process of developing and evaluating an anomaly detection algorithm is as follows." "We're going to think of it as a training set and talk about the cross validation in test sets later, but the training set we usually think of this as still the unlabeled" "training set." "And so this is our large collection of normal, non-anomalous or not anomalous examples." "And usually we think of this as being as non-anomalous, but it's actually okay even if a few anomalies slip into your unlabeled training set." "And next we are going to define a cross validation set and a test set, with which to evaluate a particular anomaly detection algorithm." "So, specifically, for both the cross validation test sets we're going to assume that, you know, we can include a few examples in the cross validation set and the test set that contain examples that are known to be anomalous." "So the test sets say we have a few examples with y equals 1 that correspond to anomalous aircraft engines." "So here's a specific example." "Let's say that, altogether, this is the data that we have." "We have manufactured 10,000 examples of engines that, as far as we know we're perfectly normal, perfectly good aircraft engines." "And again, it turns out to be okay even if a few flawed engine slips into the set of" "10,000 is actually okay, but we kind of assumed that the vast majority of these" "10,000 examples are, you know, good and normal non-anomalous engines." "And let's say that, you know, historically, however long we've been running on manufacturing plant, let's say that we end up getting features, getting 24 to 28 anomalous engines as well." "And for a pretty typical application of anomaly detection, you know, the number non-anomalous" "examples, that is with y equals 1, we may have anywhere from, you know, 20 to 50." "It would be a pretty typical range of examples, number of examples that we have with y equals 1." "And usually we will have a much larger number of good examples." 
"So, given this data set, a fairly typical way to split it into the training set, cross validation set and test set would be as follows." "Let's take 10,000 good aircraft engines and put 6,000 of that into the unlabeled training set." "So, I'm calling this an unlabeled training set but all of these examples are really ones that correspond to y equals 0, as far as we know." "And so, we will use this to fit p of x, right." "So, we will use these 6000 engines to fit p of x, which is that p of x one parametrized by Mu" "1, sigma squared 1, up to p of Xn parametrized by Mu N sigma squared" "n." "And so it would be these 6,000 examples that we would use to estimate the parameters" "Mu 1, sigma squared 1, up to Mu N, sigma squared N. And so that's our training set of all, you know, good, or the vast majority of good examples." "Next we will take our good aircraft engines and put some number of them in a cross validation set plus some number of them in the test sets." "So 6,000 plus 2,000 plus 2,000, that's how we split up our" "10,000 good aircraft engines." "And then we also have 20 flawed aircraft engines, and we'll take that and maybe split it up, you know, put ten of them in the cross validation set and put ten of them in the test sets." "And in the next slide we will talk about how to actually use this to evaluate the anomaly detection algorithm." "So what I have just described here is a you know probably the recommend a good way of splitting the labeled and unlabeled example." "The good and the flawed aircraft engines." "Where we use like a 60, 20, 20% split for the good engines and we take the flawed engines, and we put them just in the cross validation set, and just in the test set, then we'll see in the next slide why that's the case." "Just as an aside, if you look at how people apply anomaly detection algorithms, sometimes you see other peoples' split the data differently as well." "So, another alternative, this is really not a recommended alternative, but some people want to take off your 10,000 good engines, maybe put 6000 of them in your training set and then put the same" "4000 in the cross validation set and the test set." "And so, you know, we like to think of the cross validation set and the test set as being completely different data sets to each other." "But you know, in anomaly detection, you know, for sometimes you see people, sort of, use the same set of good engines in the cross validation sets, and the test sets, and sometimes you" "see people use exactly the same sets of anomalous" "engines in the cross validation set and the test set." "And so, all of these are considered, you know, less good practices and definitely less recommended." "Certainly using the same data in the cross validation set and the test set, that is not considered a good machine learning practice." "But, sometimes you see people do this too." "So, given the training cross validation and test sets, here's how you evaluate or here is how you develop and evaluate an algorithm." "First, we take the training sets and we fit the model p of x." "So, we fit, you know, all these" "Gaussians to my m unlabeled examples of aircraft engines, and these, I am calling them unlabeled examples, but these are really examples that we're assuming our goods are the normal aircraft engines." "Then imagine that your anomaly detection algorithm is actually making prediction." 
"So, on the cross validation of the test set, given that, say, test example X, think of the algorithm as predicting that y is equal to 1, p of x is less than epsilon," "we must be taking zero, if p of x is greater than or equal to epsilon." "So, given x, it's trying to predict, what is the label, given y equals 1 corresponding to an anomaly or is it y equals 0 corresponding to a normal example?" "So given the training, cross validation, and test sets." "How do you develop an algorithm?" "And more specifically, how do you evaluate an anomaly detection algorithm?" "Well, to this whole, the first step is to take the unlabeled training set, and to fit the model p of x lead training data." "So you take this, you know on I'm coming, unlabeled training set, but really, these are examples that we are assuming, vast majority of which are normal aircraft engines, not because they're not anomalies" "and it will fit the model p of x." "It will fit all those parameters for all the Gaussians on this data." "Next on the cross validation of the test set, we're going to think of the anomaly detention algorithm as trying to predict the value of y." "So in each of like say test examples." "We have these X-I tests," "Y-I test, where y is going to be equal to" "1 or 0 depending on whether this was an anomalous example." "So given input x in my test set, my anomaly detection algorithm think of it as predicting the y as 1 if p of x is less than epsilon." "So predicting that it is an anomaly, it is probably is very low." "And we think of the algorithm is predicting that y is equal to 0." "If p of x is greater then or equals epsilon." "So predicting those normal example if the p of x is reasonably large." "And so we can now think of the anomaly detection algorithm as making predictions for what are the values of these y labels in the test sets or on the cross validation set." "And this puts us somewhat more similar to the supervised learning setting, right?" "Where we have label test set and our algorithm is making predictions on these labels and so we can evaluate it you know by seeing how often it gets these labels right." "Of course these labels are will be very skewed because y equals zero, that is normal examples, usually be much more common than y equals" "1 than anomalous examples." "But, you know, this is much closer to the source of evaluation metrics we can use in supervised learning." "So what's a good evaluation metric to use." "Well, because the data is very skewed, because y equals 0 is much more common, classification accuracy would not be a good the evaluation metrics." "So, we talked about this in the earlier video." "So, if you have a very skewed data set, then predicting y equals 0 all the time, will have very high classification accuracy." "Instead, we should use evaluation metrics, like computing the fraction of true positives, false positives, false negatives, true negatives or compute the position of the v curve of this algorithm or do things like compute the" "f1 score, right, which is a single real number way of summarizing the position and the recall numbers." "And so these would be ways to evaluate an anomaly detection algorithm on your cross validation set or on your test set." "Finally, earlier in the anomaly detection algorithm, we also had this parameter epsilon, right?" "So, epsilon is this threshold that we would use to decide when to flag something as an anomaly." 
"And so, if you have a cross validation set, another way to and to choose this parameter epsilon, would be to try a different, try many different values of epsilon, and then pick the value of epsilon" "that, let's say, maximizes f1 score, or that otherwise does well on your cross validation set." "And more generally, the way to reduce the training, testing, and cross validation sets, is that" "when we are trying to make decisions, like what features to include, or trying to, you know, tune the parameter epsilon, we would then continually evaluate the algorithm on the cross validation sets and make all those decisions like what" "features did you use, you know, how to set epsilon, use that, evaluate the algorithm on the cross validation set, and then when we've picked the set of features, when we've found the value of" "epsilon that we're happy with, we can then take the final model and evaluate it, you know, do the final evaluation of the algorithm on the test sets." "So, in this video, we talked about the process of how to evaluate an anomaly detection algorithm, and again, having being able to evaluate an algorithm, you know, with a single real number evaluation," "with a number like an F1 score that often allows you to much more efficient use of your time when you are trying to develop an anomaly detection system." "And we try to make these sorts of decisions." "I have to chose epsilon, what features to include, and so on." "In this video, we started to use a bit of labeled data in order to evaluate the anomaly detection algorithm and this takes us a little bit closer to a supervised learning setting." "In the next video, I'm going to say a bit more about that." "And in particular we'll talk about when should you be using an anomaly detection algorithm and when should we be thinking about using supervised learning instead, and what are the differences between these two formalisms." "For many learning algorithms, among them linear regression, logistic regression and neural networks, the way we derive the algorithm was by coming up with a cost function or coming up with an optimization objective." "And then using an algorithm like gradient descent to minimize that cost function." "We have a very large training set gradient descent becomes a computationally very expensive procedure." "In this video, we'll talk about a modification to the basic gradient descent algorithm called Stochastic gradient descent, which will allow us to scale these algorithms to much bigger training sets." "Suppose you are training a linear regression model using gradient descent." "As a quick recap, the hypothesis will look like this, and the cost function will look like this, which is the sum of one half of the average square error of your hypothesis on your m training examples," "and the cost function we've already seen looks like this sort of bow-shaped function." "So, plotted as function of the parameters theta 0 and theta 1, the cost function J is a sort of a bow-shaped function." "And gradient descent looks like this, where in the inner loop of gradient descent you repeatedly update the parameters theta using that expression." "Now in the rest of this video, I'm going to keep using linear regression as the running example." "But the ideas here, the ideas of Stochastic gradient descent is fully general and also applies to other learning algorithms" "like logistic regression, neural networks and other algorithms that are based on training gradient descent on a specific training set." 
"So here's a picture of what gradient descent does, if" "the parameters are initialized to the point there then as you run gradient descent different iterations of gradient descent will take the parameters to the global minimum." "So take a trajectory that looks like that and heads pretty directly to the global minimum." "Now, the problem with gradient descent is that if m is large." "Then computing this derivative term can be very expensive, because the surprise, summing over all m examples." "So if m is 300 million, alright." "So in the United States, there are about 300 million people." "And so the US or United" "States census data may have on the order of that many records." "So you want to fit the linear regression model to that then you need to sum over 300 million records." "And that's very expensive." "To give the algorithm a name, this particular version of gradient descent is also called Batch gradient descent." "And the term Batch refers to the fact that we're looking at all of the training examples at a time." "We call it sort of a batch of all of the training examples." "And it really isn't the, maybe the best name but this is what machine learning people call this particular version of gradient descent." "And if you imagine really that you have 300 million census records stored away on disc." "The way this algorithm works is you need to read into your computer memory all 300 million records in order to compute this derivative term." "You need to stream all of these records through computer because you can't store all your records in computer memory." "So you need to read through them and slowly, you know, accumulate the sum in order to compute the derivative." "And then having done all that work, that allows you to take one step of gradient descent." "And now you need to do the whole thing again." "You know, scan through all 300 million records, accumulate these sums." "And having done all that work, you can take another little step using gradient descent." "And then do that again." "And then you take yet a third step." "And so on." "And so it's gonna take a long time in order to get the algorithm to converge." "In contrast to Batch gradient descent, what we are going to do is come up with a different algorithm that doesn't need to look at all the training examples in every single iteration, but that needs to look at only a single" "training example in one iteration." "Before moving on to the new algorithm, here's just a" "Batch gradient descent algorithm written out again with that being the cost function and that being the update and of course this term here, that's used in the gradient descent rule, that is the partial derivative with respect" "to the parameters theta J of our optimization objective, J train of theta." "Now, let's look at the more efficient algorithm that scales better to large data sets." "In order to work off the algorithms called Stochastic gradient descent, this vectors the cost function in a slightly different way then they define the cost of the parameter theta with respect to a training example x(i), y(i) to be equal" "to one half times the squared error that my hypothesis incurs on that example, x(i), y(i)." "So this cost function term really measures how well is my hypothesis doing on a single example x(i), y(i)." "Now you notice that the overall cost function j train can now be written in this equivalent form." "So j train is just the average over my m training examples of the cost of my hypothesis on that example x(i), y(i)." 
"Armed with this view of the cost function for linear regression, let me now write out what Stochastic gradient descent does." "The first step of Stochastic gradient descent is to randomly" "shuffle the data set." "So by that I just mean randomly shuffle, or randomly reorder your m training examples." "It's sort of a standard pre-processing step, come back to this in a minute." "But the main work of" "Stochastic gradient descent is then done in the following." "We're going to repeat for i equals 1 through m." "So we'll repeatedly scan through my training examples and perform the following update." "Gonna update the parameter theta j as theta j minus alpha times h of x(i) minus y(i)" "times x(i)j." "And we're going to do this update as usual for all values of j." "Now, you notice that this term over here is exactly" "what we had inside the summation for Batch gradient" "descent." "In fact, for those of you that are calculus is possible to show that that term here, that's this term here, is equal to the partial derivative with respect to my parameter theta j of the cost of the" "parameters theta on x(i), y(i)." "Where cost is of course this thing that was defined previously." "And just the wrap of the algorithm, let me close my curly braces over there." "So what Stochastic gradient descent is doing is it is actually scanning through the training examples." "And first it's gonna look at my first training example x(1), y(1)." "And then looking at only this first example, it's gonna take like a basically a little gradient descent step with respect to the cost of just this first training example." "So in other words, we're going to look at the first example and modify the parameters a little bit to fit just the first training example a little bit better." "Having done this inside this inner for-loop is then going to go on to the second training example." "And what it's going to do there is take another little step in parameter space, so modify the parameters just a little bit to try to fit just a second training example a little bit better." "Having done that, is then going to go onto my third training example." "And modify the parameters to try to fit just the third training example a little bit better, and so on until you know, you get through the entire training set." "And then this ultra repeat loop may cause it to take multiple passes over the entire training set." "This view of Stochastic gradient descent also motivates why we wanted to start by randomly shuffling the data set." "This doesn't show us that when we scan through the training site here, that we end up visiting the training examples in some sort of randomly sorted order." "Depending on whether your data already came randomly sorted or whether it came originally sorted in some strange order, in practice this would just speed up the conversions to Stochastic gradient descent just a little bit." "So in the interest of safety, it's usually better to randomly shuffle the data set if you aren't sure if it came to you in randomly sorted order." "But more importantly another view of Stochastic gradient descent is that it's a lot like descent but rather than wait to sum up these gradient terms over all m training examples, what we're doing is we're taking this" "gradient term using just one single training example and we're starting to make progress in improving the parameters already." 
"So rather than, you know, waiting 'till taking a path through all 300,000 United States Census records, say, rather than needing to scan through all of the training examples before we can modify the parameters a little" "bit and make progress towards a global minimum." "For Stochastic gradient descent instead we just need to look at a single training example and we're already starting to make progress in this case of parameters towards, moving the parameters towards the global minimum." "So, here's the algorithm written out again where the first step is to randomly shuffle the data and the second step is where the real work is done, where that's the update with respect to a single training example x(i), y(i)." "So, let's see what this algorithm does to the parameters." "Previously, we saw that when we are using Batch gradient descent, that is the algorithm that looks at all the training examples in time," "Batch gradient descent will tend to, you know, take a reasonably" "straight line trajectory to get to the global minimum like that." "In contrast with Stochastic gradient descent every iteration is going to be much faster because we don't need to sum up over all the training examples." "But every iteration is just trying to fit single training example better." "So, if we were to start stochastic gradient descent, oh, let's start stochastic gradient descent at a point like that." "The first iteration, you know, may take the parameters in that direction and maybe the second iteration looking at just the second example maybe just by chance, we get more unlucky and actually head in a bad direction with the parameters like that." "In the third iteration where we tried to modify the parameters to fit just the third training examples better, maybe we'll end up heading in that direction." "And then we'll look at the fourth training example and we will do that." "The fifth example, sixth example, 7th and so on." "And as you run Stochastic gradient descent, what you find is that it will generally move the parameters" "in the direction of the global minimum, but not always." "And so take some more random-looking, circuitous path to watch the global minimum." "And in fact as you run" "Stochastic gradient descent it doesn't actually converge in the same same sense as Batch gradient descent does and what it ends up doing is wandering around continuously in some region that's in some region close to the global minimum, but it doesn't" "just get to the global minimum and stay there." "But in practice this isn't a problem because, you know, so long as the parameters end up in some region there maybe it is pretty close to the global minimum." "So, as parameters end up pretty close to the global minimum, that will be a pretty good hypothesis and so" "usually running Stochastic gradient descent we get a parameter near the global minimum and that's good enough for, you know, essentially any, most practical purposes." "Just one final detail." "In Stochastic gradient descent, we had this outer loop repeat which says to do this inner loop multiple times." "So, how many times do we repeat this outer loop?" "Depending on the size of the training set, doing this loop just a single time may be enough." "And up to, you know, maybe 10 times may be typical so we may end up repeating this inner loop anywhere from once to ten times." 
"So if we have a you know, truly massive data set like the this US census gave us that example that I've been talking about with" "300 million examples, it is possible that by the time you've taken just a single pass through your training set." "So, this is for i equals 1 through 300 million." "It's possible that by the time you've taken a single pass through your data set you might already have a perfectly good hypothesis." "In which case, you know, this inner loop you might need to do only once if m is very, very large." "But in general taking anywhere from 1 through 10 passes through your data set, you know, maybe fairly common." "But really it depends on the size of your training set." "And if you contrast this to Batch gradient descent." "With Batch gradient descent, after taking a pass through your entire training set, you would have taken just one single gradient descent steps." "So one of these little baby steps of gradient descent where you just take one small gradient descent step and this is why Stochastic gradient descent can be much faster." "So, that was the Stochastic gradient descent algorithm." "And if you implement it, hopefully that will allow you to scale up many of your learning algorithms to much bigger data sets and get much more performance that way." "In the previous video, we talked about Stochastic gradient descent, and how that can be much faster than Batch gradient descent." "In this video, let's talk about another variation on these ideas is called Mini-batch gradient descent they can work sometimes even a bit faster than stochastic gradient descent." "To summarize the algorithms we talked about so far." "In Batch gradient descent we will use all m examples in each generation." "Whereas in Stochastic gradient descent we will use a single example in each generation." "What Mini-batch gradient descent does is somewhere in between." "Specifically, with this algorithm we're going to use b examples in each iteration where b is a parameter called the mini" "batch size so the idea is that this is somewhat in-between Batch gradient descent and Stochastic gradient descent." "This is just like batch gradient descent, except that I'm going to use a much smaller batch size." "A typical tourist for the value of b might be b equals 10 lets say and a typical range really might be anywhere from b equals 2 up to b equals 100." "So that will be a pretty typical range of values for the" "Mini-batch size." "And the idea is that rather than using one example at a time or m examples at a time we will use b examples at a time." "So let me just write this off informally, we're going to get, let's say, b." "For this example, let's say b equals 10." "So we're going to get, you know, the next 10 examples from my training set so that may be some set of examples xi, yi." "If it's 10 examples then the indexing will be up to x (i+9)," "y (i+9) so that's 10 examples altogether and then we'll perform essentially a" "gradient descent update using these 10 examples." "So, that's any rate times one tenth times sum over k equals i through i+9 of x subscript theta of x(k) minus y(k) you times" "x(k)j." "And so in this expression, where summing the gradient terms over my ten examples." "So, that's number ten, that's, you know, my mini batch size and just i+9 again, the 9 comes from the choice of the parameter b, and then after this we will then" "increase, you know, i by tenth, we will go on to the next ten examples and then keep moving like this." 
"So just to write out the entire algorithm in full." "In order to simplify the indexing for this one at the right top, I'm going to assume we have a mini-batch size of ten and a training set size of a thousand, what we're going to do is have this" "sort of form, you know, for i equals 1 and that in 21's the stepping, in steps of" "10 because we look at 10 examples at a time." "And then we perform this sort of gradient descent update using ten examples at a time so this 10 and this i+9 those are consequence" "of having chosen my mini-batch to be ten." "And you know, this ultimate four-loop, this ends at 991 here because if I have 1000 training samples then" "I need 100 steps of size 10 in order to get through my training set." "So this is mini-batch gradient descent." "Compared to batch gradient descent, this also allows us to make progress much faster." "So we have again our running example of, you know," "U.S. Census data with 300 million training examples, then what we're saying is after looking at just the first 10 examples we can start to make progress in improving the parameters theta so we don't' need to scan through the entire training set." "We just need to look at the first 10 examples and this will start letting us make progress and then we can look at the second ten examples and modify the parameters a little bit again and so on." "So, that is why Mini-batch gradient descent can be faster than batch gradient descent." "Namely, you can start making progress in modifying the parameters after looking at just ten examples rather than needing to wait 'till you've scan through every single training" "example of 300 million of them." "So, how about Mini-batch gradient descent versus Stochastic gradient descent." "So, why do we want to look at b examples at a time rather than look at just a single example at a time as the Stochastic gradient descent?" "The answer is in vectorization." "In particular, Mini-batch gradient descent is likely to outperform Stochastic gradient descent only if you have a good vectorized implementation." "In that case, the sum over 10 examples can be performed in a more vectorized way which will allow you to" "partially parallelize your computation over the ten examples." "So, in other words, by using appropriate vectorization to compute the rest of the terms, you can sometimes partially use the good numerical algebra libraries and parallelize your gradient computations over the b examples, whereas if you were looking at just a" "single example of time with Stochastic gradient descent then you know just looking at one example at a time their isn't much to parallelize over." "At least there is less to parallelize over." "One disadvantage of Mini-batch gradient descent is that there is now this extra parameter b, the" "Mini-batch size which you may have to fiddle with, and which may therefore take time." "But if you have a good vectorized implementation this can sometimes run even faster that Stochastic gradient descent." "So that was Mini-batch gradient descent which is an algorithm that in some sense does something that's somewhat in between what Stochastic gradient descent does and what Batch gradient descent does." "And if you choose their reasonable value of b." "I usually use b equals 10, but, you know, other values, anywhere from say 2 to 100, would be reasonably common." "So we choose value of b and if you use a good vectorized implementation, sometimes t can be faster than both Stochastic gradient descent and faster than Batch gradient descent." 
"You now know about the stochastic gradient descent algorithm." "But when you're running the algorithm, how do you make sure that it's completely debugged and is converging okay?" "Equally important, how do you tune the learning rate alpha with Stochastic Gradient Descent." "In this video we talk about some techniques for doing these things, for making sure it's converging and for picking the learning rate alpha." "Back when we were using batch gradient descent our standard way for making sure that when the set was converging was we would plot the authorization cost function as a function of the number of iterations." "So that was the cost function and we would make sure that this cost function is decreasing on every iteration." "When the training set sizes were small we could do that because we could compute the sum pretty efficiently when you have a massive training set size" "then you don't want to have to pause your algorithm periodically." "You don't want to have to stochastic gradient descent periodically in order to compute this cost function since it requires a sum of your entire training set size." "And the whole point of stochastic gradient was that you wanted to start [XX] after just looking at a single example without needing to occasionally scan through your entire training set right in the middle of the algorithm just to confuse things like the" "cost function of the entire training set converging," "here's what we can do instead." "Let's take the definition of the cost that we had previously." "So the cost of the parameters theta with respect to a single training example is just one half of the square error on that training example." "Then, while stochastic gradient descent is learning, right before we we train on a specific example." "So, in stochastic gradient descent we're going to look at the examples xi, yi, in order, and then sort of take a little update with respect to this example." "And we go on to the next example, xi plus 1, yi plus 1, and so on, right?" "That's what stochastic gradient descent does." "So, while the algorithm is looking at the example xi, yi, but before" "it has updated the parameters theta using that um, example, let's compute the cost of that example." "Just to say the same thing again, but using slightly different words." "stochastic gradient descent is scanning through our training set" "right before we have updated theta using a specific training example x i comma y i let's compute how well our hypothesis is doing on that training example." "And we want to do this before updating theta because if we've just updated theta using example, you know, that it might be doing better on that example than what would be representative." "Finally, in order to check for the convergence [xx] stochastic gradient descent, what we can do is every, say, every thousand iterations, we can plot these costs that we've been computing in the previous step." "We can plot those costs average over, say, the last thousand examples processed by the algorithm." "And if you do this, it kind of gives you a running estimate of how well the algorithm is doing." "on, you know, the last 1000 training examples that your has seen." "So, in contrast to computing which needed to scan through the entire training set." "With this other procedure, well, as part of stochastic gradient descent, it doesn't cost much to compute these costs as well right be updating to parameter theta." 
"And all we're doing is every thousand integrations or so, we just average the last 1,000 costs that we computed and plot that." "And by looking at those plots, this will allow us to check if stochastic gradient descent is converging." "So here are a few examples of what these plots might look like." "Suppose you have plotted the cost average over the last thousand examples, because these are averaged over just a thousand examples they are going to be a little bit noisy and so, it may not decrease on every single iteration." "Then if you get a figure that looks like this, So the plot is noisy because it's average over, you know, just a small subset, say a thousand training examples." "If you get a figure that looks like this, you know that would be a pretty decent run with the algorithm, maybe, where it looks like the cost has gone down and then this plateau that looks kind of" "flattened out, you know, starting from around there." "look like, this is what your cost looks like then maybe your learning algorithm has converged." "If you want to try using a smaller learning rate, something you might see is that the algorithm may initially learn more slowly so the cost goes down more slowly." "But then eventually you have a smaller learning rate is actually possible for the algorithm to end up at a, maybe very slightly better solution." "So the red line may represent the behavior of stochastic gradient descent using a slower, using a smaller And the reason this is the case is because, you remember, stochastic gradient descent doesn't just converge the global minimum, is that" "what it does is the parameters will oscillate a bit around the global minimum." "So by using a smaller learning rate, you'll end up with smaller oscillations." "And Sometimes this little difference will be negligible" "and sometimes with a smaller than you can get a slightly better value for the parameters." "Here are some other things that might happen." "Let's say you run stochastic gradient descent and you average over a thousand examples when plotting these costs." "So, you know, here might be the result of averaging another one of these plots." "Then again, it kind of looks like it's converged." "If you were to take this number, a thousand, and increase to averaging over 5 thousand examples." "Then it's possible that you might get a smoother curve that looks more like this." "And by averaging over, say 5,000 examples instead of" "1,000, you might be able to get a smoother curve like this." "And so that's the effect of increasing the number of examples you have is your over." "The disadvantage of making this too big of course is that now you get one date point only every 5,000 examples." "And so the feedback you get on how well your learning learning algorithm is doing is, sort of, maybe it's more delayed because you get one data point on your plot only every 5,000 examples rather than every 1,000 examples." "Along a similar vein some times you may run a gradient descent and end up with a plot that looks like this." "And with a plot that looks like this, you know, it looks like the cost just is not decreasing at all." "It looks like the algorithm is just not learning." 
"It's just, looks like things get flat, basically a flat curve and the cost is just not decreasing but, again if you were to increase this" "to averaging over a larger number of examples it is possible that you see something like this it looks like the cost actually is decreasing, it's just that the blue line averaging over 2, 3 examples, the" "blue line was too noisy so you couldn't see the actual trend in the cost actually decreasing and possibly averaging over 5,000 examples instead of 1,000 may help." "Of course we averaged over a larger number examples that we've averaged here over 5,000 examples, I'm just using a different color, it is also possible that you that see a learning curve ends up looking like this." "That it's still fat even when you average over a larger number of examples." "And as you get that, then that's maybe just a more firm verification that unfortunately the algorithm just isn't learning much for whatever reason." "And you need to either change the learning rate or change the features or change something else about the algorithm." "Finally, one last thing that you might see would be if you were to plot these and you see a curve that looks like this, where it actually looks like it's increasing." "And if that's the case then this is a sign that the algorithm is diverging." "And What you really should do is use a smaller value of the learning rate alpha." "So hopefully this gives you a sense of the range of phenomena you might see when you plot these cost average over some range of examples as well as suggests the sorts of things you might try to do in response to seeing different plots." "So if the plots looks too noisy, or if it wiggles up and down too much, then try increasing the number of examples you're averaging over so you can see the overall trend in the plot better." "And if you see that the errors are actually increasing, the costs are actually increasing, try using a smaller value of alpha." "Finally, it's worth examining the of the learning rate just a little bit more." "We saw that when we run stochastic gradient descent, the algorithm will start here and sort of meander towards the minimum And then it won't really converge, and instead it'll wander around the minimum forever." "And so you end up with a parameter value that is hopefully close to the global minimum that won't be exact at the global minimum." "In most typical implementations of stochastic gradient descent the learning rate alpha is typically held constant." "And so what you we end up is exactly a picture like this." "If you want stochastic gradient descent to actually converge to the global minimum, there's one thing which you can do which is" "You can slowly decrease the learning rate alpha over time." "So, a pretty typical way of doing that would be to set alpha equals some constant one divided by iteration number plus constant 2." "So, iteration number is the number of iterations you've run of stochastic gradient descent, so it's really the number of training examples you've And counts" "1 and counts 2 are additional parameters that you have that you might have to play with a bit in order to get good performance." "reasons people tend not to do this is because you end up needing to spend time playing with these 2 extra parameters, constant" "1 and constant 2, and so this makes the algorithm more finicky." "You know, it's just more parameters able to fiddle with in order to make the algorithm work well." 
"But if you manage to tune the parameters well, then the picture you can get is that the algorithm will actually around towards the minimum, but as it gets closer because you're decreasing the learning rate the meanderings will" "get smaller and smaller until it pretty much just to the global minimum." "I hope this makes sense, right?" "And the reason this formula makes sense is because as the algorithm runs, the iteration number becomes large So alpha will slowly become small, and so you take smaller and smaller steps until it sort of, you know, hopefully converges to the global minimum." "So If you do slowly decrease alpha to zero you can end up with a slightly better hypothesis." "But because of the extra work needed for the constants and because frankly you." "we're pretty happy with any parameter value that is, you know, pretty close to the global minimum." "Typically this process of decreasing alpha slowly is usually not done and Keeping the learning." "Alpha constant is the more common application of stochastic gradient descent although you will see people use either version." "To summarize in this video we talk about a way for approximately monitoring how the stochastic gradient descent is doing in terms for optimizing the cost function." "And this is a method that does not require scanning over the entire training set periodically to compute the cost function." "But instead it looks at say only the last thousand" "examples or so." "And you can use this method both to make sure the stochastic gradient descent is okay and is converging" "or to use it to tune the learning rate alpha." "In this video, I'd like to talk about a new large-scale machine learning setting called the online learning setting." "The online learning setting allows us to model problems where we have a continuous flood or a continuous stream of data coming in and we would like an algorithm to learn from that." "Today, many of the largest websites, or many of the largest website companies use different versions of online learning algorithms to learn from the flood of users that keep on coming to, back to the website." "Specifically, if you have a continuous stream of data generated by a continuous stream of users coming to your website, what you can do is sometimes use an online learning algorithm to learn user preferences from the stream of data and use that" "to optimize some of the decisions on your website." "Suppose you run a shipping service, so, you know, users come and ask you to help ship their package from location A to location B and suppose you run a website, where users repeatedly come and they" "tell you where they want to send the package from, and where they want to send it to" "(so the origin and destination) and your website offers to ship the package for some asking price, so I'll ship your package for $50," "I'll ship it for $20." "And based on the price that you offer to the users, the users sometimes chose to use a shipping service;" "that's a positive example and sometimes they go away and they do not choose to purchase your shipping service." "So let's say that we want a learning algorithm to help us to optimize what is the asking price that we want to offer to our users." "And specifically, let's say we come up with some sort of features that capture properties of the users." "If we know anything about the demographics, they capture, you know, the origin and destination of the package, where they want to ship the package." 
"And what is the price that we offer to them for shipping the package." "and what we want to do is learn what is the probability that they will elect to ship the package, using our shipping service given these features, and again just as a reminder these features X also captures the price that we're asking for." "And so if we could estimate the chance that they'll agree to use our service for any given price, then we can try to pick a price so that they have a pretty high probability of choosing our website while simultaneously" "hopefully offering us a fair return, offering us a fair profit for shipping their package." "So if we can learn this property of y equals 1 given any price and given the other features we could really use this to choose appropriate prices as new users come to us." "So in order to model the probability of y equals 1, what we can do is use logistic regression or neural network or some other algorithm like that." "But let's start with logistic regression." "Now if you have a website that just runs continuously, here's what an online learning algorithm would do." "I'm gonna write repeat forever." "This just means that our website is going to, you know, keep on staying up." "What happens on the website is occasionally a user will come and for the user that comes we'll get some x,y pair corresponding to a customer or to a user on the website." "So the features x are, you know, the origin and destination specified by this user and the price that we happened to offer to them this time around, and y is either one or zero depending one whether or" "not they chose to use our shipping service." "Now once we get this {x,y} pair, what an online learning algorithm does is then update the parameters theta using just this example x,y, and in particular we would update my parameters theta" "as Theta j get updated as Theta j minus the learning rate alpha times my usual gradient descent rule for logistic regression." "So we do this for j equals zero up to n, and that's my close curly brace." "So, for other learning algorithms instead of writing X-Y, right, I was writing things like Xi," "Yi but in this online learning setting where actually discarding the notion of there being a fixed training set instead we have an algorithm." "Now what happens as we get an example and then we learn using that example like so and then we throw that example away." "We discard that example and we never use it again and so that's why we just look at one example at a time." "We learn from that example." "We discard it." "Which is why, you know, we're also doing away with this notion of there being this sort of fixed training set indexed by i." "And, if you really run a major website where you really have a continuous stream of users coming, then this sort of online learning algorithm is actually a pretty reasonable algorithm." "Because of data is essentially free if you have so much data, that data is essentially unlimited then there is really may be no need to look at a training example more than once." "Of course if we had only a small number of users then rather than using an online learning algorithm like this, you might be better off saving away all your data in a fixed training set and then running some algorithm over that training set." "But if you really have a continuous stream of data, then an online learning algorithm can be very effective." "I should mention also that one interesting effect of this sort of online learning algorithm is that it can adapt to changing user preferences." 
"And in particular, if over time because of changes in the economy maybe users start to become more price sensitive and willing to pay, you know, less willing to pay high prices." "Or if they become less price sensitive and they're willing to pay higher prices." "Or if different things become more important to users, if you start to have new types of users coming to your website." "This sort of online learning algorithm can also adapt to changing user preferences and kind of keep track of what your changing population of users may be willing to pay for." "And it does that because if your pool of users changes, then these updates to your parameters theta will just slowly adapt your parameters to whatever your latest pool of users looks like." "Here's another example of a sort of application to which you might apply online learning." "this is an application in product search in which we want to apply learning algorithm to learn to give good search listings to a user." "Let's say you run an online store that sells phones - that sells mobile phones or sells cell phones." "And you have a user interface where a user can come to your website and type in the query like "Android phone 1080p camera"." "So 1080p is a type of a specification for a video camera that you might have on a phone, a cell phone, a mobile phone." "Suppose, suppose we have a hundred phones in our store." "And because of the way our website is laid out, when a user types in a query, if it was a search query, we would like to find a choice of ten different phones to show what to offer to the user." "What we'd like to do is have a learning algorithm help us figure out what are the ten phones out of the 100 we should return the user in response to a user-search query like the one here." "Here's how we can go about the problem." "For each phone and given a specific user query; we can construct a feature vector" "X. So the feature vector X might capture different properties of the phone." "It might capture things like, how similar the user search query is in the phones." "We capture things like how many words in the user search query match the name of the phone, how many words in the user search query match the description of the phone and so on." "So the features x capture properties of the phone and it captures things about how similar or how well the phone matches the user query along different dimensions." "What we like to do is estimate the probability that a user will click on the link for a specific phone, because we want to show the user phones that they are likely to want to buy, want to show the user" "phones that they have high probability of clicking on in the web browser." "So I'm going to define y equals one if the user clicks on the link for a phone and y equals zero otherwise and what I would like to do is learn the probability the user will click on a specific" "phone given, you know, the features x, which capture properties of the phone and how well the query matches the phone." "To give this problem a name in the language of people that run websites like this, the problem of learning this is actually called the problem of learning the predicted click-through rate, the predicted CTR." "It just means learning the probability that the user will click on the specific link that you offer them, so CTR is an abbreviation for click through rate." 
"And if you can estimate the predicted click-through rate for any particular phone, what we can do is use this to show the user the ten phones that are most likely to click on, because out of the hundred phones," "we can compute this for each of the 100 phones and just select the 10 phones that the user is most likely to click on, and this will be a pretty reasonable way to decide what ten results to show to the user." "Just to be clear, suppose that every time a user does a search, we return ten results what that will do is it will actually give us ten x,y pairs, this actually gives us ten training examples every" "time a user comes to our website because, because for the ten phone that we chose to show the user, for each of those 10 phones we get a feature vector X, and for each of those 10 phones we" "show the user we will also get a value for y, we will also observe the value of y, depending on whether or not we clicked on that url or not and so, one way to run a" "website like this would be to continuously show the user, you know, your ten best guesses for what other phones they might like and so, each time a user comes you would get ten examples, ten x,y pairs," "and then use an online learning algorithm to update the parameters using essentially 10 steps of gradient descent on these" "10 examples, and then you can throw the data away, and if you really have a continuous stream of users coming to your website, this would be a pretty reasonable way to learn parameters for your algorithm so as to show the ten phones" "to your users that may be most promising and the most likely to click on." "So, this is a product search problem or learning to rank phones, learning to search for phones example." "So, I'll quickly mention a few others." "One is, if you have a website and you're trying to decide, you know, what special offer to show the user, this is very similar to phones, or if you have a website and you show different users different news articles." "So, if you're a news aggregator website, then you can again use a similar system to select, to show to the user, you know, what are the news articles that they are most likely to be interested" "in and what are the news articles that they are most likely to click on." "Closely related to special offers, will we profit from recommendations." "And in fact, if you have a collaborative filtering system, you can even imagine a collaborative filtering system giving you additional features to feed into a logistic regression classifier to try to predict the click through rate for different products that you might recommend to a user." "Of course, I should say that any of these problems could also have been formulated as a standard machine learning problem, where you have a fixed training set." "Maybe, you can run your website for a few days and then save away a training set, a fixed training set, and run a learning algorithm on that." "But these are the actual sorts of problems, where you do see large companies get so much data, that there's really maybe no need to save away a fixed training set, but instead you can use an online learning algorithm to just learn continuously." "from the data that users are generating on your website." 
"So, that was the online learning setting and as we saw, the algorithm that we apply to it is really very similar to this schotastic gradient descent algorithm, only instead of scanning through a fixed training set, we're instead getting" "one example from a user, learning from that example, then discarding it and moving on." "And if you have a continuous stream of data for some application, this sort of algorithm may be well worth considering for your application." "And of course, one advantage of online learning is also that if you have a changing pool of users, or if the things you're trying to predict are slowly changing like your user taste is slowly changing, the online" "learning algorithm can slowly adapt your learned hypothesis to whatever the latest sets of user behaviors are like as well." "In the last few videos, we talked about stochastic gradient descent, and, you know, other variations of the stochastic gradient descent algorithm, including those adaptations to online learning, but all of those algorithms could be run on" "one machine, or could be run on one computer." "And some machine learning problems are just too big to run on one machine, sometimes maybe you just so much data you just don't ever want to run all that data through a single computer, no matter what algorithm you would use on that computer." "So in this video I'd like to talk about different approach to large scale machine learning, called the map reduce approach." "And even though we have quite a few videos on stochastic gradient descent and we're going to spend relative less time on map reduce--don't judge the relative importance of map reduce versus the gradient descent based on the amount amount of" "time I spend on these ideas in particular." "Many people will say that map reduce is at least an equally important, and some would say an even more important idea compared to gradient descent, only it's relatively simpler to explain, which is why I'm" "going to spend less time on it, but using these ideas you might be able to scale learning algorithms to even far larger problems than is possible using stochastic gradient descent." "Here's the idea." "Let's say we want to fit a linear regression model or a logistic regression model or some such, and let's start again with batch gradient descent, so that's our batch gradient descent learning rule." "And to keep the writing on this slide tractable, I'm going to assume throughout that we have m equals 400 examples." "Of course, by our standards, in terms of large scale machine learning, you know m might be pretty small and so, this might be more commonly applied to problems, where you have maybe closer to 400" "million examples, or some such, but just to make the writing on the slide simpler, I'm going to pretend we have 400 examples." "So in that case, the batch gradient descent learning rule has this 400 and the sum from i equals 1 through" "400 through my 400 examples here, and if m is large, then this is a computationally expensive step." "So, what the MapReduce idea does is the following, and" "I should say the map reduce idea is due to two researchers, Jeff Dean and Sanjay Gimawat." "Jeff Dean, by the way, is one of the most legendary engineers in all of Silicon Valley and he kind of built a large fraction of the architectural infrastructure that all of Google runs on today." "But here's the map reduce idea." 
"So, let's say I have some training set, if we want to denote by this box here of X Y pairs," "where it's X1, Y1, down to my 400 examples," "Xm, Ym." "So, that's my training set with 400 training examples." "In the MapReduce idea, one way to do, is split this training set in to different subsets." "I'm going to." "assume for this example that" "I have 4 computers, or 4 machines to run in parallel on my training set, which is why I'm splitting this into 4 machines." "If you have 10 machines or 100 machines, then you would split your training set into 10 pieces or 100 pieces or what have you." "And what the first of my 4 machines is to do, say, is use just the first one quarter of my training set--so use just the first 100 training examples." "And in particular, what it's going to do is look at this summation, and compute that summation for just the first 100 training examples." "So let me write that up" "I'm going to compute a variable temp 1 to superscript 1 the first machine J equals" "sum from equals 1 through 100, and then I'm going to plug in exactly that term there--so I have" "X-theta, Xi, minus Yi times Xij, right?" "So that's just that gradient descent term up there." "And then similarly, I'm going to take the second quarter of my data and send it to my second machine, and my second machine will use training examples 101 through 200 and you will compute similar variables of a temp to j which" "is the same sum for index from examples 101 through 200." "And similarly machines 3 and 4 will use the third quarter and the fourth quarter of my training set." "So now each machine has to sum over 100 instead of over 400 examples and so has to do only a quarter of the work and thus presumably it could do it about four times as fast." "Finally, after all these machines have done this work, I am going to take these temp variables" "and put them back together." "So I take these variables and send them all to a You know centralized master server and what the master will do is combine these results together." "and in particular, it will update my parameters theta j according to theta j gets updated as theta j" "minus Of the learning rate alpha times one over 400 times temp," "1, J, plus temp 2j plus temp 3j" "plus temp 4j and of course we have to do this separately for J equals 0." "You know, up to and within this number of features." "So operating this equation into I hope it's clear." "So what this equation" "is doing is exactly the same is that when you have a centralized master server that takes the results, the ten one j the ten two j ten three j and ten four j and adds them up and so of course the sum" "of these four things." "Right, that's just the sum of this, plus the sum of this, plus the sum of this, plus the sum of that, and those four things just add up to be equal to this sum that" "we're originally computing a batch stream descent." "And then we have the alpha times 1 of 400, alpha times 1 of 100, and this is exactly equivalent to the batch gradient descent algorithm, only, instead of needing to sum over all four hundred training" "examples on just one machine, we can instead divide up the work load on four machines." "So, here's what the general picture of the MapReduce technique looks like." "We have some training sets, and if we want to paralyze across four machines, we are going to take the training set and split it, you know, equally." "Split it as evenly as we can into four subsets." 
"Then we are going to take the 4 subsets of the training data and send them to 4 different computers." "And each of the 4 computers can compute a summation over just one quarter of the training set, and then finally take each of the computers takes the results, sends them to a centralized server, which then combines the results together." "So, on the previous line in that example, the bulk of the work in gradient descent, was computing the sum from i equals 1 to 400 of something." "So more generally, sum from i equals 1 to m of that formula for gradient descent." "And now, because each of the four computers can do just a quarter of the work, potentially you can get up to a 4x speed up." "In particular, if there were no network latencies and no costs of the network communications to send the data back and forth, you can potentially get up to a 4x speed up." "Of course, in practice, because of network latencies, the overhead of combining the results afterwards and other factors, in practice you get slightly less than a 4x speedup." "But, none the less, this sort of macro juice approach does offer us a way to process much larger data sets than is possible using a single computer." "If you are thinking of applying" "Map Reduce to some learning algorithm, in order to speed this up." "By paralleling the computation over different computers, the key question to ask yourself is, can your learning algorithm be expressed as a summation over the training set?" "And it turns out that many learning algorithms can actually be expressed as computing sums of functions over the training set and the computational expense of running them on large data sets is because they need to sum over a very large training set." "So, whenever your learning algorithm can be expressed as a sum of the training set and whenever the bulk of the work of the learning algorithm can be expressed as the sum of the training set, then mac reviews might a good candidate" "for scaling your learning algorithms through very, very good data sets." "Lets just look at one more example." "Let's say that we want to use one of the advanced optimization algorithm." "So, things like, you know, I, b, f, g, s constant gradient and so on, and let's say we want to train a logistic regression of the algorithm." "For that, we need to compute two main quantities." "One is for the advanced optimization algorithms like, you know, LPF and constant gradient." "We need to provide it a routine to compute the cost function of the optimization objective." "And so for logistic regression, you remember that a cost function has this sort of sum over the training set, and so if youre paralizing over ten machines, you would split up the training set onto ten machines and have each" "of the ten machines compute the sum of this quantity over just one tenth of the training" "data." "Then, the other thing that the advanced optimization algorithms need, is a routine to compute these partial derivative terms." "Once again, these derivative terms, for which it's a logistic regression, can be expressed as a sum over the training set, and so once again, similar to our earlier example, you would have each machine compute that summation" "over just some small fraction of your training data." "And finally, having computed all of these things, they could then send their results to a centralized server, which can then add up the partial sums." 
"This corresponds to adding up those tenth i or tenth ij variables, which were computed locally on machine number i, and so the centralized server can sum these things up and get the overall cost function and get the overall partial derivative," "which you can then pass through the advanced optimization algorithm." "So, more broadly, by taking other learning algorithms and expressing them in sort of summation form or by expressing them in terms of computing sums of functions over the training set, you can use the MapReduce technique to" "parallelize other learning algorithms as well, and scale them to very large training sets." "Finally, as one last comment, so far we have been discussing MapReduce algorithms as allowing you to parallelize over multiple computers, maybe multiple computers in a computer cluster or over multiple computers in the data center." "It turns out that sometimes even if you have just a single computer," "MapReduce can also be applicable." "In particular, on many single computers now, you can have multiple processing cores." "You can have multiple CPUs, and within each CPU you can have multiple proc cores." "If you have a large training set, what you can do if, say, you have a computer with 4 computing cores, what you can do is, even on a single computer you can split the training sets into pieces and" "send the training set to different cores within a single box, like within a single desktop computer or a single server and use" "MapReduce this way to divvy up work load." "Each of the cores can then carry out the sum over, say, one quarter of your training set, and then they can take the partial sums and combine them, in order to get the summation over the entire training set." "The advantage of thinking about MapReduce this way, as paralyzing over cause within a single machine, rather than parallelizing over multiple machines is that, this way you don't have to worry about network latency, because all the communication, all the" "sending of the [xx] back and forth, all that happens within a single machine." "And so network latency becomes much less of an issue compared to if you were using this to over different computers within the data sensor." "Finally, one last caveat on parallelizing within a multi-core machine." "Depending on the details of your implementation, if you have a multi-core machine and if you have certain numerical linear algebra libraries." "It turns out that the sum numerical linear algebra libraries that can automatically parallelize their linear algebra operations across multiple cores within the machine." "So if you're fortunate enough to be using one of those numerical linear algebra libraries and certainly this does not apply to every single library." "If you're using one of those libraries and." "If you have a very good vectorizing implementation of the learning algorithm." "Sometimes you can just implement you standard learning algorithm in a vectorized fashion and not worry about parallelization and numerical linear algebra libararies" "could take care of some of it for you." "So you don't need to implement [xx] but." "for other any problems, taking advantage of this sort of map reducing commentation, finding and using this" "MapReduce formulation and to paralelize a cross coarse except yourself might be a good idea as well and could let you speed up your learning algorithm." 
"In this video, we talked about the MapReduce approach to parallelizing machine learning by taking a data and spreading them across many computers in the data center." "Although these ideas are critical to paralysing across multiple cores within a single computer" "as well." "Today there are some good open source implementations of MapReduce, so there are many users in open source system called" "Hadoop and using either your own implementation or using someone else's open source implementation, you can use these ideas to parallelize learning algorithms and get them to run on much larger data sets than is possible using just a single machine." "In this and the next few videos, I want to tell you about a machine learning application example, or a machine learning application history centered around an application called Photo OCR ." "There are three reasons why I want to do this, first I wanted to show you an example of how a complex machine learning system can be put together." "Second, once told the concepts of a machine learning a type line and how to allocate resources when you're trying to decide what to do next." "And this can either be in the context of you working by yourself on the big application Or it can be the context of a team of developers trying to build a complex application together." "And then finally, the Photo" "OCR problem also gives me an excuse to tell you about just a couple more interesting ideas for machine learning." "One is some ideas of how to apply machine learning to computer vision problems, and second is the idea of artificial data synthesis, which we'll see in a couple of videos." "So, let's start by talking about what is the Photo OCR problem." "Photo OCR stands for" "Photo Optical Character Recognition." "With the growth of digital photography and more recently the growth of camera in our cell phones we now have tons of visual pictures that we take all over the place." "And one of the things that has interested many developers is how to get our computers to understand the content of these pictures a little bit better." "The photo OCR problem focuses on how to get computers to read the text to the purest in images that we take." "Given an image like this it might be nice if a computer can read the text in this image so that if you're trying to look for this picture again you type in the words, lulu bees and and have it automatically pull" "up this picture, so that you're not spending lots of time digging through your photo collection Maybe hundreds of thousands of pictures in." "The Photo OCR problem does exactly this, and it does so in several steps." "First, given the picture it has to look through the image and detect where there is text in the picture." "And after it has done that or if it successfully does that it then has to look at these text regions and actually read the text in those regions, and hopefully if it reads it correctly, it'll come" "up with these transcriptions of what is the text that appears in the image." "Whereas OCR, or optical character recognition of scanned documents is relatively easier problem, doing OCR from photographs today is still a very difficult machine learning problem, and you can do this." "Not only can this help our computers to understand the content of our though images better, there are also applications like helping blind people, for example, if you could provide to a blind person a camera that can look" "at what's in front of them, and just tell them the words that my be on the street sign in front of them." 
"With car navigation systems." "For example, imagine if your car could read the street signs and help you navigate to your destination." "In order to perform photo OCR, here's what we can do." "First we can go through the image and find the regions where there's text and image." "So, shown here is one example of text and image that the photo OCR system may find." "Second, given the rectangle around that text region, we can then do character segmentation, where we might take this text box that says "Antique Mall" and try to segment it out into the locations of the individual characters." "And finally, having segmented out into individual characters, we can then run a crossfire, which looks at the images of the visual characters, and tries to figure out the first character's an" "A, the second character's an" "N, the third character is a T, and so on, so that up by doing all this how that hopefully you can then figure out that this phrase is Rulegee's antique mall and similarly for some of" "the other words that appear in that image." "I should say that there are some photo OCR systems that do even more complex things, like a bit of spelling correction at the end." "So if, for example, your character segmentation and character classification system tells you that it sees the word c 1 e a n i n g." "Then, you know, a sort of spelling correction system might tell you that this is probably the word 'cleaning', and your character classification algorithm had just mistaken the l for a 1." "But for the purpose of what we want to do in this video, let's ignore this last step and just focus on the system that does these three steps of text detection, character segmentation, and character classification." "A system like this is what we call a machine learning pipeline." "In particular, here's a picture showing the photo OCR pipeline." "We have an image, which then fed to the text detection system text regions, we then segment out the characters--the individual characters in the text--and then finally we recognize the individual characters." "In many complex machine learning systems, these sorts of pipelines are common, where you can have multiple modules--in this example, the text detection, character segmentation, character recognition modules--each of which may be machine learning component," "or sometimes it may not be a machine learning component but to have a set of modules that act one after another on some piece of data in order to produce the output you want, which in the photo OCR example" "is to find the transcription of the text that appeared in the image." "If you're designing a machine learning system one of the most important decisions will often be what exactly is the pipeline that you want to put together." "In other words, given the photo" "OCR problem, how do you break this problem down into a sequence of different modules." "And you design the pipeline and each the performance of each of the modules in your pipeline." "will often have a big impact on the final performance of your algorithm." "If you have a team of engineers working on a problem like this is also very common to have different individuals work on different modules." "So I could easily imagine tech easily being the of anywhere from 1 to 5 engineers, character segmentation maybe another" "1-5 engineers, and character recognition being another 1-5" "engineers, and so having a pipeline like often offers a natural way to divide up the workload amongst different members of an engineering team, as well." 
"Although, or course, all of this work could also be done by just one person if that's how you want to do it." "In complex machine learning systems the idea of a pipeline, of a machine of a pipeline, is pretty pervasive." "And what you just saw is a specific example of how a Photo OCR pipeline might work." "In the next few videos I'll tell you a little bit more about this pipeline, and we'll continue to use this as an example to illustrate--I think--a few more key concepts of machine learning." "In the previous video, we talked about the photo OCR pipeline and how that worked." "In which we would take an image and pass the Through a sequence of machine learning components in order to try to read the text that appears in an image." "In this video I like to." "A little bit more about how the individual components of the pipeline works." "In particular most of this video will center around the discussion." "of whats called a sliding windows." "The first stage of the filter was the" "Text detection where we look at an image like this and try to find the regions of text that appear in this image." "Text detection is an unusual problem in computer vision." "Because depending on the length of the text you're trying to find, these rectangles that you're trying to find can have different aspect." "So in order to talk about detecting things in images let's start with a simpler example of pedestrian detection and we'll then later go back to." "Ideas that were developed in pedestrian detection and apply them to text detection." "So in pedestrian detection you want to take an image that looks like this and the whole idea is the individual pedestrians that appear in the image." "So there's one pedestrian that we found, there's a second one, a third one a fourth one, a fifth one." "And a one." "This problem is maybe slightly simpler than text detection just for the reason that the aspect ratio of most pedestrians are pretty similar." "Just using a fixed aspect ratio for these rectangles that we're trying to find." "So by aspect ratio I mean the ratio between the height and the width of these rectangles." "They're all the same." "for different pedestrians but for text detection the height and width ratio is different for different lines of text" "Although for pedestrian detection, the pedestrians can be different distances away from the camera and so the height of these rectangles can be different depending on how far away they are." "but the aspect ratio is the same." "In order to build a pedestrian detection system here's how you can go about it." "Let's say that we decide to standardize on this aspect ratio of 82 by 36 and we could have chosen some rounded number like 80 by 40 or something, but 82 by 36 seems alright." "What we would do is then go out and collect large training sets of positive and negative examples." "Here are examples of 82" "X 36 image patches that do contain pedestrians and here are examples of images that do not." "On this slide I show 12 positive examples of y1 and 12 examples of y0." "In a more typical pedestrian detection application, we may have anywhere from a 1,000 training examples up to maybe" "10,000 training examples, or even more if you can get even larger training sets." "And what you can do, is then train in your network or some other learning algorithm to take this input, an MS patch of dimension 82 by" "36, and to classify 'y' and to classify that image patch as either containing a pedestrian or not." 
"So this gives you a way of applying supervised learning in order to take an image patch can determine whether or not a pedestrian appears in that image capture." "Now, lets say we get a new image, a test set image like this and we want to try to find a pedestrian's picture image." "What we would do is start by taking a rectangular patch of this image." "Like that shown up here, so that's maybe a 82 X" "36 patch of this image, and run that image patch through our classifier to determine whether or not there is a pedestrian in that image patch, and hopefully our classifier will return y equals 0 for that patch, since there is no pedestrian." "Next, we then take that green rectangle and we slide it over a bit and then run that new image patch through our classifier to decide if there's a pedestrian there." "And having done that, we then slide the window further to the right and run that patch through the classifier again." "The amount by which you shift the rectangle over each time is a parameter, that's sometimes called the step size of the parameter, sometimes also called the slide parameter, and if you step this one pixel at a time." "So you can use the step size or stride of 1, that usually performs best, that is more cost effective, and so using a step size of maybe 4 pixels at a time, or eight pixels at a" "time or some large number of pixels might be more common, since you're then moving the rectangle a little bit more each time." "So, using this process, you continue stepping the rectangle over to the right a bit at a time and running each of these patches through a classifier, until eventually, as you slide this window over the different locations in the image," "first starting with the first row and then we go further rows in the image, you would then run all of these different image patches at some step size or some stride through your classifier." "Now, that was a pretty small rectangle, that would only detect pedestrians of one specific size." "What we do next is start to look at larger image patches." "So now let's take larger images patches, like those shown here and run those through the crossfire as well." "And by the way when I say take a larger image patch, what" "I really mean is when you take an image patch like this, what you're really doing is taking that image patch, and resizing it down to 82 X 36, say." "So you take this larger patch and re-size it to be smaller image and then it would be the smaller size image that is what you would pass through your classifier to try and decide if there is a pedestrian in that patch." "And finally you can do this at an even larger scales and run that side of Windows to the end And after this whole process hopefully your algorithm will detect whether theres pedestrian appears in the image, so thats how you train a" "the classifier, and then use a sliding windows classifier, or use a sliding windows detector in order to find pedestrians in the image." "Let's have a turn to the text detection example and talk about that stage in our photo OCR pipeline, where our goal is to find the text regions in unit." "similar to pedestrian detection you can come up with a label training set with positive examples and negative examples with examples corresponding to regions where text appears." "So instead of trying to detect pedestrians, we're now trying to detect texts." "And so positive examples are going to be patches of images where there is text." "And negative examples is going to be patches of images where there isn't text." 
"Having trained this we can now apply it to a new image, into a test" "set image." "So here's the image that we've been using as example." "Now, last time we run, for this example we are going to run a sliding windows at just one fixed scale just for purpose of illustration, meaning that" "I'm going to use just one rectangle size." "But lets say I run my little sliding windows classifier on lots of little image patches like this if I do that, what Ill end up with is a result like this where the white region show where my text detection system has found" "text and so the axis' of these two figures are the same." "So there is a region up here, of course also a region up here, so the fact that this black up here represents that the classifier does not think it's found any texts up there, whereas the" "fact that there's a lot of white stuff here, that reflects that classifier thinks that it's found a bunch of texts." "over there on the image." "What i have done on this image on the lower left is actually use white to show where the classifier thinks it has found text." "And different shades of grey correspond to the probability that was output by the classifier, so like the shades of grey corresponds to where it thinks it might have found text but has lower confidence the bright white response to whether the classifier," "up with a very high probability, estimated probability of there being pedestrians in that location." "We aren't quite done yet because what we actually want to do is draw rectangles around all the region where this text in the image, so were going to take one more step which is we take the output" "of the classifier and apply to it what is called an expansion operator." "So what that does is, it take the image here," "and it takes each of the white blobs, it takes each of the white regions and it expands that white region." "Mathematically, the way you implement that is, if you look at the image on the right, what we're doing to create the image on the right is, for every pixel we are going to ask, is it withing" "some distance of a white pixel in the left image." "And so, if a specific pixel is within, say, five pixels or ten pixels of a white pixel in the leftmost image, then we'll also color that pixel white in the rightmost image." "And so, the effect of this is, we'll take each of the white blobs in the leftmost image and expand them a bit, grow them a little bit, by seeing whether the nearby pixels, the white pixels," "and then coloring those nearby pixels in white as well." "Finally, we are just about done." "We can now look at this right most image and just look at the connecting components and look at the as white regions and draw bounding boxes around them." "And in particular, if we look at all the white regions, like this one, this one, this one, and so on, and if we use a simple heuristic to rule out rectangles whose aspect ratios look funny because we" "know that boxes around text should be much wider than they are tall." "And so if we ignore the thin, tall blobs like this one and this one, and we discard these ones because they are too tall and thin, and we then draw a the rectangles around the ones whose aspect" "ratio thats a height to what ratio looks like for text regions, then we can draw rectangles, the bounding boxes around this text region, this text region, and that text region, corresponding to the Lula B's antique mall logo," "the Lula B's, and this little open sign." "Of over there." "This example by the actually misses one piece of text." 
"This is very hard to read, but there is actually one piece of text there." "That says [xx] are corresponding to this but the aspect ratio looks wrong so we discarded that one." "So you know it's ok on this image, but in this particular example the classifier actually missed one piece of text." "It's very hard to read because there's a piece of text written against a transparent window." "So that's text detection using sliding windows." "And having found these rectangles with the text in it, we can now just cut out these image regions and then use later stages of pipeline to try to meet the texts." "Now, you recall that the second stage of pipeline was character segmentation, so given an image like that shown on top, how do we segment out the individual characters in this image?" "So what we can do is again use a supervised learning algorithm with some set of positive and some set of negative examples, what were going to do is look in the image patch and try to decide if there" "is split between two characters right in the middle of that image match." "So for initial positive examples." "This first cross example, this image patch looks like the middle of it is indeed" "the middle has splits between two characters and the second example again this looks like a positive example, because if I split two characters by putting a line right down the middle, that's the right thing to do." "So, these are positive examples, where the middle of the image represents a gap or a split" "between two distinct characters, whereas the negative examples, well, you know, you don't want to split two characters right in the middle, and so these are negative examples because they don't represent the midpoint between two characters." "So what we will do is, we will train a classifier, maybe using new network, maybe using a different learning algorithm, to try to classify between the positive and negative examples." "Having trained such a classifier, we can then run this on this sort of text that our text detection system has pulled out." "As we start by looking at that rectangle, and we ask," ""Gee, does it look like the middle of that green rectangle, does it look like the midpoint between two characters?"." "And hopefully, the classifier will say no, then we slide the window over and this is a one dimensional sliding window classifier, because were going to slide the window only in one straight line from left to right, theres no different rows here." "There's only one row here." "But now, with the classifier in this position, we ask, well, should we split those two characters or should we put a split right down the middle of this rectangle." "And hopefully, the classifier will output y equals one, in which case we will decide to draw a line down there, to try to split two characters." "Then we slide the window over again, optic process, don't close the gap, slide over again, optic says yes, do split there and so on, and we slowly slide the classifier over to the" "right and hopefully it will classify this as another positive example and so on." "And we will slide this window over to the right, running the classifier at every step, and hopefully it will tell us, you know, what are the right locations" "to split these characters up into, just split this image up into individual characters." "And so thats 1D sliding windows for character segmentation." "So, here's the overall photo OCR pipe line again." "In this video we've talked about the text detection step, where we use sliding windows to detect text." 
"And we also use a one-dimensional sliding windows to do character segmentation to segment out, you know, this text image in division of characters." "The final step through the pipeline is the character qualification step and that step you might already be much more familiar with the early videos on supervised learning where you can apply a standard supervised learning within maybe on your network or maybe something" "else in order to take it's input, an image like that and classify which alphabet or which 26 characters A to Z, or maybe we should have 36 characters if you have the numerical digits as well, the multi class" "classification problem where you take it's input and image contained a character and decide what is the character that appears in that image?" "So that was the photo OCR pipeline and how you can use ideas like sliding windows classifiers in order to put these different components to develop a photo OCR system." "In the next few videos we keep on using the problem of photo OCR to explore somewhat interesting issues surrounding building an application like this." "I've seen over and over that one of the most reliable ways to get a high performance machine learning system is to take a low bias learning algorithm and to train it on a massive training set." "But where did you get so much training data from?" "Turns out that the machine earnings there's a fascinating idea called artificial data synthesis, this doesn't apply to every single problem, and to apply to a specific problem, often takes some thought and innovation and insight." "But if this idea applies to your machine, only problem, it can sometimes be a an easy way to get a huge training set to give to your learning algorithm." "The idea of artificial data synthesis comprises of two variations, main the first is if we are essentially creating data from [xx], creating new data from scratch." "And the second is if we already have it's small label training set and we somehow have amplify that training set or use a small training set to turn that into a larger training set and in this video we'll go over both those ideas." "To talk about the artificial data synthesis idea, let's use the character portion of the photo OCR pipeline, we want to take it's input image and recognize what character it is." "If we go out and collect a large label data set, here's what it is and what it look like." "For this particular example, I've chosen a square aspect ratio." "So we're taking square image patches." "And the goal is to take an image patch and recognize the character in the middle of that image patch." "And for the sake of simplicity," "I'm going to treat these images as grey scale images, rather than color images." "It turns out that using color doesn't seem to help that much for this particular problem." "So given this image patch, we'd like to recognize that that's a" "T. Given this image patch, we'd like to recognize that it's an 'S'." "Given that image patch we would like to recognize that as an 'I' and so on." "So all of these, our examples of row images, how can we come up with a much larger training set?" "Modern computers often have a huge font library and if you use a word processing software, depending on what word processor you use, you might have all of these fonts and many, many more Already stored inside." 
"And, in fact, if you go different websites, there are, again, huge, free font libraries on the internet we can download many, many different types of fonts, hundreds or perhaps thousands of different fonts." "So if you want more training examples, one thing you can do is just take characters from different fonts" "and paste these characters against different random backgrounds." "So you might take this ---- and paste that c against a random background." "If you do that you now have a training example of an image of the character C." "So after some amount of work, you know this, and it is a little bit of work to synthisize realistic looking data." "But after some amount of work, you can get a synthetic training set like that." "Every image shown on the right was actually a synthesized image." "Where you take a font, maybe a random font downloaded off the web and you paste an image of one character or a few characters from that font against this other random background image." "And then apply maybe a little blurring operators -----of app finder, distortions that app finder, meaning just the sharing and scaling and little rotation operations and if you do that you get a synthetic training set, on what the one shown here." "And this is work, grade, it is, it takes thought at work, in order to make the synthetic data look realistic, and if you do a sloppy job in terms of how you create the synthetic data then it actually won't work well." "But if you look at the synthetic data looks remarkably similar to the real data." "And so by using synthetic data you have essentially an unlimited supply of training examples for artificial training synthesis And so, if you use this source synthetic data, you have essentially unlimited supply of label data to create a improvised learning algorithm" "for the character recognition problem." "So this is an example of artificial data synthesis where youre basically creating new data from scratch, you just generating brand new images from scratch." "The other main approach to artificial data synthesis is where you take a examples that you currently have, that we take a real example, maybe from real image, and you create additional data, so as to amplify your training set." "So here is an image of a compared to a from a real image, not a synthesized image, and" "I have overlayed this with the grid lines just for the purpose of illustration." "Actually have these ----." "So what you can do is then take this alphabet here, take this image and introduce artificial warpings[sp?" "]" "or artificial distortions into the image so they can take the image a and turn that into 16 new examples." "So in this way you can take a small label training set and amplify your training set to suddenly get a lot more examples, all of it." "Again, in order to do this for application, it does take thought and it does take insight to figure out what our reasonable sets of distortions, or whether these are ways that amplify and multiply your training set, and for" "the specific example of character recognition, introducing these warping seems like a natural choice, but for a different learning machine application, there may be different the distortions that might make more sense." "Let me just show one example from the totally different domain of speech recognition." "So the speech recognition, let's say you have audio clips and you want to learn from the audio clip to recognize what were the words spoken in that clip." "So let's see how one labeled training example." 
"So let's say you have one labeled training example, of someone saying a few specific words." "So let's play that audio clip here." "0 -1-2-3-4-5." "Alright, so someone counting from 0 to 5, and so you want to try to apply a learning algorithm to try to recognize the words said in that." "So, how can we amplify the data set?" "Well, one thing we do is introduce additional audio distortions into the data set." "So here I'm going to add background sounds to simulate a bad cell phone connection." "When you hear beeping sounds, that's actually part of the audio track, that's nothing wrong with the speakers, I'm going to play this now." "0-1-2-3-4-5." "Right, so you can listen to that sort of audio clip and recognize the sounds, that seems like another useful training example to have, here's another example, noisy background." "Zero, one, two, three four five you know of cars driving past, people walking in the background, here's another one, so taking the original clean audio clip so taking the clean audio of someone saying 0 1 2 3" "4 5 we can then automatically synthesize these additional training examples and thus amplify one training example into maybe four different training examples." "So let me play this final example, as well." "0-1 3-4-5 So by taking just one labelled example, we have to go through the effort to collect just one labelled example fall of the 01205, and by synthesizing additional distortions, by introducing different background sounds, we've now multiplied this one" "example into many more examples." "Much work by just automatically adding these different background sounds to the clean audio Just one word of warning about synthesizing" "data by introducing distortions: if you try to do this yourself, the distortions you introduce should be representative the source of noises, or distortions, that you might see in the test set." "So, for the character recognition example, you know, the working things begin introduced are actually kind of reasonable, because an image" "A that looks like that, that's, could be an image that we could actually see in a test set.Reflect a fact And, you know, that image on the upper-right, that could be an image that we could imagine seeing." "And for audio, well, we do wanna recognize speech, even against a bad self internal connection, against different types of background noise, and so for the audio, we're again synthesizing examples are actually representative of the sorts of" "examples that we want to classify, that we want to recognize correctly." "In contrast, usually it does not help perhaps you actually a meaning as noise to your data." "I'm not sure you can see this, but what we've done here is taken the image, and for each pixel, in each of these 4 images, has just added some random Gaussian noise to each pixel." "To each pixel, is the pixel brightness, it would just add some, you know, maybe Gaussian random noise to each pixel." "So it's just a totally meaningless noise, right?" "And so, unless you're expecting to see these sorts of pixel wise noise in your test set, this sort of purely random meaningless noise is less likely to be useful." "But the process of artificial data synthesis it is you know a little bit of an art as well and sometimes you just have to try it and see if it works." 
"But if you're trying to decide what sorts of distortions to add, you know, do think about what other meaningful distortions you might add that will cause you to generate additional training examples that are at least somewhat representative of the" "sorts of images you expect to see in your test sets." "Finally, to wrap up this video, I just wanna say a couple of words, more about this idea of getting loss of data via artificial data synthesis." "As always, before expending a lot of effort, you know, figuring out how to create artificial training" "examples, it's often a good practice is to make sure that you really have a low biased crossfire, and having a lot more training data will be of help." "And standard way to do this is to plot the learning curves, and make sure that you only have a low as well, high variance falsifier." "Or if you don't have a low bias falsifier, you know, one other thing that's worth trying is to keep increasing the number of features that your classifier has, increasing the number of hidden units in your network," "saying, until you actually have a low bias falsifier, and only then, should you put the effort into creating a large, artificial training set, so what you really want to avoid is to, you know, spend" "a whole week or spend a few months figuring out how to get a great artificially synthesized data set." "Only to realize afterward, that, you know, your learning algorithm, performance doesn't improve that much, even when you're given a huge training set." "So that's about my usual advice about of a testing that you really can make use of a large training set before spending a lot of effort going out to get that large training set." "Second is, when i'm working on machine learning problems, one question" "I often ask the team" "I'm working with, often ask my students, which is, how much work would it be to get 10 times as much date as we currently had." "When I face a new machine learning application very often I will sit down with a team and ask exactly this question," "I've asked this question over and over and over and I've been very surprised how often this answer has been that." "You know, it's really not that hard, maybe a few days of work at most, to get ten times as much data as we currently have for a machine running application and very often if you can get" "ten times as much data there will be a way to make your algorithm do much better." "So, you know, if you ever join the product team" "working on some machine learning application product this is a very good questions ask yourself ask the team don't be too surprised if after a few minutes of brainstorming if your team comes up with a way to get literally ten" "times this much data, in which case, I think you would be a hero to that team, because with 10 times as much data, I think you'll really get much better performance, just from learning from so much data." "So there are several waysand that comprised both the ideas of generating data from scratch using random fonts and so on." "As well as the second idea of taking an existing example and and introducing distortions that amplify to enlarge the training set A couple of other examples of ways to get a lot more data are to collect the data or to label them yourself." 
"So one useful calculation that" "I often do is, you know, how many minutes, how many hours does it take to get a certain number of examples, so actually sit down and figure out, you know, suppose it takes me ten seconds to" "label one example then and, suppose that, for our application, currently we have 1000 labeled examples examples so ten times as much of that would be if n were equal to ten thousand." "A second way to get a lot of data is to just collect the data and you label it yourself." "So what I mean by this is" "I will often set down and do a calculation to figure out how much time, you know just like how many hours" "will it take, how many hours or how many days will it take for me or for someone else to just sit down and collect ten times as much data, as we have currently, by collecting the data ourselves and labeling them ourselves." "So, for example, that, for our machine learning application, currently we have 1,000 examples, so M 1,000." "That what we do is sit down and ask, how long does it take me really to collect and label one example." "And sometimes maybe it will take you, you know ten seconds to label" "one new example, and so if I want 10 X as many examples, I'd do a calculation." "If it takes me 10 seconds to get one training example." "If I wanted to get 10 times as much data, then I need 10,000 examples." "So I do the calculation, how long is it gonna take to label, to manually label 10,000 examples, if it takes me 10 seconds to label 1 example." "So when you do this calculation, often I've seen many you would be surprised, you know, how little, or sometimes a few days at work, sometimes a small number of days of work, well I've seen many teams be very" "surprised that sometimes how little work it could be, to just get a lot more data, and let that be a way to give your learning app to give you a huge boost in performance, and necessarily, you" "know, sometimes when you've just managed to do this, you will be a hero and whatever product development, whatever team you're working on, because this can be a great way to get much better performance." "Third and finally, one sometimes good way to get a lot of data is to use what's now called crowd sourcing." "So today, there are a few websites or a few services that allow you to hire people on the web to, you know, fairly inexpensively label large training sets for you." "So this idea of crowd sourcing, or crowd sourced data labeling, is something that has, is obviously, like an entire academic literature, has some of it's own complications and so on, pertaining to labeler reliability." "Maybe, you know, hundreds of thousands of labelers, around the world, working fairly inexpensively to help label data for you, and that I've just had mentioned, there's this one alternative as well." "And probably Amazon Mechanical Turk systems is probably the most popular crowd sourcing option right now." "This is often quite a bit of work to get to work, if you want to get very high quality labels, but is sometimes an option worth considering as well." "If you want to try to hire many people, fairly inexpensively on the web, our labels launch miles of data for you." "So this video, we talked about the idea of artificial data synthesis of either creating new data from scratch, looking, using the ramming funds as an example, or by amplifying an existing training set, by taking" "existing label examples and introducing distortions to it, to sort of create extra label examples." 
"And finally, one thing that" "I hope you remember from this video this idea of if you are facing a machine learning problem, it is often worth doing two things." "One just a sanity check, with learning curves, that having more data would help." "And second, assuming that that's the case," "I will often seat down and ask yourself seriously: what would it take to get ten times as much creative data as you currently have, and not always, but sometimes, you may be surprised by how easy that" "turns out to be, maybe a few days, a few weeks at work, and that can be a great way to give your learning algorithm a huge boost in performance" "By now you've seen the anomaly detection algorithm and we've also talked about how to evaluate an anomaly detection algorithm." "It turns out, that when you're applying anomaly detection, one of the things that has a huge effect on how well it does, is what features you use, and what features you choose, to give the anomaly detection algorithm." "So in this video, what I'd like to do is say a few words, give some suggestions and guidelines for how to go about designing or selecting features give to an anomaly detection algorithm." "In our anomaly detection algorithm, one of the things we did was model the features using this sort of Gaussian distribution." "With xi to mu i, sigma squared i, lets say." "And so one thing that" "I often do would be to plot the data or the histogram of the data, to make sure that the data looks vaguely" "Gaussian before feeding it to my anomaly detection algorithm." "And, it'll usually work okay, even if your data isn't Gaussian, but this is sort of a nice sanitary check to run." "And by the way, in case your data looks non-Gaussian, the algorithms will often work just find." "But, concretely if I plot the data like this, and if it looks like a histogram like this, and the way to plot a histogram is to use the HIST, or the" "HIST command in Octave, but it looks like this, this looks vaguely Gaussian, so if my features look like this," "I would be pretty happy feeding into my algorithm." "But if i were to plot a histogram of my data, and it were to look like this well, this doesn't look at all like a bell shaped curve, this is a very asymmetric distribution, it has a peak way off to one side." "If this is what my data looks like, what I'll often do is play with different transformations of the data in order to make it look more Gaussian." "And again the algorithm will usually work okay, even if you don't." "But if you use these transformations to make your data more gaussian, it might work a bit better." "So given the data set that looks like this, what I might do is take a log transformation of the data and if i do that and re-plot the histogram, what I end up with in this particular example," "is a histogram that looks like this." "And this looks much more Gaussian, right?" "This looks much more like the classic bell shaped curve, that we can fit with some mean and variance paramater sigma." "So what I mean by taking a log transform, is really that if I have some feature x1 and then the histogram of x1 looks like this then I might take my feature x1 and replace it with log of x1 and this is" "my new x1 that I'll plot to the histogram over on the right, and this looks much more Guassian." 
"Rather than just a log transform some other things you can do, might be, let's say" "I have a different feature x2, maybe I'll replace that will log x plus 1, or more generally with log" "x with x2 and some constant c and this constant could be something that I play with, to try to make it look as Gaussian as possible." "Or for a different feature x3, maybe" "I'll replace it with x3," "I might take the square root." "The square root is just x3 to the power of one half, right?" "And this one half is another example of a parameter I can play with." "So, I might have x4 and maybe I might instead replace that with x4 to the power of something else, maybe to the power of 1/3." "And these, all of these, this one, this exponent parameter, or the" "C parameter, all of these are examples of parameters that you can play with in order to make your data look a little bit more Gaussian." "So, let me show you a live demo of how I actually go about playing with my data to make it look more Gaussian." "So, I have already loaded in to octave here a set of features x I have a thousand examples loaded over there." "So let's pull up the histogram of my data." "Use the hist x command." "So there's my histogram." "By default, I think this uses 10 bins of histograms, but I want to see a more fine grid histogram." "So we do hist to the x, 50, so, this plots it in 50 different bins." "Okay, that looks better." "Now, this doesn't look very Gaussian, does it?" "So, lets start playing around with the data." "Lets try a hist of x to the 0.5." "So we take the square root of the data, and plot that histogram." "And, okay, it looks a little bit more Gaussian, but not quite there, so let's play at the 0.5 parameter." "Let's see." "Set this to 0.2." "Looks a little bit more Gaussian." "Let's reduce a little bit more 0.1." "Yeah, that looks pretty good." "I could actually just use 0.1." "Well, let's reduce it to 0.05." "And, you know?" "Okay, this looks pretty Gaussian, so I can define a new feature which is x mu equals x to the 0.05, and now my new feature x Mu looks more" "Gaussian than my previous one and then I might instead use this new feature to feed into my anomaly detection algorithm." "And of course, there is more than one way to do this." "You could also have hist of log of x, that's another example of a transformation you can use." "And, you know, that also look pretty Gaussian." "So, I can also define x mu equals log of x." "and that would be another pretty good choice of a feature to use." "So to summarize, if you plot a histogram with the data, and find that it looks pretty non-Gaussian, it's worth playing around a little bit with different transformations like these, to see if you can make" "your data look a little bit more" "Gaussian, before you feed it to your learning algorithm, although even if you don't, it might work okay." "But I usually do take this step." "Now, the second thing I want to talk about is, how do you come up with features for an anomaly detection algorithm." "And the way I often do so, is via an error analysis procedure." "So what I mean by that, is that this is really similar to the error analysis procedure that we have for supervised learning, where we would train a complete algorithm, and run the algorithm on a cross validation set," "and look at the examples it gets wrong, and see if we can come up with extra features to help the algorithm do better on the examples that it got wrong in the cross-validation set." "So lets try to reason through an example of this process." 
"In anomaly detection, we are hoping that p of x will be large for the normal examples and it will be small for the anomalous examples." "And so a pretty common problem would be if p of x is comparable, maybe both are large for both the normal and the anomalous examples." "Lets look at a specific example of that." "Let's say that this is my unlabeled data." "So, here I have just one feature, x1 and so I'm gonna fit a Gaussian to this." "And maybe my Gaussian that I fit to my data looks like that." "And now let's say I have an anomalous example, and let's say that my anomalous example takes on an x value of 2.5." "So I plot my anomalous example there." "And you know, it's kind of buried in the middle of a bunch of normal examples, and so," "just this anomalous example that I've drawn in green, it gets a pretty high probability, where it's the height of the blue curve, and the algorithm fails to flag this as an anomalous example." "Now, if this were maybe aircraft engine manufacturing or something, what" "I would do is, I would actually look at my training examples and look at what went wrong with that particular aircraft engine, and see, if looking at that example can inspire me to come up with a new feature" "x2, that helps to distinguish between this bad example, compared to the rest of my red examples, compared to all" "of my normal aircraft engines." "And if I managed to do so, the hope would be then, that, if I can create a new feature, X2, so that when I re-plot my data, if" "I take all my normal examples of my training set, hopefully" "I find that all my training examples are these red crosses here." "And hopefully, if I find that for my anomalous example, the feature x2 takes on the the unusual value." "So for my green example here, this anomaly, right, my" "X1 value, is still 2.5." "Then maybe my X2 value, hopefully it takes on a very large value like 3.5 over there," "or a very small value." "But now, if I model my data, I'll find that my anomaly detection algorithm gives high probability to data in the central regions, slightly lower probability to that, sightly lower probability to that." "An example that's all the way out there, my algorithm will now give very low probability" "to." "And so, the process of this is, really look at the" "mistakes that it is making." "Look at the anomaly that the algorithm is failing to flag, and see if that inspires you to create some new feature." "So find something unusual about that aircraft engine and use that to create a new feature, so that with this new feature it becomes easier to distinguish the anomalies from your good examples." "And so that's the process of error analysis" "and using that to create new features for anomaly detection." "Finally, let me share with you my thinking on how I usually go about choosing features for anomaly detection." "So, usually, the way I think about choosing features is" "I want to choose features that will take on either very, very large values, or very, very small values, for examples that I think might turn out to be anomalies." "So let's use our example again of monitoring the computers in a data center." "And so you have lots of machines, maybe thousands, or tens of thousands of machines in a data center." "And we want to know if one of the machines, one of our computers is acting up, so doing something strange." "So here are examples of features you may choose, maybe memory used, number of disc accesses, CPU load, network traffic." 
"But now, lets say that I suspect one of the failure cases, let's say that in my data set I think that CPU load the network traffic tend to grow linearly with each other." "Maybe I'm running a bunch of web servers, and so, here if one of my servers is serving a lot of users," "I have a very high CPU load, and have a very high network traffic." "But let's say, I think, let's say I have a suspicion, that one of the failure cases is if one of my computers has a job that gets stuck in some infinite loop." "So if I think one of the failure cases, is one of my machines, one of my web servers--server code-- gets stuck in some infinite loop, and so the CPU load grows, but the network traffic doesn't because" "it's just spinning it's wheels and doing a lot of CPU work, you know, stuck in some infinite loop." "In that case, to detect that type of anomaly, I might create a new feature, X5, which might be CPU load" "divided by network traffic." "And so here X5 will take on a unusually large value if one of the machines has a very large CPU load but not that much network traffic and so this will be a feature that will help your anomaly detection capture, a certain type of anomaly." "And you can also get creative and come up with other features as well." "Like maybe I have a feature x6 thats CPU load squared divided by network traffic." "And this would be another variant of a feature like x5 to try to capture anomalies where one of your machines has a very high CPU load, that maybe doesn't have a commensurately large network traffic." "And by creating features like these, you can start to capture" "anomalies that correspond to unusual combinations of values of the features." "So in this video we talked about how to and take a feature, and maybe transform it a little bit, so that it becomes a bit more Gaussian, before feeding into an anomaly detection algorithm." "And also the error analysis in this process of creating features to try to capture different types of anomalies." "And with these sorts of guidelines hopefully that will help you to choose good features, to give to your anomaly detection algorithm, to help it capture all sorts of anomalies." "in earlier videos, I have said over and over that, when you are developing machine learning system, one of the most valuable resources is your time as the developer in terms of picking what to work on next." "Or, you have a team of developers or a team of engineers working together on a machine learning system, again one of the most valuable resources is the time of the engineers or the developers working on the system." "And what you really want to avoid is that you or your colleagues or your friends spend a lot of time working on some component, only to realize after weeks or months of time spent, that all that work, you know, just doesn't" "make a huge difference on the performance of the final system." "In this video, what I'd like to to is, to talk about something called ceiling analysis." "When you or your team are working on a pipeline machine learning system, this can sometimes give you a very strong signal, a very strong guidance, on what parts of the pipeline might be the best use of your time to work on." 
"To talk about ceiling analysis, I'm going to keep on using the example of the photo" "OCR pipeline and I said earlier each of these boxes text detection, character segmentation, character recognition, each of these boxes can have even a small engineering team working on it, or maybe the entire system is just built" "by you, either way, but the question is, where should you allocate resources?" "Which of these boxes is most worth your efforts, trying to improve the performance of." "In order to explain the idea of ceiling analysis, I'm going to keep using the example of our photo OCR pipeline." "As I mentioned earlier, each of these boxes here, each of these machine learning components could be the work of even a small team of engineers, or maybe the whole system could be built by just one person." "But the question is, where should you allocate scarce resources?" "Now this, which of these components, or which one or two or maybe all three of these components is most worth your time to try to improve the performance of." "So here's the idea of ceiling analysis." "As in the development process for other machine learning systems as well, in order to make decisions on what to do for developing the system is going to be very helpful to have a single road number evaluation metric for this learning system." "So let's say we pick characters level accuracy." "So if, you know, given a test set image, while just a fraction of alphabets of characters in the testing image that" "we recognize correctly." "Or you can pick some other single world number evaluation metric, if you want, but let's say that whatever evaluation metric we pick, we get that, we find that the overall system currently has 72% accuracy." "So, in other words, we have some set of test set images and for each test set images, we run it through text section, then character 7 nation, then character recognition, and we find that on our test set, the" "overall accuracy of the entire system was 72% on one of the metric you chose." "Now just the idea behind sealing analysis which is that we're going to go to let see the first module of a machinery pipelines text detection." "And what we are going to do is we are going to monkey around with the test set." "We are going to go to the test set and for every test example we are just going to provide it the correct text detection outputs." "In other words, we are going to the test set and just manually tell the algorithm where the text is in each of the test examples." "So in other words, we are going to simulate what happens if we have a text detection system with a 100% accuracy, for the purpose" "of detecting text in an image." "And really the way you do that is very simple right, instead of letting your learning algorithm detect the text in the images." "You wouldn't say go to the images and just manually label what is the location of the text in my test set image." "And you would then let these correct, so let these ground true labels of where as the text be part of your text set and use these ground true labels what you feed in to the next stage of the pipeline, to the character segmentation pipeline." "So just said it again, by putting a checkmark over here, what I mean is Im going to go to my test set and just give it the correct answers, give it the correct labels, for the text detection part of the pipeline." "So that, as it, I have a perfect text detection system on my test One into do that run this data to the rest of five points paper presentation and counter definition." 
"And then, use the same evaluation metric as before, to measure what is the overall accuracy of the entire system." "And with perfect hopefully the performance goes up." "Let 's say it goes up 89% and then were going to keep going, next lets go to the next selection of pipeline, two character segmentation and again were going to go to my test." "And now going to give the correct text detection output and give the correct character segmentation outputs and" "manually label the correct segment orientations of text into individual characters." "And see how much that helps." "And let's say it goes up to 90% accuracy for the overall system." "Alright so as always the accuracy is." "Accuracy of the overall systems." "So whatever the final output of the character recognition system is." "Whatever the final output of the overall pipeline is, it's going to measure the accuracy of that." "And then finally like character recognition system and give that the correct label as well." "And if I do that too then, no surprise that I should get a 100% accuracy." "Now, the nice thing about having done this analysis analysis is we can now understand what is the upside potential for improving each of these components." "So we see that if we get perfect text detection." "Our performance went up from 72 to 89 percent, so that's' a 17 percent performance gain." "So this means that you've to take your current system you spend a lot of time improving text detection." "That means that we could potentially improve our system's performance by 17 percent." "This seems like it's well worth our while." "Whereas in contrast, when going from text detection When we gave it perfect character segmentation, performance went up only by one percent." "So, that's a more sobering message." "It means that no matter how much time you spend character segmentation," "maybe the upside potential is going to be pretty small, and maybe you do not want to have a large team of engineers working on character segmentation that this sort of analysis shows that even when you give it the" "perfect character segmentation, your performance goes up by only one percent." "So right there, this is really estimates." "What is the ceiling, or what's an upper bound on how much you can improve the performance of your system by working on one of these components?" "And finally, going for character, when we get better character recognition, the performance went up by ten percent." "So you know, again you can decide, is a ten percent improvement, how much is that working out?" "It tells you that maybe with more efforts spent on the last station of the pipeline, you can improve the performance of the systems as well." "Another way of thinking about this is that, by going through this sort of analysis you're trying to figure out, you know, what is the upside potential, of improving each of these components or how much could you possibly gain if" "one of these components became absolutely perfect and just really places an upper bound on the performance of that system." "So, the idea of ceiling analysis is pretty important." "Let me just illustrate this idea again, but with a different example but a more complex one." "Let's say that you want to do face recognition from images, so unless you want to look at the picture and recognize whether or not the person in this picture is a particular friend of yours, trying to recognize the person shown in this image." "This is a slightly artificial example." 
"This isn't actually how face recognition is done in practice, but I want to step through an example of what a pipeline might look like to give you another example of how a ceiling analysis process might look." "So, we have a camera image and let's say that we design a pipeline as follows." "Let's say the first thing you want to do is do pre-processing of the image, so let's take those images like I have shown on the upper right, and let's say we want to remove the background, so" "through pre-processing the background disappears." "Next we want to say detect the face of the person." "That's usually done with a learning algorithm." "So we'll run a sliding windows crossfire to draw a box around the person's face." "Having detected the face it turns out that if you want to recognize people it turns out that the eyes is a highly useful cue." "We actually, in terms ofrecognizing your friends, the appearance of their eyes is actually one of the most important cues that you use." "So let's run another crossfire to detect the eyes of the person." "So, segment out the eyes, and then and since this will give us useful features to recognize a person, and then other parts of the face of physical interest." "Maybe segment out the nose, segment out the mouth, and then, having found the eyes, the nose and the mouth, all of these give us useful features to maybe feed into a logistic regression crossfire." "And it's the job of the crossfire to then give us the overall label to find the label for who we think is the identity of this person." "So this is a kind of complicated pipeline." "It's actually probably more complicated than you should be using, if you actually want to recognize people." "But there's an illustrative example that's useful to think about for ceiling analysis." "So how do you go through ceiling analysis for this pipeline?" "Well, we'll step through these pieces one at a time." "Let's say your overall system has 85 percent accuracy, the first thing I do is go to my test set and manually give it a ground foreground, background, segmentations, and then manually go to the test set, and use Photoshop" "or something, to just tell it where's the background, and just manually remove the background, so ground true background, and see how much the accuracy changes." "In this example, the accuracy goes up by 0.1% so this is a strong sign that even if you had perfect background segmentation your performance, even if perfect background removal, the performance of your system isn't going to go up that much." "So this is maybe not worth a huge effort to work on pre-processing, on background removal." "Then, everything goes to the test set, given the correct face detection images, then again step through the eyes, nose, mouth segmentations in some order." "Pick one order." "Let's give the correct location of the eyes, correct location of the nose, correct location of the mouth, and then finally if I just give it the correct overall label, I get 100% accuracy." "And so, you know, as" "I go through the system and just give more and more components the correct labels in the test set, the performance" "So if the overall system goes up, and you can look at how much the performance went up on different steps, so, you know, from giving it the perfect face detection, and it looks like the overall" "performance of this system went up by 5.9 percent." "So that's a pretty big jump, means that maybe it's worth quite a bit of effort on better face detection." 
"Went four percent there, went one percent there, one percent there and three percent there." "So it looks like the components that most worth our while are, when" "I gave it perfect face detection, system went up." "By 5.9 performance, might give it perfect eye segmentation, went up by 4%, and then my final logistical crossfire, well there's another 3 percent gap there maybe." "And so, this tells us maybe one of the components that are most worth our while working on." "And by the way, I want to tell you, it's a true cautionary story." "The reason I put in this pre-processing background removal is because I actually know of a true story where there was a research team that actually literally had two people spend about a year and a half, spend 18 months, working on" "better background removal." "We are rushing here..." "I am obscuring the details for obvious reasons, but there was a computer vision application where there was a team of two engineers who literally spent I think about a year and a half, working on better background removal." "Actually they worked out really complicated algorithms, so I ended up publishing I think, one research paper." "But after all that work they found that, it just did not make a huge difference to the overall performance of the actual application they were working on." "And if only, you know if only someone were to do a [xx] analysis beforehand, maybe we could have realized this." "And one of them said to me afterward, you know, if only they had done the sort of analysis like this, maybe they could have realized before that 18 months of work, that they should have spent their effort focusing" "on some different component than literally spending 18 months working on background removal." "So to summarize, pipelines are pretty pervasive and complex machine learning applications." "And when you are working on a big machine learning application, I mean I think your time as a developer is so valuable." "So just don't waste your time working on something that ultimately isn't going to matter." "And in this video, we talked about this idea of ceiling analysis, which I've often found to be a very good tool for identifying the component, and if you actually put a focused effort on that component, and make a" "big difference, it would actually have a huge effect on the overall performance of your final system." "So, over the years, working with machine learning, I've actually learned to not trust my own gut feeling about what component to work on." "So, very often, when you have worked with machine learning for a long time, but often, our local machine learning problem, and I may have some gut feeling about, oh, let's, you know, jump on that component, and just spend more time on that." "That over the years that I have come to even trust my own gut feelings and knowing not to trust gut feelings that much and instead really have a solid machine learning problem, where it's possible to structure things." "To do a ceiling analysis often does a much better and much more reliable way for deciding where to put a focused effort into, to really improve this, the performance of some component and we kind of be sure that when" "do that it will actually have a huge effect on the final performance of your process system." "Welcome to this free online class on machine learning." "Machine learning is one of the most exciting recent technologies." "And in this class, you learn about the state of the art and also gain practice implementing and deploying these algorithms yourself." 
"You've probably use a learning algorithm dozens of times a day without knowing it." "Every time you use a web search engine like Google or Bing to search the internet, one of the reasons that works so well is because a learning algorithm, one implemented by Google or Microsoft, has learned how to rank web" "pages." "Every time you use Facebook or Apple's photo typing application and it recognizes your friends' photos, that's also machine learning." "Every time you read your email and your spam filter saves you from having to wade through tons of spam email, that's also a learning algorithm." "For me one of the reasons I'm excited is the AI dream of someday building machines as intelligent as you or me." "We're a long way away from that goal, but many AI researchers believe that the best way to towards that goal is through learning algorithms that try to mimic how the human brain learns." "I'll tell you a little bit about that too in this class." "In this class you learn about state-of-the-art machine learning algorithms." "But it turns out just knowing the algorithms and knowing the math isn't that much good if you don't also know how to actually get this stuff to work on problems that you care about." "So, we've also spent a lot of time developing exercises for you to implement each of these algorithms and see how they work fot yourself." "So why is machine learning so prevalent today?" "It turns out that machine learning is a field that had grown out of the field of AI, or artificial intelligence." "We wanted to build intelligent machines and it turns out that there are a few basic things that we could program a machine to do such as how to find the shortest path from A to B." "But for the most part we just did not know how to write AI programs to do the more interesting things such as web search or photo tagging or email anti-spam." "There was a realization that the only way to do these things was to have a machine learn to do it by itself." "So, machine learning was developed as a new capability for computers and today it touches many segments of industry and basic science." "For me, I work on machine learning and in a typical week I might end up talking to helicopter pilots, biologists, a bunch of computer systems people (so my colleagues here at Stanford) and averaging two or three times a week I get email from" "people in industry from Silicon Valley contacting me who have an interest in applying learning algorithms to their own problems." "This is a sign of the range of problems that machine learning touches." "There is autonomous robotics, computational biology, tons of things in Silicon Valley that machine learning is having an impact on." "Here are some other examples of machine learning." "There's database mining." "One of the reasons machine learning has so pervaded is the growth of the web and the growth of automation All this means that we have much larger data sets than ever before." "So, for example tons of Silicon Valley companies are today collecting web click data, also called clickstream data, and are trying to use machine learning algorithms to mine this data to understand the users better and to serve the users" "better, that's a huge segment of Silicon Valley right now." "Medical records." "With the advent of automation, we now have electronic medical records, so if we can turn medical records into medical knowledge, then we can start to understand disease better." "Computational biology." 
"With automation again, biologists are collecting lots of data about gene sequences, DNA sequences, and so on, and machines running algorithms are giving us a much better understanding of the human genome, and what it means to be human." "And in engineering as well, in all fields of engineering, we have larger and larger, and larger and larger data sets, that we're trying to understand using learning algorithms." "A second range of machinery applications is ones that we cannot program by hand." "So for example, I've worked on autonomous helicopters for many years." "We just did not know how to write a computer program to make this helicopter fly by itself." "The only thing that worked was having a computer learn by itself how to fly this helicopter. [Helicopter whirling]" "Handwriting recognition." "It turns out one of the reasons it's so inexpensive today to route a piece of mail across the countries, in the US and internationally, is that when you write an envelope like this, it turns out there's a learning" "algorithm that has learned how to read your handwriting so that it can automatically route this envelope on its way, and so it costs us a few cents to send this thing thousands of miles." "And in fact if you've seen the fields of natural language processing or computer vision, these are the fields of AI pertaining to understanding language or understanding images." "Most of natural language processing and most of computer vision today is applied machine learning." "Learning algorithms are also widely used for self- customizing programs." "Every time you go to" "Amazon or Netflix or iTunes Genius, and it recommends the movies or products and music to you, that's a learning algorithm." "If you think about it they have million users; there is no way to write a million different programs for your million users." "The only way to have software give these customized recommendations is to become learn by itself to customize itself to your preferences." "Finally learning algorithms are being used today to understand human learning and to understand the brain." "We'll talk about how researches are using this to make progress towards the big AI dream." "A few months ago, a student showed me an article on the top twelve IT skills." "The skills that information technology hiring managers cannot say no to." "It was a slightly older article, but at the top of this list of the twelve most desirable IT skills was machine learning." "Here at" "Stanford, the number of recruiters that contact me asking if I know any graduating machine learning students is far larger than the machine learning students we graduate each year." "So I think there is a vast, unfulfilled demand for this skill set, and this is a great time to be learning about machine learning, and I hope to teach you a lot about machine learning in this class." "In the next video, we'll start to give a more formal definition of what is machine learning." "And we'll begin to talk about the main types of machine learning problems and algorithms." "You'll pick up some of the main machine learning terminology, and start to get a sense of what are the different algorithms, and when each one might be appropriate." "What is machine learning?" "In this video we will try to define what it is and also try to give you a sense of when you want to use machine learning." "Even among machine learning practitioners there isn't a well accepted definition of what is and what isn't machine learning." 
"But let me show you a couple of examples of the ways that people have tried to define it." "Here's the definition of what is machine learning does to Arthur Samuel." "He defined machine learning as the field of study that gives computers the ability to learn without being explicitly programmed." "Samuel's claim to fame was that back in the 1950's, he wrote a checkers playing program." "And the amazing thing about this checkers playing program, was that Arthur Samuel himself, wasn't a very good checkers player." "But what he did was, he had to program for it to play 10's of 1000's of games against itself." "And by watching what sorts of board positions tended to lead to wins, and what sort of board positions tended to lead to losses." "The checkers playing program learns over time what are good board positions and what are bad board positions." "And eventually learn to play checkers better than Arthur Samuel himself was able to." "This was a remarkable result." "Although Samuel himself turned out not to be a very good checkers player." "But because the computer has the patience to play tens of thousands of games itself." "No human, has the patience to play that many games." "By doing this the computer was able to get so much checkers-playing experience that it eventually became a better checkers player than Arthur Samuel himself." "This is somewhat informal definition, and an older one." "Here's a slightly more recent definition by Tom" "Mitchell, who's a friend out of Carnegie Mellon." "So Tom defines machine learning by saying that, a well posed learning problem is defined as follows." "He says, a computer program is said to learn from experience E, with respect to some task T, and some performance measure P, if its performance on T as measured by P improves with experience E. I actually think he came up with this definition just to make it" "rhyme." "For the checkers playing example the experience e, will be the experience of having the program play 10's of 1000's of games against itself." "The task t, will be the task of playing checkers." "And the performance measure p, will be the probability that it wins the next game of checkers against some new opponent." "Throughout these videos, besides me trying to teach you stuff, I will occasionally ask you a question to make sure you understand the content." "Here's one, on top is a definition of machine learning by Tom" "Mitchell." "Let's say your email program watches which emails you do or do not flag as spam." "So in an email client like this you might click this spam button to report some email as spam, but not other emails and." "Based on which emails you mark as spam, so your e-mail program learns better how to filter spam e-mail." "What is the task T in this setting?" "In a few seconds, the video will pause." "And when it does so, you can use your mouse to select one of these four radio buttons to let, to let me know which of these four you think is the right answer to this question." "That might be a performance measure P. And so, our task performance on the task our system's performance on the task T, on the performance measure P will improve after the experience E. In this class I hope to teach you about various different types of" "learning algorithms." "There are several different types of learning algorithms." "The main two types are what we call supervised learning and unsupervised learning." "I'll define what these terms mean more in the next couple videos." 
"But it turns out that in supervised learning, the idea is that we're going to teach the computer how to do something, whereas in unsupervised learning we're going let it learn by itself." "Don't worry if these two terms don't make sense yet, in the next two videos I'm going to say exactly what these two types of learning are." "You will also hear other buzz terms such as reinforcement learning and recommender systems." "These are other types of machine learning algorithms that we'll talk about later but the two most used types of learning algorithms are probably supervised learning and unsupervised learning and I'll define them in the next two videos and we'll spend most of this class talking about these two types of" "learning algorithms." "It turns out one of the other things we'll spend a lot of time on in this class is practical advice for applying learning algorithms." "This is something that I feel pretty strongly about, and it's actually something that I don't know of any other university teaches." "Teaching about learning algorithms is like giving you a set of tools, and equally important or more important to giving you the tools is to teach you how to apply these tools." "I like to make an analogy to learning to become a carpenter." "Imagine that someone is teaching you how to be a carpenter and they say here's a hammer, here's a screwdriver, here's a saw, good luck." "Well, that's no good, right?" "You, you, you have all these tools, but the more important thing, is to learn how to use these tools properly." "There's a huge difference between, between people that know how to use these machines learning algorithms, versus people who don't know how to use these tools well." "Here in Silicon Valley where I live, when I go visit different companies even at the top Silicon Valley companies very often I see people are trying to apply machine learning algorithms to some problem and sometimes they have been going at it for six months." "But sometimes when I look at what they're doing I, I, I say, you know, I could have told them like, gee, I could have told you six months ago that you should be taking a learning algorithm and" "applying it in like the slightly modified way and your chance of success would have been much higher." "So what we're going to do in this class is actually spend a lot of time talking about how, if you actually tried to develop a machine learning system, how to make those best practices type decisions about the way in which you" "build your system so that when you're applying learning algorithm you're less likely to end up one of those people who end up pursuing some path for six months that, you know, someone else could have figured out it just wasn't gonna work at" "all and it's just a waste of time for six months." "So I'm actually going to spend a lot of the time teaching you those sorts of best practices in machine learning and" "AI and how to get this stuff to work and how we do it, how the best people do it in" "Silicon Valley and around the world." "I hope to make you one of the best people in knowing how to design and build serious machine learning and AI systems." "So, that's machine learning and these are the main topics I hope to teach." "In the next video, I'm going to define what is supervised learning and after that, what is unsupervised learning." "And also, start to talk about when you would use each of them." "In this video I am going to define what is probably the most common type of machine learning problem, which is supervised learning." 
"I'll define supervised learning more formally later, but it's probably best to explain or start with an example of what it is and we'll do the formal definition later." "Let's say you want to predict housing prices." "A while back, a student collected data sets from the" "Institute of Portland Oregon." "And let's say you plot a data set and it looks like this." "Here on the horizontal axis, the size of different houses in square feet, and on the vertical axis, the price of different houses in thousands of dollars." "So." "Given this data, let's say you have a friend who owns a house that is, say 750 square feet and hoping to sell the house and they want to know how much they can get for the house." "So how can the learning algorithm help you?" "One thing a learning algorithm might be able to do is put a straight line through the data or to fit a straight line to the data and, based on that, it looks like maybe the house can be" "sold for maybe about $150,000." "But maybe this isn't the only learning algorithm you can use." "There might be a better one." "For example, instead of sending a straight line to the data, we might decide that it's better to fit a quadratic function or a second-order polynomial to this data." "And if you do that, and make a prediction here, then it looks like, well, maybe we can sell the house for closer to" "$200,000." "One of the things we'll talk about later is how to choose and how to decide do you want to fit a straight line to the data or do you want to fit the quadratic function to the data and there's no fair picking whichever one gives your" "friend the better house to sell." "But each of these would be a fine example of a learning algorithm." "So this is an example of a supervised learning algorithm." "And the term supervised learning refers to the fact that we gave the algorithm a data set in which the "right answers" were given." "That is, we gave it a data set of houses in which for every example in this data set, we told it what is the right price so what is the actual price that, that house sold for and the toss of the" "algorithm was to just produce more of these right answers such as for this new house, you know, that your friend may be trying to sell." "To define with a bit more terminology this is also called a regression problem and by regression problem I mean we're trying to predict a continuous value output." "Namely the price." "So technically I guess prices can be rounded off to the nearest cent." "So maybe prices are actually discrete values, but usually we think of the price of a house as a real number, as a scalar value, as a continuous value number and the term regression refers to the fact that we're trying to predict the sort of continuous" "values attribute." "Here's another supervised learning example, some friends and I were actually working on this earlier." "Let's see you want to look at medical records and try to predict of a breast cancer as malignant or benign." "If someone discovers a breast tumor, a lump in their breast, a malignant tumor is a tumor that is harmful and dangerous and a benign tumor is a tumor that is harmless." "So obviously people care a lot about this." "Let's see a collected data set and suppose in your data set you have on your horizontal axis the size of the tumor and on the vertical axis I'm going to plot one or zero, yes or no, whether or not these are" "examples of tumors we've seen before are malignant €"which is one €"or zero if not malignant or benign." 
"So let's say our data set looks like this where we saw a tumor of this size that turned out to be benign." "One of this size, one of this size." "And so on." "And sadly we also saw a few malignant tumors, one of that size, one of that size, one of that size..." "So on." "So this example..." "I have five examples of benign tumors shown down here, and five examples of malignant tumors shown with a vertical axis value of one." "And let's say we have a friend who tragically has a breast tumor, and let's say her breast tumor size is maybe somewhere around this value." "The machine learning question is, can you estimate what is the probability, what is the chance that a tumor is malignant versus benign?" "To introduce a bit more terminology this is an example of a classification problem." "The term classification refers to the fact that here we're trying to predict a discrete value output: zero or one, malignant or benign." "And it turns out that in classification problems sometimes you can have more than two values for the two possible values for the output." "As a concrete example maybe there are three types of breast cancers and so you may try to predict the discrete value of zero, one, two, or three with zero being benign." "Benign tumor, so no cancer." "And one may mean, type one cancer, like, you have three types of cancer, whatever type one means." "And two may mean a second type of cancer, a three may mean a third type of cancer." "But this would also be a classification problem, because this other discrete value set of output corresponding to, you know, no cancer, or cancer type one, or cancer type two, or cancer type three." "In classification problems there is another way to plot this data." "Let me show you what I mean." "Let me use a slightly different set of symbols to plot this data." "So if tumor size is going to be the attribute that I'm going to use to predict malignancy or benignness, I can also draw my data like this." "I'm going to use different symbols to denote my benign and malignant, or my negative and positive examples." "So instead of drawing crosses," "I'm now going to draw O's for the benign tumors." "Like so." "And I'm going to keep using X's to denote my malignant tumors." "Okay?" "I hope this is beginning to make sense." "All I did was I took, you know, these, my data set on top and I just mapped it down." "To this real line like so." "And started to use different symbols, circles and crosses, to denote malignant versus benign examples." "Now, in this example we use only one feature or one attribute, mainly, the tumor size in order to predict whether the tumor is malignant or benign." "In other machine learning problems when we have more than one feature, more than one attribute." "Here's an example." "Let's say that instead of just knowing the tumor size, we know both the age of the patients and the tumor size." "In that case maybe your data set will look like this where I may have a set of patients with those ages and that tumor size and they look like this." "And a different set of patients, they look a little different, whose tumors turn out to be malignant, as denoted by the crosses." "So, let's say you have a friend who tragically has a tumor." "And maybe, their tumor size and age falls around there." 
"So given a data set like this, what the learning algorithm might do is throw the straight line through the data to try to separate out the malignant tumors from the benign ones and, so the learning algorithm may decide" "to throw the straight line like that to separate out the two classes of tumors." "And." "You know, with this, hopefully you can decide that your friend's tumor is more likely to if it's over there, that hopefully your learning algorithm will say that your friend's tumor falls on this benign side and is therefore more" "likely to be benign than malignant." "In this example we had two features, namely, the age of the patient and the size of the tumor." "In other machine learning problems we will often have more features, and my friends that work on this problem, they actually use other features like these, which is clump thickness, the clump thickness of the breast tumor." "Uniformity of cell size of the tumor." "Uniformity of cell shape of the tumor, and so on, and other features as well." "And it turns out one of the interes-, most interesting learning algorithms that we'll see in this class is a learning algorithm that can deal with, not just two or three or five features, but an infinite" "number of features." "On this slide, I've listed a total of five different features." "Right, two on the axes and three more up here." "But it turns out that for some learning problems, what you really want is not to use, like, three or five features." "But instead, you want to use an infinite number of features, an infinite number of attributes, so that your learning algorithm has lots of attributes or features or cues with which to make those predictions." "So how do you deal with an infinite number of features." "How do you even store an infinite number of things on the computer when your computer is gonna run out of memory." "It turns out that when we talk about an algorithm called the Support Vector" "Machine, there will be a neat mathematical trick that will allow a computer to deal with an infinite number of features." "Imagine that I didn't just write down two features here and three features on the right." "But, imagine that I wrote down an infinitely long list, I just kept writing more and more and more features." "Like an infinitely long list of features." "Turns out, we'll be able to come up with an algorithm that can deal with that." "So, just to recap." "In this class we'll talk about supervised learning." "And the idea is that, in supervised learning, in every example in our data set, we are told what is the "correct answer" that we would have quite liked the algorithms have predicted on that example." "Such as the price of the house, or whether a tumor is malignant or benign." "We also talked about the regression problem." "And by regression, that means that our goal is to predict a continuous valued output." "And we talked about the classification problem, where the goal is to predict a discrete value output." "Just a quick wrap up question:" "Suppose you're running a company and you want to develop learning algorithms to address each of two problems." "In the first problem, you have a large inventory of identical items." "So imagine that you have thousands of copies of some identical items to sell and you want to predict how many of these items you sell within the next three months." 
"In the second problem, problem two, you'd like-- you have lots of users and you want to write software to examine each individual of your customer's accounts, so each one of your customer's accounts; and for each account," "decide whether or not the account has been hacked or compromised." "So, for each of these problems, should they be treated as a classification problem, or as a regression problem?" "When the video pauses, please use your mouse to select whichever of these four options on the left you think is the correct answer." "So hopefully, you got that this is the answer." "For problem one, I would treat this as a regression problem, because if I have, you know, thousands of items, well, I would probably just treat this as a real value, as a continuous value." "And treat, therefore, the number of items I sell, as a continuous value." "And for the second problem, I would treat that as a classification problem, because I might say, set the value I want to predict with zero, to denote the account has not been hacked." "And set the value one to denote an account that has been hacked into." "So just like, you know, breast cancer, is, zero is benign, one is malignant." "So I might set this be zero or one depending on whether it's been hacked, and have an algorithm try to predict each one of these two discrete values." "And because there's a small number of discrete values, I would therefore treat it as a classification problem." "So, that's it for supervised learning and in the next video I'll talk about unsupervised learning, which is the other major category of learning algorithms." "In this video, we'll talk about the second major type of machine learning problem, called Unsupervised Learning." "In the last video, we talked about Supervised Learning." "Back then, recall data sets that look like this, where each example was labeled either as a positive or negative example, whether it was a benign or a malignant tumor." "So for each example in Supervised" "Learning, we were told explicitly what is the so-called right answer, whether it's benign or malignant." "In Unsupervised Learning, we're given data that looks different than data that looks like this that doesn't have any labels or that all has the same label or really no labels." "So we're given the data set and we're not told what to do with it and we're not told what each data point is." "Instead we're just told, here is a data set." "Can you find some structure in the data?" "Given this data set, an" "Unsupervised Learning algorithm might decide that the data lives in two different clusters." "And so there's one cluster and there's a different cluster." "And yes, Supervised Learning algorithm may break these data into these two separate clusters." "So this is called a clustering algorithm." "And this turns out to be used in many places." "One example where clustering is used is in Google" "News and if you have not seen this before, you can actually to take a look." "What Google News does is everyday it goes and looks at tens of thousands or hundreds of thousands of new stories on the web and it groups them into cohesive news stories." "For example, let's look here." "The URLs here link to different news stories about the BP Oil Well story." "So, let's click on one of these URL's and we'll click on one of these URL's." "What I'll get to is a web page like this." 
"Here's a Wall Street" "Journal article about, you know, the BP" "Oil Well Spill stories of" ""BP Kills Macondo", which is a name of the spill and if you click on a different URL" "from that group then you might get the different story." "Here's the CNN story about a game, the BP Oil Spill, and if you click on yet a third link, then you might get a different story." "Here's the UK Guardian story about the BP Oil Spill." "So what Google News has done is look for tens of thousands of news stories and automatically cluster them together." "So, the news stories that are all about the same topic get displayed together." "It turns out that clustering algorithms and Unsupervised Learning algorithms are used in many other problems as well." "Here's one on understanding genomics." "Here's an example of DNA microarray data." "The idea is put a group of different individuals and for each of them, you measure how much they do or do not have a certain gene." "Technically you measure how much certain genes are expressed." "So these colors, red, green, gray and so on, they show the degree to which different individuals do or do not have a specific gene." "And what you can do is then run a clustering algorithm to group individuals into different categories or into different types of people." "So this is Unsupervised Learning because we're not telling the algorithm in advance that these are type 1 people, those are type 2 persons, those are type 3 persons and so on and instead what were saying is yeah here's a bunch of data." "I don't know what's in this data." "I don't know who's and what type." "I don't even know what the different types of people are, but can you automatically find structure in the data from the you automatically cluster the individuals into these types that I don't know in advance?" "Because we're not giving the algorithm the right answer for the examples in my data set, this is Unsupervised Learning." "Unsupervised Learning or clustering is used for a bunch of other applications." "It's used to organize large computer clusters." "I had some friends looking at large data centers, that is large computer clusters and trying to figure out which machines tend to work together and if you can put those machines together, you can make your data center work more efficiently." "This second application is on social network analysis." "So given knowledge about which friends you email the most or given your Facebook friends or your Google+ circles, can we automatically identify which are cohesive groups of friends, also which are groups of people that all know each other?" "Market segmentation." "Many companies have huge databases of customer information." "So, can you look at this customer data set and automatically discover market segments and automatically group your customers into different market segments so that you can automatically and more efficiently sell or market your different market segments together?" "Again, this is Unsupervised Learning because we have all this customer data, but we don't know in advance what are the market segments and for the customers in our data set, you know, we don't know in" "advance who is in market segment one, who is in market segment two, and so on." "But we have to let the algorithm discover all this just from the data." "Finally, it turns out that Unsupervised" "Learning is also used for surprisingly astronomical data analysis and these clustering algorithms gives surprisingly interesting useful theories of how galaxies are born." 
"All of these are examples of clustering, which is just one type of Unsupervised Learning." "Let me tell you about another one." "I'm gonna tell you about the cocktail party problem." "So, you've been to cocktail parties before, right?" "Well, you can imagine there's a party, room full of people, all sitting around, all talking at the same time and there are all these overlapping voices because everyone is talking at the same time, and" "it is almost hard to hear the person in front of you." "So maybe at a cocktail party with two people," "two people talking at the same time, and it's a somewhat small cocktail party." "And we're going to put two microphones in the room so there are microphones, and because these microphones are at two different distances from the speakers, each microphone records a different combination of these two speaker voices." "Maybe speaker one is a little louder in microphone one and maybe speaker two is a little bit louder on microphone 2 because the 2 microphones are at different positions relative to the 2 speakers, but each microphone would cause an overlapping" "combination of both speakers' voices." "So here's an actual recording of two speakers recorded by a researcher." "Let me play for you the first, what the first microphone sounds like." "One (uno), two (dos), three (tres), four (cuatro), five" "(cinco), six (seis), seven (siete), eight (ocho), nine (nueve), ten (y diez)." "All right, maybe not the most interesting cocktail party, there's two people counting from one to ten in two languages but you know." "What you just heard was the first microphone recording, here's the second recording." "Uno (one), dos (two), tres (three), cuatro" "(four), cinco (five), seis (six), siete (seven), ocho (eight), nueve (nine) y diez (ten)." "So we can do, is take these two microphone recorders and give them to an Unsupervised Learning algorithm called the cocktail party algorithm, and tell the algorithm" " find structure in this data for you." "And what the algorithm will do is listen to these audio recordings and say, you know it sounds like the two audio recordings are being added together or that have being summed together to produce these recordings that we had." "Moreover, what the cocktail party algorithm will do is separate out these two audio sources that were being added or being summed together to form other recordings and, in fact, here's the first output of the cocktail party algorithm." "One, two, three, four, five, six, seven, eight, nine, ten." "So, I separated out the English voice in one of the recordings." "And here's the second of it." "Uno, dos, tres, quatro, cinco, seis, siete, ocho, nueve y diez." "Not too bad, to give you one more example, here's another recording of another similar situation, here's the first microphone :" "One, two, three, four, five, six, seven, eight, nine, ten." "OK so the poor guy's gone home from the cocktail party and he 's now sitting in a room by himself talking to his radio." "Here's the second microphone recording." "One, two, three, four, five, six, seven, eight, nine, ten." "When you give these two microphone recordings to the same algorithm, what it does, is again say, you know, it sounds like there are two audio sources, and moreover," "the album says, here is the first of the audio sources I found." "One, two, three, four, five, six, seven, eight, nine, ten." "So that wasn't perfect, it got the voice, but it also got a little bit of the music in there." "Then here's the second output to the algorithm." 
"Not too bad, in that second output it managed to get rid of the voice entirely." "And just, you know, cleaned up the music, got rid of the counting from one to ten." "So you might look at an Unsupervised Learning algorithm like this and ask how complicated this is to implement this, right?" "It seems like in order to, you know, build this application, it seems like to do this audio processing you need to write a ton of code or maybe link into like a bunch of synthesizer Java libraries that" "process audio, seems like a really complicated program, to do this audio, separating out audio and so on." "It turns out the algorithm, to do what you just heard, that can be done with one line of code - shown right here." "It take researchers a long time to come up with this line of code." "I'm not saying this is an easy problem," "But it turns out that when you use the right programming environment, many learning algorithms can be really short programs." "So this is also why in this class we're going to use the Octave programming environment." "Octave, is free open source software, and using a tool like Octave or Matlab, many learning algorithms become just a few lines of code to implement." "Later in this class, I'll just teach you a little bit about how to use Octave and you'll be implementing some of these algorithms in Octave." "Or if you have Matlab you can use that too." "It turns out the Silicon Valley, for a lot of machine learning algorithms, what we do is first prototype our software in Octave because software in Octave makes it incredibly fast to implement these learning algorithms." "Here each of these functions like for example the SVD function that stands for singular value decomposition; but that turns out to be a linear algebra routine, that is just built into Octave." "If you were trying to do this in C++ or Java, this would be many many lines of code linking complex C++ or Java libraries." "So, you can implement this stuff as" "C++ or Java or Python, it's just much more complicated to do so in those languages." "What I've seen after having taught machine learning for almost a decade now, is that, you learn much faster if you use Octave as your programming environment, and if you use Octave as your learning tool and as your" "prototyping tool, it'll let you learn and prototype learning algorithms much more quickly." "And in fact what many people will do to in the large Silicon" "Valley companies is in fact, use an algorithm like Octave to first prototype the learning algorithm, and only after you've gotten it to work, then you migrate it to C++ or Java or whatever." "It turns out that by doing things this way, you can often get your algorithm to work much faster than if you were starting out in C++." "So, I know that as an instructor, I get to say "trust me on this one" only a finite number of times, but for those of you who've never used these" "Octave type programming environments before," "I am going to ask you to trust me on this one, and say that you, you will," "I think your time, your development time is one of the most valuable resources." "And having seen lots of people do this, I think you as a machine learning researcher, or machine learning developer will be much more productive if you learn to start in prototype, to start in Octave, in some other language." "Finally, to wrap up this video, I have one quick review question for you." 
"We talked about Unsupervised Learning, which is a learning setting where you give the algorithm a ton of data and just ask it to find structure in the data for us." "Of the following four examples, which ones, which of these four do you think would will be an Unsupervised Learning algorithm as opposed to Supervised Learning problem." "For each of the four check boxes on the left, check the ones for which you think Unsupervised Learning algorithm would be appropriate and then click the button on the lower right to check your answer." "So when the video pauses, please answer the question on the slide." "So, hopefully, you've remembered the spam folder problem." "If you have labeled data, you know, with spam and non-spam e-mail, we'd treat this as a Supervised Learning problem." "The news story example, that's exactly the Google News example that we saw in this video, we saw how you can use a clustering algorithm to cluster these articles together so that's Unsupervised Learning." "The market segmentation example I talked a little bit earlier, you can do that as an Unsupervised Learning problem because I am just gonna get my algorithm data and ask it to discover market segments automatically." "And the final example, diabetes, well, that's actually just like our breast cancer example from the last video." "Only instead of, you know, good and bad cancer tumors or benign or malignant tumors we instead have diabetes or not and so we will use that as a supervised, we will solve that as a Supervised Learning problem just like" "we did for the breast tumor data." "So, that's it for Unsupervised" "Learning and in the next video, we'll delve more into specific learning algorithms and start to talk about just how these algorithms work and how we can, how you can go about implementing them." "Our first learning algorithm will be linear regression." "In this video, you'll see what the model looks like and more importantly you'll see what the overall process of supervised learning looks like." "Let's use some motivating example of predicting housing prices." "We're going to use a data set of housing prices from the city of" "Portland, Oregon." "And here I'm gonna plot my data set of a number of houses that were different sizes that were sold for a range of different prices." "Let's say that given this data set, you have a friend that's trying to sell a house and let's see if friend's house is size of 1250 square feet and you want to tell them how much they might be able to sell the house for." "Well one thing you could do is fit a model." "Maybe fit a straight line to this data." "Looks something like that and based on that, maybe you could tell your friend that let's say maybe he can sell the house for around $220,000." "So this is an example of a supervised learning algorithm." "And it's supervised learning because we're given the, quotes, "right answer" for each of our examples." "Namely we're told what was the actual house, what was the actual price of each of the houses in our data set were sold for and moreover, this is an example of a regression problem where the term regression refers to the fact that we are predicting a real-valued output" "namely the price." "And just to remind you the other most common type of supervised learning problem is called the classification problem where we predict discrete-valued outputs such as if we are looking at cancer tumors and trying to decide if a tumor is malignant or benign." "So that's a zero-one valued discrete output." 
"More formally, in supervised learning, we have a data set and this data set is called a training set." "So for housing prices example, we have a training set of different housing prices and our job is to learn from this data how to predict prices of the houses." "Let's define some notation that we're using throughout this course." "We're going to define quite a lot of symbols." "It's okay if you don't remember all the symbols right now but as the course progresses it will be useful [inaudible] convenient notation." "So I'm gonna use lower case m throughout this course to denote the number of training examples." "So in this data set, if I have, you know, let's say 47 rows in this table." "Then I have 47 training examples and m equals 47." "Let me use lowercase x to denote the input variables often also called the features." "That would be the x is here, it would the input features." "And I'm gonna use y to denote my output variables or the target variable which I'm going to predict and so that's the second column here. [inaudible] notation, I'm going to use (x, y) to denote a single training example." "So, a single row in this table corresponds to a single training example and to refer to a specific training example, I'm going to use this notation x(i) comma gives me y(i) And, we're" "going to use this to refer to the ith training example." "So this superscript i over here, this is not exponentiation right?" "This (x(i), y(i)), the superscript i in parentheses that's just an index into my training set and refers to the ith row in this table, okay?" "So this is not x to the power of i, y to the power of i." "Instead" "(x(i), y(i)) just refers to the ith row of this table." "So for example, x(1) refers to the input value for the first training example so that's 2104." "That's this x in the first row. x(2) will be equal to 1416 right?" "That's the second x and y(1) will be equal to 460." "The first, the y value for my first training example, that's what that (1) refers to." "So as mentioned, occasionally I'll ask you a question to let you check your understanding and a few seconds in this video a multiple-choice question will pop up in the video." "When it does, please use your mouse to select what you think is the right answer." "What defined by the training set is." "So here's how this supervised learning algorithm works." "We saw that with the training set like our training set of housing prices and we feed that to our learning algorithm." "Is the job of a learning algorithm to then output a function which by convention is usually denoted lowercase h and h stands for hypothesis And what the job of the hypothesis is, is, is a function that takes as input the size of a house like maybe the size of the new house your friend's" "trying to sell so it takes in the value of x and it tries to output the estimated value of y for the corresponding house." "So h is a function that maps from x's to y's." "People often ask me, you know, why is this function called hypothesis." "Some of you may know the meaning of the term hypothesis, from the dictionary or from science or whatever." "It turns out that in machine learning, this is a name that was used in the early days of machine learning and it kinda stuck. 'Cause maybe not a great name for this sort of function, for mapping from sizes of" "houses to the predictions, that you know...." "I think the term hypothesis, maybe isn't the best possible name for this, but this is the standard terminology that people use in machine learning." 
"So don't worry too much about why people call it that." "When designing a learning algorithm, the next thing we need to decide is how do we represent this hypothesis h." "For this and the next few videos, I'm going to choose our initial choice , for representing the hypothesis, will be the following." "We're going to represent h as follows." "And we will write this as htheta(x) equals theta0 plus theta1 of x." "And as a shorthand, sometimes instead of writing, you know, h subscript theta of x, sometimes there's a shorthand, I'll just write as a h of x." "But more often I'll write it as a subscript theta over there." "And plotting this in the pictures, all this means is that, we are going to predict that y is a linear function of x." "Right, so that's the data set and what this function is doing, is predicting that y is some straight line function of x." "That's h of x equals theta 0 plus theta 1 x, okay?" "And why a linear function?" "Well, sometimes we'll want to fit more complicated, perhaps non-linear functions as well." "But since this linear case is the simple building block, we will start with this example first of fitting linear functions, and we will build on this to eventually have more complex models, and more complex learning algorithms." "Let me also give this particular model a name." "This model is called linear regression or this, for example, is actually linear regression with one variable, with the variable being x." "Predicting all the prices as functions of one variable X. And another name for this model is univariate linear regression." "And univariate is just a fancy way of saying one variable." "So, that's linear regression." "In the next video we'll start to talk about just how we go about implementing this model." "In this video we'll define something called the cost function." "This will let us figure out how to fit the best possible straight line to our data." "In linear regression we have a training set like that shown here." "Remember our notation M was the number of training examples." "So maybe M=47." "And the form of the hypothesis, which we use to make predictions, is this linear function." "To introduce a little bit more terminology, these theta zero and theta one, right, these theta i's are what I call the parameters of the model." "What we're going to do in this video is talk about how to go about choosing these two parameter values, theta zero and theta one." "With different choices of parameters theta zero and theta one we get different hypotheses, different hypothesis functions." "I know some of you will probably be already familiar with what I'm going to do on this slide, but just to review here are a few examples." "If theta zero is 1.5 and theta one is 0, then the hypothesis function will look like this." "Right, because your hypothesis function will be h( x) equals 1.5 plus 0 times x which is this constant value function, this is flat at 1.5." "If theta zero equals 0 and theta one equals 0.5, then the hypothesis will look like this." "And it should pass through this point (2, 1), says you now have h(x) or really some htheta(x) but sometimes I'll just omit theta for brevity." "So, h(x) will be equal to just 0.5 times x which looks like that." "And finally if theta zero equals 1 and theta one equals 0.5 then we end up with the hypothesis that looks like this." "Let's see, it should pass through the (2, 2) point like so." "And this is my new h(x) or my new htheta(x)." "All right?" 
"Well you remember that this is htheta(x) but as a shorthand sometimes I just write this as h(x)." "In linear regression we have a training set, like maybe the one I've plotted here." "What we want to do is come up with values for the parameters theta zero and theta one." "So that the straight line we get out of this corresponds to a straight line that somehow fits the data well." "Like maybe that line over there." "So how do we come up with values theta zero, theta one that corresponds to a good fit to the data?" "The idea is we're going to choose our parameters theta zero, theta one so that h(x), meaning the value we predict on input x, that this at least close to the values y for the examples in our" "training set, for our training examples." "So, in our training set we're given a number of examples where we know x decides the house and we know the actual price of what it's sold for." "So let's try to choose values for the parameters so that at least in the training set, given the x's in the training set, we make reasonably accurate predictions for the y values." "Let's formalize this." "So linear regression, what we're going to do is that I'm going to want to solve a minimization problem." "So I'm going to write minimize over theta zero, theta one." "And, I want this to be small, right, I want the difference between h(x) and y to be small." "And one thing I'm gonna do is try to minimize the square difference between the output of the hypothesis and the actual price of the house." "Okay?" "So let's fill in some details." "Remember that I was using the notation (x(i), y(i)) to represent the ith training example." "So what I want really is to sum over my training set." "Sum from i equals 1 to M of the square difference between this is the prediction of my hypothesis when it is input the size of house number i, right, minus the actual price that house number i will sell for and I want to" "minimize the sum of my training set sum from i equals 1 through M of the difference of this squared error, square difference between the predicted price of the house and the price that it will actually sell for." "And just remind you of your notation M here was the, the size of my training set, right, so the M there is my number of training examples." "Right?" "That hash sign is the abbreviation for "number" of training examples." "Okay?" "And to make some of our, make the math a little bit easier, I'm going to actually look at, you know, 1 over M times that." "So we're going to try to minimize my average error, which we're going to minimize one by 2M." "Putting the 2, the constant one half, in front it just makes some of the math a little easier." "So minimizing one half of something, right, should give you the same values of the parameters theta zero, theta one as minimizing that function." "And just make sure this, this, this equation is clear, right?" "This expression in here, htheta(x), this is my, this is our usual, right?" "That's equal to this plus theta one x(i)." "And, this notation, minimize over theta zero and theta one, this means find me the values of theta zero and theta one that causes this expression to be minimized." "And this expression depends on theta zero and theta one." "Okay?" "So just to recap, we're posing this problem as find me the values of theta zero and theta one so that the average already one over two M times the sum of square errors between my predictions on the training set minus the actual values of the houses on the" "training set is minimized." 
"So this is going to be my overall objective function for linear regression." "And just to, you know rewrite this out a little bit more cleanly what I'm going to do by convention is we usually define a cost function." "Which is going to be exactly this." "That formula that I have up here." "And what I want to do is minimize over theta zero and theta one my function J of theta zero comma theta one." "Just write this out, this is my cost function." "So, this cost function is also called the squared error function or sometimes called the square error cost function and it turns out that Why, why do we, you know, take the squares of the errors?" "It turns out that the squared error cost function is a reasonable choice and will work well for most problems, for most regression problems." "There are other cost functions that will work pretty well, but the squared error cost function is probably the most commonly used one for regression problems." "Later in this class we'll also talk about alternative cost functions as well, but this, this choice that we just had, should be a pret-, pretty reasonable thing to try for most linear regression problems." "Okay." "So that's the cost function." "So far we've just seen a mathematical definition of you know this cost function and in case this function J of theta zero theta one in case this function seems a little bit abstract and you still don't have a good sense of what its doing in the next video, in the" "next couple videos we're actually going to go a little bit deeper into what the cost function J is doing and try to give you better intuition about what its computing and why we want to use it." "In the previous video, we gave the mathematical definition of the cost function." "In this video, let's look at some examples, to get back to intuition about what the cost function is doing, and why we want to use it." "To recap, here's what we had last time." "We want to fit a straight line to our data, so we had this formed as a hypothesis with these parameters theta zero and theta one, and with different choices of the parameters we end up with different straight line" "fits." "So the data which are fit like so, and there's a cost function, and that was our optimization objective." "[sound] So this video, in order to better visualize the cost function J, I'm going to work with a simplified hypothesis function, like that shown on the right." "So I'm gonna use my simplified hypothesis, which is just theta one times X. We can, if you want, think of this as setting the parameter theta zero equal to 0." "So I have only one parameter theta one and my cost function is similar to before except that now H of X that is now equal to just theta one times X. And I have only one parameter theta one and so my" "optimization objective is to minimize j of theta one." "In pictures what this means is that if theta zero equals zero that corresponds to choosing only hypothesis functions that pass through the origin, that pass through the point (0, 0)." "Using this simplified definition of a hypothesizing cost function let's try to understand the cost function concept better." "It turns out that two key functions we want to understand." "The first is the hypothesis function, and the second is a cost function." "So, notice that the hypothesis, right, H of X. For a face value of theta one, this is a function of X. So the hypothesis is a function of, what is the size of the house" "X. 
"In contrast, the cost function J is a function of the parameter theta1, which controls the slope of the straight line." "Let's plot these functions and try to understand them both better." "Let's start with the hypothesis." "On the left, let's say here's my training set, with three points at (1, 1), (2, 2), and (3, 3)." "Let's pick a value for theta1; when theta1 equals one, my hypothesis is going to look like this straight line over here." "And I'm going to point out that, when I'm plotting my hypothesis function, the horizontal axis is labeled x, the size of the house." "Now, temporarily setting theta1 equal to one, what I want to do is figure out what J(theta1) is when theta1 equals one." "So let's go ahead and compute what the cost function equals for the value one." "Well, as usual, my cost function is the sum over my training set of the usual squared error term: J(theta1) = (1/2m) * sum from i = 1 to m of (theta1 * x(i) - y(i))^2." "If you simplify, this turns out to be (1/(2*3)) * (0^2 + 0^2 + 0^2), which is of course just equal to zero." "Inside the cost function, each of these terms is equal to zero, because for my specific training set, my three training examples are (1, 1), (2, 2), (3, 3), and if theta1 is equal to one, then h(x(i)) is equal to y(i) exactly." "And so h(x(i)) - y(i) is zero for each term, which is why I find that J(1) is equal to zero." "So we now know that J(1) = 0." "Let's plot that." "What I'm going to do on the right is plot my cost function J, and notice that, because my cost function is a function of my parameter theta1, when I plot it the horizontal axis is now labeled theta1." "So J(1) equals zero; let's go ahead and plot that; I end up with an x over there." "Now let's look at some other examples." "Theta1 can take on a range of different values, right? Theta1 can take on negative values, zero, and positive values." "So what if theta1 is equal to 0.5? What happens then?" "Let's go ahead and plot that." "I'm now going to set theta1 equal to 0.5, and in that case my hypothesis looks like this, a line with slope equal to 0.5; and let's compute J(0.5)." "So that is going to be one over 2m times my usual cost function." "It turns out the cost is going to be the squared height of this vertical gap, plus the squared height of that one, plus the squared height of that one; because each vertical distance is the difference between y(i) and the predicted value h(x(i)), right?" "So the first term is going to be (0.5 - 1)^2, because my hypothesis predicted 0.5, whereas the actual value was one." "For my second example, I get (1 - 2)^2, because my hypothesis predicted one, but the actual housing price was two." "And then finally, plus (1.5 - 3)^2." "And so that's equal to one over (2 times 3), because m, my training set size, is three, and the terms in the parentheses sum to 3.5." "So that's 3.5 over 6, which is about 0.58." "So now we know that J(0.5) is about 0.58; let's go and plot that."
"Oh excuse me, math error, it's actually 0.58." "So we plot that which is maybe about over there." "Okay?" "Now, let's do one more." "How about if theta one is equal to zero, what is J of zero equal to?" "It turns out that if theta one is equal to zero, then H of X is just equal to, you know, this flat line, right, that just goes horizontally like this." "And so, measuring the errors." "We have that J of zero is equal to one over two M, times one squared plus two squared plus three squared, which is, One six times fourteen which is about 2.3." "So let's go ahead and plot as well." "So it ends up with a value around 2.3 and of course we can keep on doing this for other values of theta one." "It turns out that you can have you know negative values of theta one as well so if theta one is negative then h of x would be equal to say minus 0.5 times x then theta one is minus 0.5 and so that corresponds" "to a hypothesis with a slope of negative 0.5." "And you can actually keep on computing these errors." "This turns out to be, you know, for 0.5, it turns out to have really high error." "It works out to be something, like, 5.25." "And so on, and the different values of theta one, you can compute these things, right?" "And it turns out that you, your computed range of values, you get something like that." "And by computing the range of values, you can actually slowly create out." "What does function J of Theta say and that's what J of Theta is." "To recap, for each value of theta one, right?" "Each value of theta one corresponds to a different hypothesis, or to a different straight line fit on the left." "And for each value of theta one, we could then derive a different value of j of theta one." "And for example, you know, theta one=1, corresponded to this straight line straight through the data." "Whereas theta one=0.5." "And this point shown in magenta corresponded to maybe that line, and theta one=zero which is shown in blue that corresponds to this horizontal line." "Right, so for each value of theta one we wound up with a different value of J of theta one and we could then use this to trace out this plot on the right." "Now you remember, the optimization objective for our learning algorithm is we want to choose the value of theta one." "That minimizes J of theta one." "Right?" "This was our objective function for the linear regression." "Well, looking at this curve, the value that minimizes j of theta one is, you know, theta one equals to one." "And low and behold, that is indeed the best possible straight line fit through our data, by setting theta one equals one." "And just, for this particular training set, we actually end up fitting it perfectly." "And that's why minimizing j of theta one corresponds to finding a straight line that fits the data well." "So, to wrap up." "In this video, we looked up some plots." "To understand the cost function." "To do so, we simplify the algorithm." "So that it only had one parameter theta one." "And we set the parameter theta zero to be only zero." "In the next video." "We'll go back to the original problem formulation and look at some visualizations involving both theta zero and theta one." "That is without setting theta zero to zero." "And hopefully that will give you, an even better sense of what the cost function j is doing in the original linear regression formulation." "In this video, lets delve deeper and get even better intuition about what the cost function is doing." "This video assumes that you're familiar with contour plots." 
"If you are not familiar with contour plots or contour figures some of the illustrations in this video may or may not make sense to you but is okay and if you end up skipping this video or some of it does not quite make sense because you haven't seen" "contour plots before." "That's okay and you will still understand the rest of this course without those parts of this." "Here's our problem formulation as usual, with the hypothesis parameters, cost function, and our optimization objective." "Unlike before, unlike the last video, I'm going to keep both of my parameters, theta zero, and theta one, as we generate our visualizations for the cost function." "So, same as last time, we want to understand the hypothesis H and the cost function J. So, here's my training set of housing prices and let's make some hypothesis." "You know, like that one, this is not a particularly good hypothesis." "But, if I set theta zero=50 and theta one=0.06, then I end up with this hypothesis down here and that corresponds to that straight line." "Now given these value of theta zero and theta one, we want to plot the corresponding, you know, cost function on the right." "What we did last time was, right, when we only had theta one." "In other words, drawing plots that look like this as a function of theta one." "But now we have two parameters, theta zero, and theta one, and so the plot gets a little more complicated." "It turns out that when we have only one parameter, that the parts we drew had this sort of bow shaped function." "Now, when we have two parameters, it turns out the cost function also has a similar sort of bow shape." "And, in fact, depending on your training set, you might get a cost function that maybe looks something like this." "So, this is a 3-D surface plot, where the axes are labeled theta zero and theta one." "So as you vary theta zero and theta one, the two parameters, you get different values of the cost function J (theta zero, theta one) and the height of this surface above a particular point of theta zero, theta one." "Right, that's, that's the vertical axis." "The height of the surface of the points indicates the value of J of theta zero, J of theta one." "And you can see it sort of has this bow like shape." "Let me show you the same plot in 3D." "So here's the same figure in 3D, horizontal axis theta one and vertical axis J(theta zero, theta one), and if I rotate this plot around." "You kinda of a get a sense, I hope, of this bowl shaped surface as that's what the cost function J looks like." "Now for the purpose of illustration in the rest of this video" "I'm not actually going to use these sort of 3D surfaces to show you the cost function J, instead I'm going to use contour plots." "Or what I also call contour figures." "I guess they mean the same thing." "To show you these surfaces." "So here's an example of a contour figure, shown on the right, where the axis are theta zero and theta one." "And what each of these ovals, what each of these ellipsis shows is a set of points that takes on the same value for J(theta zero, theta one)." "So concretely, for example this, you'll take that point and that point and that point." "All three of these points that I just drew in magenta, they have the same value for J (theta zero, theta one)." "Okay." "Where, right, these, this is the theta zero, theta one axis but those three have the same Value for J (theta zero, theta one) and if you haven't seen contour plots much before think of, imagine if you" "will." 
"A bow shaped function that's coming out of my screen." "So that the minimum, so the bottom of the bow is this point right there, right?" "This middle, the middle of these concentric ellipses." "And imagine a bow shape that sort of grows out of my screen like this, so that each of these ellipses, you know, has the same height above my screen." "And the minimum with the bow, right, is right down there." "And so the contour figures is a, is way to, is maybe a more convenient way to visualize my function J. [sound] So, let's look at some examples." "Over here, I have a particular point, right?" "And so this is, with, you know, theta zero equals maybe about 800, and theta one equals maybe a -0.15 ." "And so this point, right, this point in red corresponds to one set of pair values of theta zero, theta one and the corresponding, in fact, to that hypothesis, right, theta zero is about 800, that is, where it intersects the vertical axis is around 800, and this is" "slope of about -0.15." "Now this line is really not such a good fit to the data, right." "This hypothesis, h(x), with these values of theta zero, theta one, it's really not such a good fit to the data." "And so you find that, it's cost." "Is a value that's out here that's you know pretty far from the minimum right it's pretty far this is a pretty high cost because this is just not that good a fit to the data." "Let's look at some more examples." "Now here's a different hypothesis that's you know still not a great fit for the data but may be slightly better so here right that's my point that those are my parameters theta zero theta one and so my theta zero value." "Right?" "That's bout 360 and my value for theta one." "Is equal to zero." "So, you know, let's break it out." "Let's take theta zero equals 360 theta one equals zero." "And this pair of parameters corresponds to that hypothesis, corresponds to flat line, that is, h(x) equals 360 plus zero times x." "So that's the hypothesis." "And this hypothesis again has some cost, and that cost is, you know, plotted as the height of the J function at that point." "Let's look at just a couple of examples." "Here's one more, you know, at this value of theta zero, and at that value of theta one, we end up with this hypothesis, h(x) and again, not a great fit to the data, and is actually further away from the minimum." "Last example, this is actually not quite at the minimum, but it's pretty close to the minimum." "So this is not such a bad fit to the, to the data, where, for a particular value, of, theta zero." "Which, one of them has value, as in for a particular value for theta one." "We get a particular h(x)." "And this is, this is not quite at the minimum, but it's pretty close." "And so the sum of squares errors is sum of squares distances between my, training samples and my hypothesis." "Really, that's a sum of square distances, right?" "Of all of these errors." "This is pretty close to the minimum even though it's not quite the minimum." "So with these figures I hope that gives you a better understanding of what values of the cost function J, how they are and how that corresponds to different hypothesis and so as how better hypotheses may corresponds to points" "that are closer to the minimum of this cost function J. Now of course what we really want is an efficient algorithm, right, a efficient piece of software for automatically finding The value of theta zero and theta one, that minimizes the" "cost function J, right?" 
"And what we, what we don't wanna do is to, you know, how to write software, to plot out this point, and then try to manually read off the numbers, that this is not a good way to do it." "And, in fact, we'll see it later, that when we look at more complicated examples, we'll have high dimensional figures with more parameters, that, it turns out, we'll see in a few, we'll see later in" "this course, examples where this figure, you know, cannot really be plotted, and this becomes much harder to visualize." "And so, what we want is to have software to find the value of theta zero, theta one that minimizes this function and in the next video we start to talk about an algorithm for automatically finding that value of theta zero and theta one that minimizes the cost function J." "We've previously defined the cost function J. In this video I want to tell you about an algorithm called gradient descent for minimizing the cost function J. It turns out gradient descent is a more general algorithm and is used not only in linear" "regression." "It's actually used all over the place in machine learning." "And later in the class we'll use gradient descent to minimize other functions as well, not just the cost function J, for linear regression." "So in this video, I'm going to talk about gradient descent for minimizing some arbitrary function J. And then in later videos, we'll take those algorithm and apply it specifically to the cost function J that we had to find for linear regression." "So here's the problem setup." "We're going to see that we have some function J of (theta0, theta1)." "Maybe it's a cost function from linear regression." "Maybe it's some other function we want to minimize." "And we want to come up with an algorithm for minimizing that as a function of J of (theta0, theta1)." "Just as an aside, it turns out that gradient descent actually applies to more general functions." "So imagine if you have a function that's a function of" "J of (theta0, theta1, theta2, up to some theta n)." "And you want to minimize over (theta0 up to theta n) of this J of (theta0 up to theta n)." "It turns out gradient descent is an algorithm for solving this more general problem, but for the sake of brevity, for the sake of your succinctness of notation, I'm just going to present only two parameters throughout the rest of this video." "Here's the idea for gradient descent." "What we're going to do is we're going to start off with some initial guesses for theta0 and theta1." "Doesn't really matter what they are, but a common choice would be if we set theta0 to 0, and set theta1 to 0." "Just initialize them to 0." "What we're going to do in gradient descent is we'll keep changing theta0 and theta1 a little bit to try to reduce J of (theta0, theta1) until hopefully we wind up at a minimum or maybe a local minimum." "So, let's see see pictures of what gradient descent does." "Let's say I try to minimize this function." "So notice the axes." "This is, (theta0, theta1) on the horizontal axes, and J is a vertical axis." "And so the height of the surface shows J, and we want to minimize this function." "So, we're going to start off with (theta0, theta1) at some point." "So imagine picking some value for (theta0, theta1), and that corresponds to starting at some point on the surface of this function." "Okay?" "So whatever value of (theta0, theta1) gives you some point here." "I did initialize them to 0, but sometimes you initialize it to other values as well." "Now." "I want us to imagine that this figure shows a hill." 
"Imagine this is like a landscape of some grassy park with two hills like so." "And I want you to imagine that you are physically standing at that point on the hill on this little red hill in your park." "In gradient descent, what we're going to do is spin 360 degrees around and just look all around us and ask, "If I were to take a little baby step in some direction, and I want to go downhill as" "quickly as possible, what direction do I take that little baby step in if I want to go down, if I sort of want to physically walk down this hill as rapidly as possible?" Turns out that if we're standing at that point on the hill, you look all" "around, you find that the best direction to take a little step downhill is roughly that direction." "Okay." "And now you're at this new point on your hill." "You're going to, again, look all around, and then say, "What direction should I step in order to take a little baby step downhill?" And if you do that and take another step, you" "take a step in that direction, and then you keep going." "You know, from this new point, you look around, decide what direction will take you downhill most quickly, take another step, another step, and so on, until you converge to this, local minimum down here." "Further descent has an interesting property." "This first time we ran gradient descent, we were starting at this point over here, right?" "Started at that point over here." "Now imagine, we initialize gradient descent just a couple steps to the right." "Imagine we initialized gradient descent with that point on the upper right." "If you were to repeat this process, and stop at the point, and look all around." "Take a little step in the direction of steepest descent." "You would do that." "Then look around, take another step, and so on." "And if you start it just a couple steps to the right, the gradient descent will have taken you to this second local optimum over on the right." "So if you had started at this first point, you would have wound up at this local optimum." "But if you started just a little bit, a slightly different location, you would have wound up at a very different local optimum." "And this is a property of gradient descent that we'll say a little bit more about later." "So, that's the intuition in pictures." "Let's look at the map." "This is the definition of the gradient descent algorithm." "We're going to just repeatedly do this." "On to convergence." "We're going to update my parameter theta j by, you know, taking theta j and subtracting from it alpha times this term over here." "So, let's see." "There are a lot of details in this equation, so let me unpack some of it." "First, this notation here, colon equals." "We're going to use := to denote assignment, so it's the assignment operator." "So concretely, if I write A: =B, what this means in a computer, this means take the value in B and use it to overwrite whatever the value of A. So this means we will set A to be equal to the value of B." "Okay, it's assignment." "I can also do A:=A+1." "This means take A and increase its value by one." "Whereas in contrast, if I use the equals sign and I write A=B, then this is a truth assertion." "So if I write A=B, then I'm asserting that the value of A equals to the value of B. So the left hand side, that's a computer operation, where you set the value of A to be a value." "The right hand side, this is asserting, I'm just making a claim that the values of A and B are the same." "And so, whereas I can write A: =A+1, that means increment A by 1." 
"Hopefully, I won't ever write A=A+1." "Because that's just wrong." "A and A+1 can never be equal to the same values." "So that's the first part of the definition." "This alpha here is a number that is called the learning rate." "And what alpha does is, it basically controls how big a step we take downhill with gradient descent." "So if alpha is very large, then that corresponds to a very aggressive gradient descent procedure, where we're trying to take huge steps downhill." "And if alpha is very small, then we're taking little, little baby steps downhill." "And, I'll come back and say more about this later." "About how to set alpha and so on." "And finally, this term here." "That's the derivative term, I don't want to talk about it right now, but I will derive this derivative term and tell you exactly what this is based on." "And some of you will be more familiar with calculus than others, but even if you aren't familiar with calculus don't worry about it, I'll tell you what you need to know about this term here." "Now there's one more subtlety about gradient descent which is, in gradient descent, we're going to update theta0 and theta1." "So this update takes place where j=0, and j=1." "So you're going to update j, theta0, and update theta1." "And the subtlety of how you implement gradient descent is, for this expression, for this update equation, you want to simultaneously update theta0 and theta1." "What I mean by that is that in this equation, we're going to update theta0:=theta0 - something, and update theta1:=theta1 - something." "And the way to implement this is, you should compute the right hand side." "Compute that thing for both theta0 and theta1, and then simultaneously at the same time update theta0 and theta1." "So let me say what I mean by that." "This is a correct implementation of gradient descent meaning simultaneous updates." "I'm going to set temp0 equals that, set temp1 equals that." "So basically compute the right hand sides." "And then having computed the right hand sides and stored them together in temp0 and temp1, I'm going to update theta0 and theta1 simultaneously, because that's the correct implementation." "In contrast, here's an incorrect implementation that does not do a simultaneous update." "So in this incorrect implementation, we compute temp0, and then we update theta0 and then we compute temp1." "Then we update temp1." "And the difference between the right hand side and the left hand side implementations is that if we look down here, you look at this step, if by this time you've already updated theta0 then you would be using the new value of theta0 to compute this" "derivative term and so this gives you a different value of temp1 than the left hand side, because you've now plugged in the new value of theta0 into this equation." "And so this on right hand side is not a correct implementation of gradient descent." "So I don't want to say why you need to do the simultaneous updates, it turns out that the way gradient descent is usually implemented, we'll say more about it later, it actually turns out to" "be more natural to implement the simultaneous update." "And when people talk about gradient descent, they always mean simultaneous update." "If you implement the non-simultaneous update, it turns out it will probably work anyway, but this algorithm on the right is not what people people refer to as gradient descent and this is some other algorithm with different properties." "And for various reasons, this can behave in slightly stranger ways." 
"And what you should do is to really implement the simultaneous update of gradient descent." "So, that's the outline of the gradient descent algorithm." "In the next video, we're going to go into the details of the derivative term, which I wrote out but didn't really define." "And if you've taken a calculus class before and if you're familiar with partial derivatives and derivatives, it turns out that's exactly what that derivative term is." "But in case you aren't familiar with calculus, don't worry about it." "The next video will give you all the intuitions and will tell you everything you need to know to compute that derivative term, even if you haven't seen calculus, or even if you haven't seen partial derivatives before." "And with that, with the next video, hopefully, we'll be able to give all the intuitions you need to apply gradient descent." "In this and the next video," "I'd like to tell you about one possible extension to the anomaly detection algorithm that we've developed so far." "This extension uses something called the multivariate" "Gaussian distribution, and it has some advantages, and some disadvantages, and it can sometimes catch some anomalies that the earlier algorithm didn't." "To motivate this, let's start with an example." "Let's say that so our unlabeled data looks like what I have plotted here." "And I'm going to use the example of monitoring machines in the data center, monitoring computers in the data center." "So my two features are x1 which is the CPU load and x2 which is maybe the memory use." "So if I take my two features, x1 and x2, and I model them as Gaussians then here's a plot of my X1 features, here's a plot of my X2 features, and so if I fit a" "Gaussian to that, maybe I'll get a Gaussian like this, so here's P of X 1, which depends on the parameters mu 1, and sigma squared 1, and here's my memory used, and," "you know, maybe I'll get a Gaussian that looks like this, and this is my P of X 2, which depends on mu 2 and sigma squared 2." "And so this is how the anomaly detection algorithm models X1 and X2." "Now let's say that in the test sets I have an example that looks like this." "The location of that green cross, so the value of" "X 1 is about 0.4, and the value of X 2 is about 1.5." "Now, if you look at the data, it looks like, yeah, most of the data data lies in this region, and so that green cross is pretty far away from any of the data I've seen." "It looks like that should be raised as an anomaly." "So, in my data, in my, in the data of my good examples, it looks like, you know, the" "CPU load, and the memory use, they sort of grow linearly with each other." "So if I have a machine using lots of CPU, you know memory use will also be high, whereas this example, this green example it looks like here, the CPU load is very low, but the memory use" "is very high, and I just have not seen that before in my training set." "It looks like that should be an anomaly." "But let's see what the anomaly detection algorithm will do." "Well, for the CPU load, it puts it at around there" "0.5 and this reasonably high probability is not that far from other examples we've seen, maybe, whereas, for the memory use, this appointment, 0.5, whereas for the memory use, it's about 1.5, which is there." "Again, you know, it's all to us, it's not terribly Gaussian, but the value here and the value here is not that different from many other examples we've seen, and so P of" "X 1, will be pretty high, reasonably high." "P of X 2 reasonably high." 
"I mean, if you look at this plot right, this point here, it doesn't look that bad, and if you look at this plot, you know across here, doesn't look that bad." "I mean, I have had examples with even greater memory used, or with even less CPU use, and so this example doesn't look that anomalous." "And so, an anomaly detection algorithm will fail to flag this point as an anomaly." "And it turns out what our anomaly detection algorithm is doing is that it is not realizing that this blue ellipse shows the high probability region, is that, one of the thing is that, examples here, a high probability, and the" "examples, the next circle of from a lower probably, and examples here are even lower probability, and somehow, here are things that are, green cross there, it's pretty high probability," "and in particular, it tends to think that, you know, everything in this region, everything on the line that I'm circling over, has, you know, about equal probability, and it doesn't realize that something" "out here actually has much lower probability than something over there." "So, in order to fix this, we can, we're going to develop a modified version of the anomaly detection algorithm, using something called the multivariate" "Gaussian distribution also called the multivariate normal distribution." "So here's what we're going to do." "We have features x which are in Rn and instead of P of X 1, P of X 2, separately, we're going to model P of" "X, all in one go, so model P of X, you know, all at the same time." "So the parameters of the multivariate Gaussian distribution are mu, which is a vector, and sigma, which is an n by n matrix, called a covariance matrix," "and this is similar to the covariance matrix that we saw when we were working with the PCA, with the principal components analysis algorithm." "For the second complete is, let me just write out the formula" "for the multivariate Gaussian distribution." "So we say that probability of" "X, and this is parameterized by my parameters mu and sigma that the probability of x is equal to once again there's absolutely no need to memorize this formula." "You know, you can look it up whenever you need to use it, but this is what the probability of X looks like." "Transverse, 2nd inverse, X minus mu." "And this thing here, the absolute value of sigma, this thing here when you write this symbol, this is called the determent of sigma and this is a mathematical function of a matrix and you really don't need to know what the" "determinant of a matrix is, but really all you need to know is that you can compute it in octave by using the octave command DET of" "sigma." "Okay, and again, just be clear, alright?" "In this expression, these sigmas here, these are just n by n matrix." "This is not a summation and you know, the sigma there is an n by n matrix." "So that's the formula for P of X, but it's more interestingly, or more importantly," "what does P of X actually looks like?" "Lets look at some examples of multivariate Gaussian distributions." "So let's take a two dimensional example, say if" "I have N equals 2, I have two features, X 1 and X 2." "Lets say I set MU to be equal to 0 and sigma to be equal to this matrix here." "With 1s on the diagonals and 0s on the off-diagonals, this matrix is sometimes also called the identity matrix." "In that case, p of x will look like this, and what" "I'm showing in this figure is, you know, for a specific value of X1 and for a specific value of" "X2, the height of this surface the value of p of x." 
"And so with this setting the parameters p of x is highest when" "X1 and X2 equal zero 0, so that's the peak of this Gaussian distribution," "and the probability falls off with this sort of two dimensional Gaussian or this bell shaped two dimensional bell-shaped surface." "Down below is the same thing but plotted using a contour plot instead, or using different colors, and so this heavy intense red in the middle, corresponds to the highest values, and then the values decrease with the yellow being slightly lower" "values the cyan being lower values and this deep blue being the lowest values so this is really the same figure but plotted viewed from the top instead, using colors instead." "And so, with this distribution, you see that it faces most of the probability near 0,0 and then as you go out from 0,0 the probability of X1 and X2 goes down." "Now lets try varying some of the parameters and see what happens." "So let's take sigma and change it so let's say sigma shrinks a little bit." "Sigma is a covariance matrix and so it measures the variance or the variability of the features X1 X2." "So if the shrink sigma then what you get is what you get is that the width of this bump diminishes" "and the height also increases a bit, because the area under the surface is equal to 1." "So the integral of the volume under the surface is equal to 1, because probability distribution must integrate to one." "But, if you shrink the variance, it's kinda like shrinking sigma squared, you end up with a narrower distribution, and one that's a little bit taller." "And so you see here also the concentric ellipsis has shrunk a little bit." "Whereas in contrast if you were to increase sigma to 2 2 on the diagonals, so it is now two times the identity then you end up with a much wider and much flatter Gaussian." "And so the width of this is much wider." "This is hard to see but this is still a bell shaped bump, it's just flattened down a lot, it has become much wider and so the variance or the variability of X1 and X2 just becomes wider." "Here are a few more examples." "Now lets try varying one of the elements of sigma at the time." "Let's say I send sigma to 0.6 there, and 1 over there." "What this does, is this reduces the variance of" "the first feature, X 1, while keeping the variance of the second feature X 2, the same." "And so with this setting of parameters, you can model things like that." "X 1 has smaller variance, and" "X 2 has larger variance." "Whereas if I do this, if I set this matrix to 2, 1 then you can also model examples where you know here" "we'll say X1 can have take on a large range of values whereas X2 takes on a relatively narrower range of values." "And that's reflected in this figure as well, you know where, the distribution falls off more slowly as X 1 moves away from 0, and falls off very rapidly as X 2 moves away from 0." "And similarly if we were to modify this element of the matrix instead, then similar to the previous" "slide, except that here where you know playing around here saying that X2 can take on a very small range of values and so here if this is 0.6, we notice now X2" "tends to take on a much smaller range of values than the original example," "whereas if we were to set sigma to be equal to 2 then that's like saying X2 you know, has a much larger range of values." "Now, one of the cool things about the multivariate" "Gaussian distribution is that you can also use it to model correlations between the data." 
"That is we can use it to model the fact that" "X1 and X2 tend to be highly correlated with each other for example." "So specifically if you start to change the off diagonal entries of this covariance matrix you can get a different type of Gaussian distribution." "And so as I increase the off-diagonal entries from .5 to .8, what" "I get is this distribution that is more and more thinly peaked along this sort of x equals y line." "And so here the contour says that x and y tend to grow together and the things that are with large probability are if either X1 is large and" "Y2 is large or X1 is small and Y2 is small." "Or somewhere in between." "And as this entry, 0.8 gets large, you get a Gaussian distribution, that's sort of where all the probability lies on this sort of narrow region," "where x is approximately equal to y." "This is a very tall, thin distribution you know line mostly along this line" "central region where x is close to y." "So this is if we set these entries to be positive entries." "In contrast if we set these to negative values, as" "I decreases it to -.5 down to -.8, then what we get is a model where we put most of the probability in this sort of negative X one in the next 2 correlation region, and so, most of the" "probability now lies in this region, where X 1 is about equal to" "X 2, rather than X 1 equals X 2." "And so this captures a sort of negative correlation between x1" "and x2." "And so this is a hopefully this gives you a sense of the different distributions that the multivariate Gaussian distribution can capture." "So follow up in varying, the covariance matrix sigma, the other thing you can do is also, vary the mean parameter mu, and so operationally, we have mu equal 0 0, and so the distribution was centered around" "X 1 equals 0, X2 equals 0, so the peak of the distribution is here, whereas, if we vary the values of mu, then that varies the peak of the distribution and so, if mu equals 0, 0.5," "the peak is at, you know," "X1 equals zero, and X2 equals 0.5, and so the peak or the center of this distribution has shifted," "and if mu was 1.5 minus 0.5 then OK," "and similarly the peak of the distribution has now shifted to a different location, corresponding to where, you know," "X1 is 1.5 and X2 is -0.5, and so varying the mu parameter, just shifts around the center of this whole distribution." "So, hopefully, looking at all these different pictures gives you a sense of the sort of probability distributions that the Multivariate Gaussian Distribution allows you to capture." "And the key advantage of it is it allows you to capture, when you'd expect two different features to be positively correlated, or maybe negatively correlated." "In the next video, we'll take this multivariate Gaussian distribution and apply it to anomaly detection." "In the previous video, we gave a mathematical definition of gradient descent." "Let's delve deeper, and in this video, get better intuition about what the algorithm is doing, and why the steps of the gradient descent algorithm might make sense." "Here's the gradient descent algorithm that we saw last time." "And, just to remind you, this parameter, or this term, alpha, is called the learning rate." "And it controls how big a step we take when updating my parameter theta J. And this second term here is the derivative term." "And what I want to do in this video is give you better intuition about what each of these two terms is doing and why, when put together, this entire update makes sense." 
"In order to convey these intuitions, what" "I want to do is use a slightly simpler example where we want to minimize the function of just one parameter." "So, so we have a, say we have a cost function J of just one parameter, theta one, like we did, you know, a few videos back." "Where theta one is a real number, okay?" "Just so we can have 1D plots, which are a little bit simpler to look at." "And let's try to understand why gradient descent would do on this function." "[sound]." "So, let's say here's my function." "J of theta one, and so that's my, and where theta one is a real number." "Right, now let's say I've initialized gradient descent with theta one at this location." "So image that we start off at that point on my function." "What gradient descent will do, is it will update." "Theta one gets updated as theta one minus Alpha times DD theta one J of theta one, right?" "and oh an just as an aside you know this, this derivative term, right?" "If you're wondering why I changed the notation from these partial derivative symbols." "If you don't know what the difference is between these partial derivative symbols and the dd theta don't worry about it." "Technically in mathematics we call this a partial derivative, we call this a derivative, depending on the number of, of parameters in the function J, but that's a mathematical technicality, so, you know For the purpose of this lecture, think of" "these partial symbols, and DD theta one as exactly the same thing." "And, don't worry about whether there are any differences." "I'm gonna try to use the mathematically precise notation." "But for our purposes, these notations are really the same thing." "So, let's see what this, this equation will do." "And so we're going to compute this derivative of, I'm not sure if you've seen derivatives in calculus before." "But what a derivative, at this point, does, is basically saying, you know, let's." "Take the tangent to that point, like that straight line, the red line, just, just touching this function and let's look at the slope of this red line." "That's where the derivative is." "It says what's the slope of the line that is just tangent to the function, okay, and the slope of the line is of course is just right, you know just the height divided by this horizontal thing." "Now." "This line has a positive slope, so it has a positive derivative." "And so, my update to theta is going to be, theta one gives the update that theta one minus alpha times some positive number." "Okay?" "Alpha, the learning rate is always a positive number." "And so" "I'm gonna to take theta one, this update as theta one minus something." "So I'm gonna end up moving theta one to the left." "I'm gonna decrease theta one and we can see this is the right thing to do because I actually went ahead in this direction you know to get me closer to the minimum over there." "So, gradient descent so far seems to be doing the right thing." "Let's look at another example." "So let's take my same function j." "Just trying to draw the same function j of theta one." "And now let's say" "I had instead initialized my parameter over there on the left." "So theta one is here." "I'm gonna add that point on the surface." "Now, my derivative term, d, d theta one j of theta one, when evaluated at this point, gonna look at right." "The slope of that line." "So this derivative term is a slope of this line." "But this line is slanting down, so this line has negative slope." "Right?" 
"Or alternatively I say that this function has negative derivative, just means negative slope at that point." "So this is less than equal to zero." "So when I update theta, then if theta is updated as theta minus alpha times a negative number." "And so I have theta one minus a negative number which means I'm actually going to increase theta, right?" "Because this is minus of a negative number means I'm adding something to theta and what that means is that I'm going to end up increasing theta." "And so we'll start here and increase theta, which again seems like the thing I want to do to try to get me closer to the minimum." "So, this hopefully explains the intuition behind what the derivative term is doing." "Let's next take a look at the learning rate term alpha, and try to figure out what that's doing." "So, here's my gradient descent update rule." "Right, there's this equation And let's look at what can happen, if" "Alpha is either too small, or if Alpha is too large." "So this first example, what happens if Alpha is too small." "So here's my function J. J of theta." "Lets just start here." "If alpha is too small then what I'm going to do is gonna multiply the update by some small number." "So end up taking, you know, it's like a baby step like that." "Okay, so that's one step [inaudible]." "Then from this new point we're gonna take another step." "But if the alpha is too small lets take another little baby step." "And so if And so if my learning rate is too small." "I'm gonna end up, you know, taking these tiny, tiny baby steps to try to get to the minimum and I'm gonna need a lot of steps to get to the minimum and so." "If alpha's too small, can be slow because it's gonna take these tiny, tiny baby steps." "And it's gonna need a lot of steps before it gets anyway close to the global minimum." "Now, how about if the alpha is too large." "So here's my function J of theta." "Turns out if alpha is too large, then grading descent can overshoot a minimum and may even fail to converge or even diverge." "So here is what I mean." "Let's say [inaudible] ireful minimum So the derivative council" "It's actually close to the minimum." "So the derivative points to the right, but if alpha is too big, I'm gonna take a huge step." "Maybe I'm gonna take a huge step like that." "Right?" "So I end up taking a huge step." "Now, my cost function has got worse. cause it starts off from this value then now my value has gotten worse." "Now my derivatives, you know, points to the left, it's actually decrease theta." "But look, if my learning rate is too big," "I may take a huge step going from here all the way out there so I end up. going all there." "Right?" "And if my learning rate was too big I can take another huge step on the next acceleration and kind of overshoot and overshoot and so on until you notice" "I'm actually getting further and further away from the minimum." "And so if alpha is too large it can fail to converge or even diverge." "Now, I have another question for you." "So, this is a tricky one." "And when I was first learning this stuff, it actually took me a long time to figure this out." "What if your pre-emptive theta one is already at a local minimum?" "What do you think one step of gradient descent will do?" "So let's suppose you initialize theta one at a local minimum." "So you know suppose this is your initial value of theta one over here and it's already at a local optimum or the local minimum." "It sends out that at local optimum your derivative would be equal to zero." 
"Since it's that slope where it's that tangent point so the slope of this line will be equal to zero and thus this derivative term." "Is equal to zero." "And so, in your gradient descent update, you have theta one, gives update that theta one, minus alpha times zero." "And so, what this means is that, if you're already at a local optimum, it leaves theta one unchanged 'cause this, you know, gets the update's theta one equals theta one." "So if your parameter is already at a local minimum, one step of gradient descent does absolutely nothing." "It doesn't change the parameter, which is, which is what you want." "Cuz it keeps your solution at the local optimum." "This also explains why gradient descent can converge the local minimum, even with the learning rate Alpha fixed." "Here's what I mean by that." "Let's look at an example." "So here's a cost function J with theta." "That maybe I want to minimize and let's say I initialize my algorithm my gradient descent algorithm, you know, out there at that magenta point." "If I take one step of gradient descent you know, maybe it'll take me to that point cuz my derivative's pretty steep out there, right?" "Now I'm at this green point and if I take another step of gradient descent, you notice that my derivative, meaning the slope, is less steep at the green point when compared to at the magenta point out there, right?" "Because as I approach the minimum my derivative gets closer and closer to zero as I approach the minimum." "So, after one step of gradient descent, my new derivative is a little bit smaller." "So I wanna take another step of gradient descent." "I will naturally take a somewhat smaller step from this green point than I did from the magenta point." "Now I'm at the new point, the red point, and then now even closer to global minimums, so the derivative here will be even smaller than it was at the green point." "So when I take another step of gradient descent, you know, now my derivative term is even smaller, and so the magnitude of the update to theta one is even smaller, so you can take small step like so, and as gradient descent runs." "You will automatically take smaller and smaller steps until eventually you are taking very small steps, you know, and you find the converge to the to the local minimum." "So, just to recap." "In gradient descent as we approach the local minimum, grading descent will automatically take smaller steps and that's because as we approach the local minimum, by definition, local minimum is when you have this derivative equal to zero." "So as we approach the local minimum this derivative term will automatically get smaller and so gradient descent will automatically take smaller step." "So, this is what gradient descent looks like, and so actually there is no need to decrease alpha overtime." "So, that's the gradient descent algorithm, and you can use it to minimize, to try to minimize any cost function J. Not the cost function J to be defined for linear regression." "In the next video, we're going to take the function J, and set that back to be exactly linear regression's cost function." "The, the square cost function that we came up with earlier." "And taking gradient descent, and the square cost function, and putting them together." "That will give us our first learning algorithm, that'll give us our linear regression algorithm." "In previous videos, we talked about the gradient descent algorithm and talked about the linear regression model and the squared error cost function." 
"In this video, we're going to put together gradient descent with our cost function, and that will give us an algorithm for linear regression for fitting a straight line to our data." "So, this is what we worked out in the previous videos." "That's our gradient descent algorithm, which should be familiar, and you see the linear linear regression model with our linear hypothesis and our squared error cost function." "What we're going to do is apply gradient descent to minimize" "our squared error cost function." "Now, in order to apply gradient descent, in order to write this piece of code, the key term we need is this derivative term over here." "So, we need to figure out what is this partial derivative term, and plug in the definition of the cost function J, this turns out to be this "inaudible"" "equals sum 1 through M of this squared error" "cost function term, and all" "I did here was I just you know plugged in the definition of the cost function there, and simplifying little bit more, this turns out to be equal to, this" ""inaudible" equals sum 1 through M of tetha zero plus tetha one, XI minus YI squared." "And all I did there was took the definition for my hypothesis and plug that in there." "And it turns out we need to figure out what is the partial derivative of two cases for J equals" "0 and for J equals 1 want to figure out what is this partial derivative for both the theta(0) case and the theta(1) case." "And I'm just going to write out the answers." "It turns out this first term simplifies to 1/M, sum over my training set of just that, X(i)" " Y(i)." "And for this term, partial derivative with respect to theta(1), it turns out I get this term:" "Y(i)X(i)." "Okay." "And computing these partial derivatives, so going from this equation to either of these equations down there, computing those partial derivative terms requires some multivariate calculus." "If you know calculus, feel free to work through the derivations yourself and check take the derivatives you actually get the answers that I got." "But if you are less familiar with calculus you don't worry about it, and it is fine to just take these equations worked out, and you won't need to know calculus or anything like that in order to" "do the homework, so to implement gradient descent you'd get that to work." "But so, after these definitions, or after what we've worked out to be the derivatives, which is really just the slope of the cost function J. We can now plug them back into our gradient descent algorithm." "So here's gradient descent, or the regression, which is going to repeat until convergence, theta 0 and theta one get updated as, you know, the same minus alpha times the derivative term." "So, this term here." "So, here's our linear regression algorithm." "This first term here that term is, of course, just a partial derivative of respective theta zero, that we worked on in the previous slide." "And this second term here, that term is just a partial derivative with respect to theta one that we worked out on the previous line." "And just as a quick reminder, you must, when implementing gradient descent, there's actually there's detail that, you know, you should be implementing it so the update theta zero and theta one simultaneously." "So, let's see how gradient descent works." "One of the issues we solved gradient descent is that it can be susceptible to local optima." 
"So, when I first explained gradient descent, I showed you this picture of it, you know, going downhill on the surface and we saw how, depending on where you're initializing, you can end up with different local optima." "You know, you can end up here or here." "But, it turns out that the cost function for gradient of cost function for linear regression is always going to be a bow-shaped function like this." "The technical term for this is that this is called a convex function." "And I'm not going to give the formal definition for what is a convex function, c-o-n-v-e-x, but informally a convex function means a bow-shaped function, you know, kind of like a bow shaped." "And so, this function doesn't have any local optima, except for the one global optimum." "And does gradient descent on this type of cost function which you get whenever you're using linear regression, it will always convert to the global optimum, because there are no other local optima other than global optimum." "So now, let's see this algorithm in action." "As usual, here are plots of the hypothesis function and of my cost function J." "And so, let's see how to initialize my parameters at this value." "You know, let's say, usually you initialize your parameters at zero for zero, theta zero and zero." "For illustration in this specific presentation, I have initialised theta zero at about 900, and theta one at about minus 0.1, okay?" "And so, this corresponds to H over X, equals, you know, minus 900 minus 0.1 x is this line, so out here on the cost function." "Now if we take one step of gradient descent, we end up going from this point out here, a little bit to the down left to that second point over there." "And, you notice that my line changed a little bit." "And, as I take another step at gradient descent, my line on the left will change." "Right." "And I have also moved to a new point on my cost function." "And as I think further step is gradient descent, I'm going down in cost, right, so my parameter is following this trajectory, and if you look on the left, this corresponds to hypotheses that seem" "to be getting to be better and better fits for the data until eventually," "I have now wound up at the global minimum." "And this global minimum corresponds to this hypothesis, which gives me a good fit to the data." "And so that's gradient descent, and we've just run it and gotten a good fit to my data set of housing prices." "And you can now use it to predict." "You know, if your friend has a house with a size 1250 square feet, you can now read off the value and tell them that, I don't know, maybe they can get" "$350,000 for their house." "Finally, just to give this another name, it turns out that the algorithm that we just went over is sometimes called batch gradient descent." "And it turns out in machine learning, I feel like us machine learning people, we're not always created has given me some algorithms." "But the term batch gradient descent means that refers to the fact that, in every step of gradient descent we're looking at all of the training examples." "So, in gradient descent, you know, when computing derivatives, we're computing these sums, this sum of." "So, in every separate gradient descent, we end up computing something like this, that sums over our M training examples." "And so the term batch gradient descent refers to the fact when looking at the entire batch of training examples, and again, this is really, really not a great name, but this is what Machine Learning people call it." 
"And it turns out there are sometimes other versions of gradient descent that are not batch versions but instead do not look at the entire traning set but look at small subsets of the training sets at the time, and we'll talk about those versions later in this course as well." "But for now, using the algorithm you just learned, now we're using batch gradient descent, you now know how to implement gradient descent, or linear regression." "So that's linear regression with gradient descent." "If you've seen advanced linear algebra before so some you may have taken a class with advanced linear algebra, you might know that there exists a solution for numerically solving for the minimum of the cost function" "J, without needing to use and iterative algorithm like gradient descent." "Later in this course we will talk about that method as well that just solves for the minimum cost function J without needing this multiple steps of gradient descent." "That other method is called normal equations methods." "And, but in case you have heard of that method, it turns out gradient descent will scale better to larger data sets than that normal equals method and, now that we know about gradient descent, we'll" "be able to use it in lots of different contexts, and we'll use it in lots of different Machine Learning problems as well." "So, congrats on learning about your first Machine Learning algorithm." "We'll later have exercises in which we'll ask you to implement gradient descent and hopefully see these algorithms work for yourselves." "But before that I first want to tell you in the next set of videos, the first want to tell you about a generalization of the gradient descent algorithm that will make it much more powerful and I guess I will tell you about that in the next video." "You now know about linear regression and gradient descent." "The plan from here on out is to tell you about a couple of important extensions of these ideas." "Concretely here they are." "First it turns out that in order to solve this minimization problem, turns out there's an algorithm for solving for theta zero and theta one exactly without needing an iterative algorithm." "Without needing this algorithm like gradient descent that we had to iterate, you know, multiple times over." "So it turns out there are advantages and disadvantages of this algorithm that lets you just solve for theta zero and theta one, basically in just one shot." "One advantage is that there is no longer a learning rate alpha that you need to worry about and set." "And so it can be much faster for some problems." "We'll talk about its advantages and disadvantages later." "Second, we'll also talk about algorithms for learning with a larger number of features." "So, so far we've been learning with just one feature, the size of the house and using that to predict the price, so we're trying to take x and use that to predict y." "But for other learning problems we may have a larger number of features." "So for example let's say that you know not only the size, but also the number of bedrooms, the number of floors, and the age of these houses." "And you want to use that to predict the price of the houses." "In that case maybe we'll call these features x1, x2, x3, and x4." "So now we have, you know, four features." "We want to use these four features to predict why the price of the house." "It turns out with all of these features, four of them in this case, it turns out that with multiple features it becomes harder to plot or visualize the data." 
"So for example here we try to plot this type of data set." "Maybe we will have the vertical axis be the price and maybe we can have one axis here, and another one here where this axis is the size of the house, and that axis is the number of bedrooms." "You know, but this is just plotting, right my first two features: size and number of bedrooms." "And when we have these additional features" "I don't know, I just don't know how to plot all of these data, right cuz I need like a 4-dimensional or 5-dimensional figure." "I don't really know how to plot you know something more than like a 3-dimensional figure, like, like what I have over here." "Also as you can tell, the notation starts to get a little more complicated, right." "So rather than just having x our features we now have x1 through x4." "And we're using these subscripts to denote my four different features." "It turns out the best notation to keep all of this straight and to understand what's going on with the data even when we don't quite know how to plot it." "It turns out that the best notation is the notation of linear algebra." "Linear algebra gives us a notation and a set of things or a set of operations that we can do with matrices and vectors." "For example." "Here's a matrix where the columns of this matrix are:" "The first column is the sizes of the four houses, the second column was the number of bedrooms, that's the number of floors and that was the age of the home." "And so a matrix is a block of numbers that lets me take all of my data, all of my x's." "All of my features and organize them efficiently into sort of one big block of numbers like that." "And here is what we call a vector in linear algebra where the four numbers here are the prices of the four houses that we saw on the previous slide." "So." "In the next set of videos what I'm going to do is do a quick review of linear algebra." "If you haven't seen matrices and vectors before, so if all of this, everything on this slide is brand new to you or if you've seen linear algebra before, but it's been a while so you aren't completely familiar with it" "anymore, then please watch the next set of videos." "And I'll quickly review the linear algebra you need in order to implement and use the more powerful versions of linear regression." "It turns out linear algebra isn't just useful for linear regression models but these ideas of matrices and vectors will be useful for helping us to implement and actually get computationally efficient implementations for many later machines learning models as well." "And as you can tell these sorts of matrices and vectors will give us an efficient way to start to organize large amounts of data, when we work with larger training sets." "So, in case, in case you're not familiar with linear algebra or in case linear algebra seems like a complicated, scary concept for those of you who've never seen it before, don't worry about" "it." "It turns out in order to implement machine learning algorithms we need only the very, very basics of linear algebra and you'll be able to very quickly pick up everything you need to know in the next few videos." "Concretely, to decide if you should watch the next set of videos, here are the topics I'm going to cover." "Talk about what are matrices and vectors." "Talk about how to add, subtract, multiply matrices and vectors." "Talk about the ideas of matrix inverses and transposes." "And so, if you are not sure if you should watch the next set of videos take a look at these two things." 
"So if you think you know how to compute this quantity, this matrix transpose times another matrix." "If you think you know, if you have seen this stuff before, if you know how to compute the inverse of matrix times a vector, minus a number, times another vector." "If these two things look completely familiar to you then you can safely skip the optional set of videos on linear algebra." "But if these, concepts, if you're slightly uncertain what these blocks of numbers or these matrices of numbers mean, then please take a look of the next set of videos and, it'll very quickly teach you what you need to know about linear algebra in" "order to program machine learning algorithms and deal with large amounts of data." "Let's get started with our linear algebra review." "In this video I want to tell you what are matrices and what are vectors." "A matrix is a rectangular array of numbers written between square brackets." "So, for example, here is a matrix on the right, a left square bracket." "And then, write in a bunch of numbers." "These could be features from a learning problem or it could be data from somewhere else, but" "the specific values don't matter, and then I'm going to close it with another right bracket on the right." "And so that's one matrix." "And, here's another example of the matrix, let's write 3, 4, 5,6." "So matrix is just another way for saying, is a" "2D or a two dimensional array." "And the other piece of knowledge that we need is that the dimension of the matrix is going to be written as the number of row times the number of columns in the matrix." "So, concretely, this example on the left, this has 1, 2, 3, 4 rows and has 2 columns," "and so this example on the left is a 4 by" "2 matrix - number of rows by number of columns." "So, four rows, two columns." "This one on the right, this matrix has two rows." "That's the first row, that's the second row, and it has three columns." "That's the first column, that's the second column, that's the third column So, this second matrix we say it is a 2 by 3 matrix." "So we say that the dimension of this matrix is 2 by 3." "Sometimes you also see this written out, in the case of left, you will see this written out as R4 by 2 or concretely what people will sometimes say this matrix is an element of the set R 4 by 2." "So, this thing here, this just means the set of all matrices that of dimension" "4 by 2 and this thing on the right, sometimes this is written out as a matrix that is an R 2 by 3." "So if you ever see, 2 by 3." "So if you ever see something like this are 4 by" "2 or are 2 by 3, people are just referring to matrices of a specific dimension." "Next, let's talk about how to refer to specific elements of the matrix." "And by matrix elements, other than the matrix I just mean the entries, so the numbers inside the matrix." "So, in the standard notation, if A is this matrix here, then A sub-strip" "IJ is going to refer to the i, j entry, meaning the entry in the matrix in the ith row and jth column." "So for example a1-1 is going to refer to the entry in the 1st row and the 1st column, so that's the first row and the first column and so a1-1 is going to be equal to" "1, 4, 0, 2." "Another example, 8 1 2 is going to refer to the entry in the first row and the second column and so A" "1 2 is going to be equal to one nine one." "This come from a quick examples." "Let's see, A, oh let's say A 3 2, is going to refer to the entry in the 3rd row, and second column," "right, because that's 3 2 so that's equal to 1 4 3 7." 
"And finally, 8 4 1 is going to refer to this one right, fourth row, first column is equal to" "1 4 7 and if, hopefully you won't, but if you were to write and say well this A 4" "3, well, that refers to the fourth row, and the third column that, you know, this matrix has no third column so this is undefined," "you know, or you can think of this as an error." "There's no such element as 8 4 3, so, you know, you shouldn't be referring to 8 4 3." "So, the matrix gets you a way of letting you quickly organize, index and access lots of data." "In case I seem to be tossing up a lot of concepts, a lot of new notations very rapidly, you don't need to memorize all of this, but on the course website where we have posted the lecture notes," "we also have all of these definitions written down." "So you can always refer back, you know, either to these slides, possible coursework, so audible lecture notes if you forget well, A41 was that?" "Which row, which column was that?" "Don't worry about memorizing everything now." "You can always refer back to the written materials on the course website, and use that as a reference." "So that's what a matrix is." "Next, let's talk about what is a vector." "A vector turns out to be a special case of a matrix." "A vector is a matrix that has only 1 column so you have an N x 1 matrix, then that's a remember, right?" "N is the number of rows, and 1 here is the number of columns, so, so matrix with just one column is what we call a vector." "So here's an example of a vector, with I guess I have N equals four elements here." "so we also call this thing, another term for this is a four dmensional" "vector, just means that this is a vector with four elements, with four numbers in it." "And, just as earlier for matrices you saw this notation R3 by 2 to refer to 2 by" "3 matrices, for this vector we are going to refer to this as a vector in the set R4." "So this R4 means a set of four-dimensional vectors." "Next let's talk about how to refer to the elements of the vector." "We are going to use the notation yi to refer to the ith element of the vector y." "So if y is this vector, y subscript i is the ith element." "So y1 is the first element,four sixty, y2 is equal to the second element," "two thirty two -there's the first." "There's the second." "Y3 is equal to 315 and so on, and only y1 through y4 are defined consistency 4-dimensional vector." "Also it turns out that there are actually 2 conventions for how to index into a vector and here they are." "Sometimes, people will use one index and sometimes zero index factors." "So this example on the left is a one in that specter where the element we write is y1, y2, y3, y4." "And this example in the right is an example of a zero index factor where we start the indexing of the elements from zero." "So the elements go from a zero up to y three." "And this is a bit like the arrays of some primary languages" "where the arrays can either be indexed starting from one." "The first element of an array is sometimes a Y1, this is sequence notation I guess, and sometimes it's zero index depending on what programming language you use." "So it turns out that in most of math, the one index version is more common For a lot of machine learning applications, zero index" "vectors gives us a more convenient notation." "So what you should usually do is, unless otherwised specified," "you should assume we are using one index vectors." "In fact, throughout the rest of these videos on linear algebra review, I will be using one index vectors." 
"But just be aware that when we are talking about machine learning applications, sometimes I will explicitly say when we need to switch to, when we need to use the zero index" "vectors as well." "Finally, by convention, usually when writing matrices and vectors, most people will use upper case to refer to matrices." "So we're going to use capital letters like" "A, B, C, you know," "X, to refer to matrices, and usually we'll use lowercase, like a, b, x, y," "to refer to either numbers, or just raw numbers or scalars or to vectors." "This isn't always true but this is the more common notation where we use lower case "Y" for referring to vector and we usually use upper case to refer to a matrix." "So, you now know what are matrices and vectors." "Next, we'll talk about some of the things you can do with them" "In this video we'll talk about matrix addition and subtraction, as well as how to multiply a matrix by a number, also called Scalar Multiplication." "Let's start an example." "Given two matrices like these, let's say I want to add them together." "How do I do that?" "And so, what does addition of matrices mean?" "It turns out that if you want to add two matrices, what you do is you just add up the elements of these matrices one at a time." "So, my result of adding two matrices is going to be itself another matrix and the first element again just by taking one and four and multiplying them and adding them together, so I get five." "The second element I get by taking two and two and adding them, so I get four; three plus three plus zero is three, and so on." "I'm going to stop changing colors, I guess." "And, on the right is open five, ten and two." "And it turns out you can add only two matrices that are of the same dimensions." "So this example is a three by two matrix," "because this has 3 rows and 2 columns, so it's 3 by 2." "This is also a 3 by 2 matrix, and the result of adding these two matrices is a 3 by 2 matrix again." "So you can only add matrices of the same dimension, and the result will be another matrix that's of the same dimension as the ones you just added." "Where as in contrast, if you were to take these two matrices, so this one is a 3 by" "2 matrix, okay, 3 rows, 2 columns." "This here is a 2 by 2 matrix." "And because these two matrices are not of the same dimension, you know, this is an error, so you cannot add these two matrices and, you know, their sum is not well-defined." "So that's matrix addition." "Next, let's talk about multiplying matrices by a scalar number." "And the scalar is just a, maybe a overly fancy term for, you know, a number or a real number." "Alright, this means real number." "So let's take the number 3 and multiply it by this matrix." "And if you do that, the result is pretty much what you'll expect." "You just take your elements of the matrix and multiply them by 3, one at a time." "So, you know, one times three is three." "What, two times three is six, 3 times 3 is 9, and let's see, I'm going to stop changing colors again." "Zero times 3 is zero." "Three times 5 is 15, and 3 times 1 is three." "And so this matrix is the result of multiplying that matrix on the left by 3." "And you notice, again, this is a 3 by 2 matrix and the result is a matrix of the same dimension." "This is a 3 by 2, both of these are" "3 by 2 dimensional matrices." "And by the way, you can write multiplication, you know, either way." "So, I have three times this matrix." "I could also have written this matrix and 0, 2, 5, 3, 1, right." 
"I just copied this matrix over to the right." "I can also take this matrix and multiply this by three." "So whether it's you know, 3 times the matrix or the matrix times three is the same thing and this thing here in the middle is the result." "You can also take a matrix and divide it by a number." "So, turns out taking this matrix and dividing it by four, this is actually the same as taking the number one quarter, and multiplying it by this matrix." "4, 0, 6, 3 and so, you can figure the answer, the result of this product is, one quarter times four is one, one quarter times zero is zero." "One quarter times six is, what, three halves, about six over four is three halves, and one quarter times three is three quarters." "And so that's the results of computing this matrix divided by four." "Vectors give you the result." "Finally, for a slightly more complicated example, you can also take these operations and combine them together." "So in this calculation, I have three times a vector plus a vector minus another vector divided by three." "So just make sure we know where these are, right." "This multiplication." "This is an example of scalar multiplication because I am taking three and multiplying it." "And this is, you know, another scalar multiplication." "Or more like scalar division, I guess." "It really just means one zero times this." "And so if we evaluate these two operations first, then what we get is this thing is equal to, let's see, so three times that vector is three, twelve, six, plus my vector in the middle which" "is a 005 minus one, zero, two-thirds, right?" "And again, just to make sure we understand what is going on here, this plus symbol, that is matrix addition, right?" "I really, since these are vectors, remember, vectors are special cases of matrices, right?" "This, you can also call this vector addition This minus sign here, this is again a matrix subtraction, but because this is an n by 1, really a three by one matrix, that this is actually a vector, so this is" "also vector, this column." "We call this matrix a vector subtraction, as well." "OK?" "And finally to wrap this up." "This therefore gives me a vector, whose first element is going to be 3+0-1, so that's 3-1, which is 2." "The second element is 12+0-0, which is 12." "And the third element of this is, what, 6+5-(2/3), which is 11-(2/3), so that's 10 and one-third and see, you close this square bracket." "And so this gives me a 3 by 1 matrix, which is also just called a 3 dimensional vector, which is the outcome of this calculation over here." "So that's how you add and subtract matrices and vectors and multiply them by scalars or by row numbers." "So far I have only talked about how to multiply matrices and vectors by scalars, by row numbers." "In the next video we will talk about a much more interesting step, of taking" "2 matrices and multiplying 2 matrices together." "In this video, I'd like to start talking about how to multiply together two matrices." "We'll start with a special case of that, of matrix vector multiplication - multiplying a matrix together with a vector." "Let's start with an example." "Here is a matrix, and here is a vector, and let's say we want to multiply together this matrix with this vector, what's the result?" "Let me just work through this example and then we can step back and look at just what the steps were." "It turns out the result of this multiplication process is going to be, itself, a vector." 
"And I'm just going work with this first and later we'll come back and see just what I did here." "To get the first element of this vector I am going to take these two numbers and multiply them with the first row of the matrix and add up the corresponding numbers." "Take one multiplied by one, and take three and multiply it by five, and that's what, that's one plus fifteen so that gives me sixteen." "I'm going to write sixteen here." "then for the second row, second element, I am going to take the second row and multiply it by this vector, so I have four times one, plus zero times five, which is equal to four, so you'll have four there." "And finally for the last one I have two one times one five, so two by one, plus one by 5, which is equal to a 7, and so I get a 7 over there." "It turns out that the results of multiplying that's a 3x2 matrix by a" "2x1 matrix is also just a two-dimensional vector." "The result of this is going to be a 3x1 matrix, so that's why three by one 3x1 matrix, in other words a 3x1 matrix is just a three dimensional vector." "So I realize that I did that pretty quickly, and you're probably not sure that you can repeat this process yourself, but let's look in more detail at what just happened and what this process of multiplying a matrix by a vector looks like." "Here's the details of how to multiply a matrix by a vector." "Let's say I have a matrix A and want to multiply it by a vector x." "The result is going to be some vector y." "So the matrix A is a m by n dimensional matrix, so m rows and n columns and we are going to multiply that by a n by 1 matrix, in other words an n dimensional vector." "It turns out this" ""n" here has to match this "n" here." "In other words, the number of columns in this matrix, so it's the number of n columns." "The number of columns here has to match the number of rows here." "It has to match the dimension of this vector." "And the result of this product is going to be an n-dimensional vector y." "Rows here." ""M" is going to be equal to the number of rows in this matrix "A"." "So how do you actually compute this vector "Y"?" "Well it turns out to compute this vector "Y", the process is to get "Y""I", multiply "A's"" ""I'th" row with the elements of the vector "X"" "and add them up." "So here's what I mean." "In order to get the first element of "Y", that first number--whatever that turns out to be--we're gonna take the first row of the matrix "A" and multiply them one at a time with the elements of this vector "X"." "So I take this first number multiply it by this first number." "Then take the second number multiply it by this second number." "Take this third number whatever that is, multiply it the third number and so on until you get to the end." "And I'm gonna add up the results of these products and the result of paying that out is going to give us this first element of "Y"." "Then when we want to get the second element of "Y", let's say this element." "The way we do that is we take the second row of" "A and we repeat the whole thing." "So we take the second row of A, and multiply it elements-wise, so the elements of X and add up the results of the products and that would give me the second element of Y. And you keep going to get and we" "going to take the third row of A, multiply element Ys with the vector x, sum up the results and then" "I get the third element and so on, until I get down to the last row like so, okay?" "So that's the procedure." "Let's do one more example." 
"Here's the example:" "So let's look at the dimensions." "Here, this is a three by four dimensional matrix." "This is a four-dimensional vector, or a 4 x 1 matrix, and so the result of this, the result of this product is going to be a three-dimensional vector." "Write, you know, the vector, with room for three elements." "Let's do the, let's carry out the products." "So for the first element, I'm going to take these four numbers and multiply them with the vector X. So I have" "1x1, plus 2x3, plus 1x2, plus 5x1, which is equal to - that's" "1+6, plus 2+6, which gives me 14." "And then for the second element, I'm going to take this row now and multiply it with this vector (0x1)+3." "All right, so 0x1+ 3x3 plus" "0x2 plus 4x1, which is equal to, let's see that's 9+4, which is 13." "And finally, for the last element, I'm going to take this last row, so I have minus one times one." "You have minus two, or really there's a plus next to a two I guess." "Times three plus zero times two plus zero times one, and so that's going to be minus one minus six, which is going to make this seven, and so that's vector seven." "Okay?" "So my final answer is this vector fourteen, just to write to that without the colors, fourteen, thirteen, negative seven." "And as promised, the result here is a three by one matrix." "So that's how you multiply a matrix and a vector." "I know that a lot just happened on this slide, so if you're not quite sure where all these numbers went, you know, feel free to pause the video you know, and so take a slow careful look at this" "big calculation that we just did and try to make sure that you understand the steps of what just happened to get us these numbers,fourteen, thirteen and eleven." "Finally, let me show you a neat trick." "Let's say we have a set of four houses so 4 houses with 4 sizes like these." "And let's say I have a hypotheses for predicting what is the price of a house, and let's say I want to compute, you know, H of X for each of my 4 houses here." "It turns out there's neat way of posing this, applying this hypothesis to all of my houses at the same time." "It turns out there's a neat way to pose this as a" "Matrix Vector multiplication." "So, here's how I'm going to do it." "I am going to construct a matrix as follows." "My matrix is going to be 1111 times, and I'm going to write down the sizes of my four houses here and" "I'm going to construct a vector as well, And my vector is going to this vector of two elements, that's minus 40 and 0.25." "That's these two co-efficients;" "data 0 and data 1." "And what I am going to do is to take matrix and that vector and multiply them together, that times is that multiplication symbol." "So what do I get?" "Well this is a four by two matrix." "This is a two by one matrix." "So the outcome is going to be a four by one vector, all right." "So, let me, so this is going to be a 4 by" "1 matrix is the outcome or really a four diminsonal vector, so let me write it as one of my four elements in my four real numbers here." "Now it turns out and so this first element of this result, the way I am going to get that is, I am going to take this and multiply it by the vector." "And so this is going to be -40 x" "1 + 4.25 x 2104." "By the way, on the earlier slides I was writing 1 x -40 and" "2104 x 0.25, but the order doesn't matter, right?" "40 x 1 is the same as 1 x -40." "And this first element, of course, is "H" applied to 2104." "So it's really the predicted price of my first house." "Well, how about the second element?" 
"Hope you can see where I am going to get the second element." "Right?" "I'm gonna take this and multiply it by my vector." "And so that's gonna be" "40 x 1 + 0.25 x 1416." "And so this is going be "H" of 1416." "Right?" "And so on for the third and the fourth elements of this 4 x 1 vector." "And just there, right?" "This thing here that I just drew the green box around, that's a real number, OK?" "That's a single real number, and this thing here that" "I drew the magenta box around--the purple, magenta color box around--that's a real number, right?" "And so this thing on the right--this thing on the right overall, this is a" "4 by 1 dimensional matrix, was a 4 dimensional vector." "And, the neat thing about this is that when you're actually implementing this in software--so when you have four houses and when you want to use your hypothesis to predict the prices, predict the price "Y" of all of these four houses." "What this means is that, you know, you can write this in one line of code." "When we talk about octave and program languages later, you can actually, you'll actually write this in one line of code." "You write prediction equals my, you know, data matrix times parameters, right?" "Where data matrix is this thing here, and parameters is this thing here, and this times is a matrix vector multiplication." "And if you just do this then this variable prediction - sorry for my bad handwriting - then just implement this one line of code assuming you have an appropriate library to do matrix vector multiplication." "If you just do this, then prediction becomes this" "4 by 1 dimensional vector, on the right, that just gives you all the predicted prices." "And your alternative to doing this as a matrix vector multiplication would be to write eomething like" ", you know, for I equals 1 to 4, right?" "And you have say a thousand houses it would be for I equals 1 to a thousand or whatever." "And then you have to write a prediction, you know, if I equals." "and then do a bunch more work over there and it turns out that When you have a large number of houses, if you're trying to predict the prices of not just four but maybe of a thousand houses then" "it turns out that when you implement this in the computer, implementing it like this, in any of the various languages." "This is not only true for" "Octave, but for Supra Server" "Java or Python, other high-level, other languages as well." "It turns out, that, by writing code in this style on the left, it allows you to not only simplify the code, because, now, you're just writing one line of code rather than the form of a bunch of things inside." "But, for subtle reasons, that we will see later, it turns out to be much more computationally efficient to make predictions on all of the prices of all of your houses doing it the way on the left than the" "way on the right than if you were to write your own formula." "I'll say more about this later when we talk about vectorization, but, so, by posing a prediction this way, you get not only a simpler piece of code, but a more efficient one." "So, that's it for matrix vector multiplication and we'll make good use of these sorts of operations as we develop the living regression in other models further." "But, in the next video we're going to take this and generalize this to the case of matrix matrix multiplication." "In this video we talk about matrix, matrix multiplication or how to multiply two matrices together." 
"When we talk about the method in linear regression for how to solve for the parameters, theta zero and theta one, all in one shot." "So, without needing an iterative algorithm like gradient descent." "When we talk about that algorithm, it turns out that matrix, matrix multiplication is one of the key steps that you need to know." "So, let's, as usual, start with an example." "Let's say I have two matrices and I want to multiply them together." "Let me again just reference this example and then I'll tell you in a little bit what happens." "So, the first thing" "I'm gonna do is, I'm going to pull out the first column of this matrix on the right." "And I'm going to take this matrix on the left and multiply it by, you know, a vector." "That's just this first column, OK?" "And it turns out if I do that I am going to get the vector 11, 9." "So, this is the same matrix vector multiplication as you saw in the last videos." "I worked this out in advance so, I know it's 11, 9." "And, then, the second thing" "I'm going to do is, I'm going to pull out the second column, this matrix on the right and" "I am then going to take this matrix on the left, right, so, it will be that matrix, and multiply it by that second column on the right." "So, again, this is a matrix vector multiplication set, which you saw from the previous video, and it turns out that if you multiply this matrix and this vector, you get 10," "14 and by the way, if you want to practice your matrix vector multiplication, feel free to pause the video and check this product yourself." "Then, I'm just going to take these two results and put them together, and that will be my answer." "So, turns out the outcome of this product is going to be a 2 by 2 matrix, and" "The way I am going to fill in this matrix is just by taking my elements 11," "9 and plugging them here, and taking 10, 14 and plugging them into the second column." "Okay?" "So, that was the mechanics of how to multiply a matrix by another matrix." "You basically look at the second matrix one column at a time, and you assemble the answers." "And again, we will step through this much more carefully in a second, but I just want to point out also, this first example is a 2x3 matrix matrix." "Multiplying that by a 3x2 matrix, and the outcome of this product, it turns out to be a 2x2 matrix." "And again, we'll see in a second why this was the case." "All right." "That was the mechanics of the calculation." "Let's actually look at the details and look at what exactly happened." "Here are details." "I have a matrix A and" "I want to multiply that with a matrix B, and the result will be some new matrix C. And it turns out you can only multiply together matrices whose dimensions match so A is an m by n matrix," "so m columns, n columns and" "I am going to multiply that with an n by o and it turns out this n here must match this n here, so the number of columns in first matrix must equal to the number of rows in second matrix." "And the result of this product will be an M by O matrix, like the the matrix C here." "And, in the previous video, everything we did corresponded to this special case of OB equal to 1." "Okay?" "That was, that was in case of B being a vector." "But now, we are going to view of the case of values of O larger than 1." "So, here's how you multiply together the two matrices." 
"In order to get, what" "I am going to do is" "I am going to take the first column of B and treat that as a vector, and multiply the matrix A, with the first column of B, and the result of that will be a M by 1 vector," "and we're going to put that over here." "Then, I'm going to take the second column of B, right, so, this is another n by one vector, so, this column here, this is right, n by one, those are n dimensional" "vector, gonna multiply this matrix with this n by one vector." "The result will be a M dimensional vector, which we'll put there." "And, so on." "Okay?" "And, so, you know, and then" "I'm going to take the third column, multiply it by this matrix, I get a M dimensional vector." "And so on, until you get to the last column times, the matrix times the lost column gives you the lost column of C." "Just to say that again." "The ith column of the matrix C is attained by taking the matrix A and multiplying the matrix A with the ith column of the matrix B for the values of I equals 1, 2 up through O. Okay ?" "So, this is just a summary of what we did up there in order to compute the matrix" "C. Let's look at just one more example." "Let 's say, I want to multiply together these two matrices." "So, what I'm going to do is, first pull out the first column of my second matrix, that was matrix B, that was my matrix B on the previous slide." "And, I therefore, have this matrix times my vector and so, oh, let's do this calculation quickly." "There's going to be equal to, right, 1, 3 times 0," "3 so that gives 1 times 0, plus 3 times 3." "And, the second element is going to be 2," "5 times 0, 3 so, that's going to be two times 0 plus 5 times 3 and that is" "9,15, actually didn't write that in green, so this is nine fifteen, and then mix." "I am going to pull out the second column of this, and do the corresponding calculation so there's this matrix times this vector 1, 2." "Let's also do this quickly, so that's one times one plus three times two." "So that deals with that row, let's do the other one, so let's see, that gives me two times one plus times two, so that is going to be equal to, let's see," "one times one plus three times one is four and two times one plus five times two is twelve." "So now I have these two you, and so my outcome, so the product of these two matrices is going to be, this goes here and this goes here, so I get nine fifteen and four twelve and you" "may notice also that the result of multiplying a 2x2 matrix with another 2x2 matrix." "The resulting dimension is going to be that first two times that second two, so the result is itself also a two by two matrix." "Finally let me show you one more neat trick you can do with matrix matrix multiplication." "Let's say as before that we have four houses whose prices we want to predict, only now we have three competing hypothesis shown here on the right, so if you want to So apply all" "3 competing hypotheses to all four of the houses, it turns out you can do that very efficiently using a matrix matrix multiplication so here on the left is my usual matrix, same as from the last video where these values" "are my housing prices and I put ones there on the left as well." "And, what I'm going to do is construct another matrix, where here these, the first column, is this minus" "40 and two five and the second column is this two hundred open one and so on and it turns out that if you multiply these two matrices what you find is that, this first column, you know," "oh, well how do you get this first column, right?" 
"A procedure from matrix matrix multiplication is the way you get this first column, is you take this matrix and you multiply it by this first column, and we saw in the previous video that this is exactly the predicted" "housing prices of the first hypothesis, right?" "Of this first hypothesis here." "And, how about a second column?" "Well, how do setup the second column?" "The way you get the second column is, well, you take this matrix and you multiply by this second column." "And so this second column turns out to be the predictions of the second hypothesis of the second hypothesis up there, and similarly for the third column." "And so, I didn't step through all the details but hopefully you just, feel free to pause the video and check the math yourself and check that what I just claimed really is true." "But it turns out that by constructing these two matrices, what you can therefore do is very quickly apply all three hypotheses to all four house sizes to get, you know, all twelve predicted prices output by your" "three hypotheses on your four houses." "So one matrix multiplications that you manage to make 12 predictions and, even better, it turns out that in order to do that matrix multiplication and there are lots of good linear algebra libraries in order to do this" "multiplication step for you, and no matter so pretty much any reasonable programming language that you might be using." "Certainly all the top ten most popular programming languages will have great linear algebra libraries." "And they'll be good thing are highly optimized in order to do that, matrix matrix multiplication very efficiently, including taking, taking advantage of any parallel computation that your computer may be capable of, when your computer has multiple" "calls or lots of multiple processors, within a processor sometimes there's there's parallelism as well called symdiparallelism [sp]." "The computer take care of and you should, there are very good free libraries that you can use to do this matrix matrix multiplication very efficiently so that you can very efficiently, you know, makes lots of predictions of lots of hypotheses." "Matrix multiplication is really useful since you can pack a lot of computation into just one matrix multiplication operation." "But you should be careful of how you use them." "In this video I want to tell you about a few properties of matrix multiplication." "When working with just raw numbers or when working with scalars, multiplication is commutative." "And what I mean by that is if you take three times five, that is equal to five times three and the ordering of this multiplication doesn't matter." "And this is called the commutative property of multiplication of real numbers." "It turns out this property that you can, you know, reverse the order in which you multiply things, this is not true for matrix multiplication.So concretely, if A and" "B are matrices, then in general, A times B is not equal to B times" "A. So just be careful of that." "It's not okay to arbitrarily reverse the order in which you are multiplying matrices." "So, we say that matrix multiplication is not commutative, it's a fancy way of saying it." "As a concrete example, here are two matrices, matrix 1100 times 0020, and if you multiply these two matrices, you get this result on the right." "Now, let's swap around the order of these two matrices." "So, I'm going to take these two matrices and just reverse them." 
"It turns out if you multiply these two matrices, you get the second answer on the right and, you know, real clearly, these two matrices are not equal to each other." "So, in fact, in general, if you have a matrix operation like" "A times B. If A is an m by n matrix and B is an by" "M matrix, just as an example." "Then, it turns out that the matrix A times" "B right, is going to be an m by m matrix, where as the matrix b x a is going to be an n by n matrix so the dimensions don't even match, right, so A times B and" "B times A may not even be the same dimension." "In the example on the left," "I have all two by two matrices, so the dimensions were the same, but in general reversing the order of the matrices can even change the dimension of the outcome so matrix multiplication is not commutative." "Here's the next I want to talk about." "So, when talking about real numbers, or scalars, let's see, I have 3 times 5 times 2." "I can either multiply 5 times 2 first, and" "I can compute this as 3 times 10." "Or, I can multiply three times five for us and" "I can compute this as, you know fifteen times two and both of these give you the same answer, right?" "Each, both of these is equal to thirty so Whether I multiply five times two first or whether I multiply three times five first because well, three times five times two is equal to three times five times two." "And this is called the associative property of role number multiplication." "It turns out that matrix multiplication is associative." "So concretely, let's say" "I have a product of three matrices, A times B times" "C. Then I can compute this either as A times, B times C or I can compute this as" "A times B, times C and these will actually give me the same answer." "I'm not going to prove this, but you can just take my word for it, I guess." "So just be clear what I mean by these two cases, let's look at first one first case." "What I mean by that is if you actually want to compute" "A times B times C, what you can do is you can first compute B times C." "So that D equals B time" "C, then compute A times" "D. And so this is really computing a times B times C. Or, for this second case, You can compute this as, you can set E equals A times B. Then compute E times C. And this is then the same as a" "times B times C and it turns out that both of these options will give you, is guaranteed to give you the same answer." "And so we say that matrix multiplication does enjoy the associative property." "Okay?" "And don't worry about the terminology associative and commutative that's why there's not really going to use this terminology later in these class, so don't worry about memorizing those terms." "Finally, I want to tell you about the identity matrix, which is special matrix." "So let's again make the analogy to what we know of raw numbers, so when dealing with raw numbers or scalar numbers, the number one, is you can think of it as the identity of multiplication," "and what I mean by that is for any number" "Z, the number 1 times z is equal to z times one, and that's just equal to the number z, right, for any raw number." "Z. So 1 is the identity operation and so it satisfies this equation." "So it turns out that in the space of matrices as an identity matrix as well." "And it's unusually denoted i, or sometimes we write it as i of n by n we want to make explicit the dimensions." "So I subscript n by n is the n by n identity matrix." "And so there's a different identity matrix for each dimension n and are a few examples." 
"Here's the two by two identity matrix, here's the three by three identity matrix, here's the four by four identity matrix." "So the identity matrix, has the property that it has ones along the diagonals," "right, and so on and is zero everywhere else, and so, by the way the one by one identity matrix is just a number one." "This is one by one matrix just and it's not a very interesting identity matrix and informally when I or others are being sloppy, very often, we will write the identity matrix using fine notation." "I draw, you know, let's go back to it and just write 1111, dot, dot, dot, 1 and then we'll, maybe, somewhat sloppily write a bunch of zeros there." "And these zeros, this big zero, this big zero that's meant to denote that this matrix is zero everywhere except for the diagonals, so this is just how I might sloppily write this identity matrix" "She says property that for any matrix A, A times identity i times A" "A. So that's a lot like this equation that we have up here." "One times z equals z times one, equals z itself so" "I times A equals A times I equals A. Just make sure we have the dimensions right, so if A is a n by n matrix, then this identity matrix that's an m by n identity matrix." "And if A is m by n then this identity matrix, right, for matrix multiplication make sense that has a m by n matrix because this m has a match up that m And in either case the outcome" "of this process is you get back to Matrix A, which is m by n." "So whenever we write the identity matrix I, you know, very often the dimension rightwill be implicit from the context." "So these two I's they' re actually different dimension matrices, one may be N by N, the other is M by M But when we want to make the dimension of the matrix explicit, then sometimes" "we'll write to this I subscript" "N by N, kind of like we have up here." "But very often the dimension will be implicit." "Finally, just want to point out that earlier I said that A times B is not in general equal to B times A, right?" "That for most matrices A and B, this is not true." "But when B is the identity matrix, this does hold true." "That A times the identity matrix does indeed equal to identity times A, it's just that this is not true for other matrices, B in general." "So that's it for the properties of matrix multiplication." "And the special matrices, like the identity matrix I want to tell you about, in the next and final video now linear algebra review." "I am going to quickly tell you about a couple of special matrix operations, and after that you know everything you need to know about linear algebra for this course" "In this video, I want to tell you about a couple of special matrix operations, called the matrix inverse and the matrix transpose operation." "Let's start by talking about matrix inverse, and as usual we'll start by thinking about how it relates to real numbers." "In the last video, I said that the number one plays the role of the identity in the space of real numbers because one times anything is equal to itself." "It turns out that real numbers have this property that very number have an, that each number has an inverse, for example, given the number three, there exists some number, which happens to be three inverse so that" "that number times gives you back the identity element one." "And so to me, inverse of course this is just one third." "And given some other number, maybe twelve there is some number which is the inverse of twelve written as twelve to the minus one, or really this is just one twelve." 
"So that when you multiply these two things together." "the product is equal to the identity element one again." "Now it turns out that in the space of real numbers, not everything has an inverse." "For example the number zero does not have an inverse, right?" "Because zero's a zero inverse, one over zero that's undefined." "Like this one over zero is not well defined." "And what we want to do, in the rest of this slide, is figure out what does it mean to compute the inverse of a matrix." "Here's the idea:" "If" "A is a n by n matrix, and it has an inverse, I will say a bit more about that later, then the inverse is going to be written A to the minus one and A times this inverse, A to" "the minus one, is going to equal to A inverse times" "A, is going to give us back the identity matrix." "Okay?" "Only matrices that are m by m for some the idea of M having inverse." "So, a matrix is" "M by M, this is also called a square matrix and it's called square because the number of rows is equal to the number of columns." "Right and it turns out only square matrices have inverses, so A is a square matrix, is m by m, on inverse this equation over here." "Let's look at a concrete example, so let's say I have a matrix, three, four, two, sixteen." "So this is a two by two matrix, so it's a square matrix and so this may just could have an and it turns out that I happen to know the inverse of this matrix is zero point four, minus zero point" "one, minus zero point zero five, zero zero seven five." "And if I take this matrix and multiply these together it turns out what I get is the two by two identity matrix, I, this is I two by two." "Okay?" "And so on this slide, you know this matrix is the matrix A, and this matrix is the matrix A-inverse." "And it turns out if that you are computing A times A-inverse, it turns out if you compute A-inverse times" "A you also get back the identity matrix." "So how did I find this inverse or how did I come up with this inverse over here?" "It turns out that sometimes you can compute inverses by hand but almost no one does that these days." "And it turns out there is very good numerical software for taking a matrix and computing its inverse." "So again, this is one of those things where there are lots of open source libraries that you can link to from any of the popular programming languages to compute inverses of matrices." "Let me show you a quick example." "How I actually computed this inverse, and what I did was I used software called Optive." "So let me bring that up." "We will see a lot about Optive later." "Let me just quickly show you an example." "Set my matrix A to be equal to that matrix on the left, type three four two sixteen, so that's my matrix A right." "This is matrix 34, 216 that I have down here on the left." "And, the software lets me compute the inverse of A very easily." "It's like P over A equals this." "And so, this is right, this matrix here on my four minus, on my one, and so on." "This given the numerical solution to what is the inverse of A. 
"So, having computed the inverse of A, I can now verify that A times A inverse is the identity: I type A times the inverse of A, and the result is this matrix, with 1, 1 on the diagonal and, essentially, ten to the minus seventeen, ten to the minus sixteen off the diagonal." "So up to numerical precision, up to a little bit of round-off error my computer had in computing the inverse, the numbers off the diagonal are essentially zero, and A times the inverse is essentially the identity matrix." "You can also verify that the inverse of A times A is equal to the identity: ones on the diagonal, and values that are essentially zero, except for a little bit of round-off error, on the off-diagonals." "When I defined the inverse of a matrix, I had the caveat that A must be a square matrix, and the caveat 'if A has an inverse'." "Exactly which matrices have an inverse is beyond the scope of this linear algebra review, but one intuition you can take away is that, just as the number zero doesn't have an inverse, if A is, say, the matrix of all zeros, then A also does not have an inverse, because there is no matrix you can multiply the all-zeros matrix by to get the identity matrix; and there are a few other matrices with properties similar to this that also don't have an inverse." "In this review I don't want to go too deeply into what it means for a matrix to have an inverse, but it turns out that for our machine learning applications this shouldn't be an issue; or more precisely, for the learning algorithms where this may matter, namely where a matrix inverse appears, I will tell you, when we get to those algorithms, just what it means for a matrix to have or not have an inverse, and how to fix things in case you're working with matrices that don't have inverses." "The intuition, if you want one, is that the matrices that don't have an inverse are the ones that are somehow 'too close to zero' in some sense." "Just to wrap up the terminology: matrices that don't have an inverse are sometimes called singular matrices, or degenerate matrices, and so this all-zeros matrix over here is an example of a matrix that is singular, or degenerate." "Finally, the last special matrix operation I want to tell you about is the matrix transpose." "Suppose I have a matrix A; if I compute the transpose of A, that's what I get here on the right." "The transpose is written A superscript T, and here's how you compute it: to get the transpose, I first take the first row of A, 1, 2, 0, and that becomes the first column of the transpose; then I take the second row of A, 3, 5, 9, and that becomes the second column of the matrix A transpose." "Another way of thinking about how to compute the transpose is that you take this sort of 45 degree axis and you mirror, or flip, the matrix along that 45 degree axis." "So here's the more formal definition of a matrix transpose." "Let's say A is an m by n matrix, and let's let B equal A transpose." "Then B is going to be an n by m matrix, with the dimensions reversed, so here we have a 2x3 matrix whose transpose is a 3x2 matrix; and moreover, B i j is equal to A j i."
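"Here is that example checked in code; note that NumPy indexes from 0, so the lecture's b 1 2 is B[0, 1] here."

```python
import numpy as np

A = np.array([[1, 2, 0],
              [3, 5, 9]])    # 2x3

B = A.T                      # the transpose, a 3x2 matrix
print(B)                     # [[1 3], [2 5], [0 9]]
print(B[0, 1] == A[1, 0])    # True: b_12 equals a_21
```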
"So the IJ element of this matrix B is going to be the JI element of that earlier matrix A. So for example, B 1 2 is going to be equal to, look at this matrix, B 1 2 is going to be equal to" "this element 3 1st row, 2nd column." "And that equal to this, which is a two one, second row first column, right, which is equal to two and some of the example B 3" "2, right, that's B 3 2 is this element 9, and that's equal to a two three which is this element up here, nine." "And so that wraps up the definition of what it means to take the transpose of a matrix and that in fact concludes our linear algebra review." "So by now hopefully you know how to add and subtract matrices as well as multiply them and you also know how, what are the definitions of the inverses and transposes of a matrix and these are the main operations" "used in linear algebra for this course." "In case this is the first time you are seeing this material." "I know this was a lot of linear algebra material all presented very quickly and it's a lot to absorb but if you there's no need to memorize all the definitions we just went through and if you download the copy of either" "these slides or of the lecture notes from the course website." "and use either the slides or the lecture notes as a reference then you can always refer back to the definitions and to figure out what are these matrix multiplications, transposes and so on definitions." "And the lecture notes on the course website also has pointers to additional resources linear algebra which you can use to learn more about linear algebra by yourself." "And next with these new tools." "We'll be able in the next few videos to develop more powerful forms of linear regression that can view of a lot more data, a lot more features, a lot more training examples and later on after the new regression we'll actually" "continue using these linear algebra tools to derive more powerful learning algorithims as well in this video we will start to talk about a new version of linear regression that's more powerful." "One that works with multiple variables or with multiple features." "Here's what I mean." "In the original version of linear regression that we developed, we have a single feature x, the size of the house, and we wanted to use that to predict why the price of the house and this was" "our form of our hypothesis." "But now imagine, what if we had not only the size of the house as a feature or as a variable of which to try to predict the price, but that we also knew the number of bedrooms, the number" "of house and the age of the home and years." "It seems like this would give us a lot more information with which to predict the price." "To introduce a little bit of notation, we sort of started to talk about this earlier," "I'm going to use the variables" "X subscript 1 X subscript 2 and so on to denote my, in this case, four features and I'm going to continue to use" "Y to denote the variable, the output variable price that we're trying to predict." "Let's introduce a little bit more notation." "Now that we have four features" "I'm going to use lowercase "n"" "to denote the number of features." "So in this example we have n4 because we have, you know, one, two, three, four features." "And "n" is different from our earlier notation where we were using "n" to denote the number of examples." "So if you have 47 rows "M" is the number of rows on this table or the number of training examples." 
"So I'm also going to use X superscript" ""I" to denote the input features of the "I" training example." "As a concrete example let say" "X2 is going to be a vector of the features for my second training example." "And so X2 here is going to be a vector 1416," "3, 2, 40 since those are my four features that I have" "to try to predict the price of the second house." "So, in this notation, the superscript 2 here." "That's an index into my training set." "This is not X to the power of 2." "Instead, this is, you know, an index that says look at the second row of this table." "This refers to my second training example." "With this notation X2 is a four dimensional vector." "In fact, more generally, this is an in-dimensional feature back there." "With this notation, X2 is now a vector and so," "I'm going to use also Xi subscript J to denote the value of the J," "of feature number J and the training example." "So concretely X2 subscript 3, will refer to feature number three in the x factor which is equal to 2,right?" "That was a 3 over there, just fix my handwriting." "So x2 subscript 3 is going to be equal to 2." "Now that we have multiple features, let's talk about what the form of our hypothesis should be." "Previously this was the form of our hypothesis, where x was our single feature, but now that we have multiple features, we aren't going to use the simple representation any more." "Instead, a form of the hypothesis in linear regression" "is going to be this, can be theta 0 plus theta" "1 x1 plus theta 2 x2 plus theta 3 x3" "plus theta 4 X4." "And if we have N features then rather than summing up over our four features, we would have a sum over our N features." "Concretely for a particular setting of our parameters we may have H of" "X 80 + 0.1 X1 + 0.01x2 + 3x3 - 2x4." "This would be one example of a hypothesis and you remember a hypothesis is trying to predict the price of the house in thousands of dollars, just saying that, you know, the base price of a house is maybe 80,000 plus another open" "1, so that's an extra, what, hundred dollars per square feet, yeah, plus the price goes up a little bit for each additional floor that the house has." "X two is the number of floors, and it goes up further for each additional bedroom the house has, because" "X three was the number of bedrooms, and the price goes down a little bit with each additional age of the house." "With each additional year of the age of the house." "Here's the form of a hypothesis rewritten on the slide." "And what I'm gonna do is introduce a little bit of notation to simplify this equation." "For convenience of notation, let me define x subscript 0 to be equals one." "Concretely, this means that for every example i I have a feature vector X superscript" "I and X superscript" "I subscript 0 is going to be equal to 1." "You can think of this as defining an additional zero feature." "So whereas previously I had n features because x1, x2 through xn, I'm now defining an additional sort of zero" "feature vector that always takes on the value of one." "So now my feature vector" "X becomes this N+1 dimensional vector that is zero index." "So this is now a n+1 dimensional feature vector, but" "I'm gonna index it from 0 and I'm also going to think of my parameters as a vector." "So, our parameters here, right that would be our theta zero, theta one, theta two, and so on all the way up to theta n, we're going to gather them up into a parameter" "vector written theta 0, theta 1, theta 2, and so on, down to theta n." 
"This is another zero index vector." "It's of index signed from zero." "That is another n plus 1 dimensional vector." "So, my hypothesis cannot be written theta 0x0 plus theta 1x1+ up to theta n Xn." "And this equation is the same as this on top because, you know, eight zero is equal to one." "Underneath and I now take this form of the hypothesis and write this as either transpose x, depending on how familiar you are with inner products of vectors if you write what theta transfers x is what theta transfer and" "this is theta zero, theta one, up to theta" "N. So this thing here is theta transpose and this is actually a N plus one by one matrix." "It's also called a row vector and you take that and multiply it with the vector X which is X zero, X one, and so on, down to X n." "And so, the inner product that is theta transpose X is just equal to this." "This gives us a convenient way to write the form of the hypothesis as just the inner product between our parameter vector theta and our theta vector X. And it is this little bit of notation, this little excerpt of the" "notation convention that let us write this in this compact form." "So that's the form of a hypthesis when we have multiple features." "And, just to give this another name, this is also called multivariate linear regression." "And the term multivariable that's just maybe a fancy term for saying we have multiple features, or multivariables with which to try to predict the value Y." "In the last video we talked about the Multivariate Gaussian Distribution" "and saw some examples of the sorts of distributions you can model, as you vary the parameters, mu and sigma." "In this video, let's take those ideas, and apply them to develop a different anomaly detection algorithm." "To recap the multivariate Gaussian distribution and the multivariate normal distribution has two parameters, mu and sigma." "Where mu this an n dimensional vector and sigma, the covariance matrix, is an n by n matrix." "And here's the formula for the probability of X, as parameterized by mu and sigma, and as you vary mu and sigma, you can get a range of different distributions, like, you know, these are three examples of the" "ones that we saw in the previous video." "So let's talk about the parameter fitting or the parameter estimation problem." "The question, as usual, is if" "I have a set of examples" "X1 through XM and here each of these examples is an n dimensional vector and I think my examples come from a multivariate Gaussian distribution." "How do I try to estimate my parameters mu and sigma?" "Well the standard formulas for estimating them is you set mu to be just the average of your training examples." "And you set sigma to be equal to this." "And this is actually just like the sigma that we had written out, when we were using the PCA or the Principal Components Analysis algorithm." "So you just plug in these two formulas and this would give you your estimated parameter mu and your estimated parameter sigma." "So given the data set here is how you estimate mu and sigma." "Let's take this method and just plug it into an anomaly detection algorithm." "So how do we put all of this together to develop an anomaly detection algorithm?" "Here 's what we do." "First we take our training set, and we fit the model, we fit P of X, by, you know, setting mu and sigma as described" "on the previous slide." "Next when you are given a new example X. So if you are given a test example," "lets take an earlier example to have a new example out here." 
"And that is my test example." "Given the new example X, what we are going to do is compute" "P of X, using this formula for the multivariate Gaussian distribution." "And then, if P of" "X is very small, then we flagged it as an anomaly, whereas, if P of X is greater than that parameter epsilon, then we don't flag it as an anomaly." "So it turns out, if we were to fit a multivariate Gaussian distribution to this data set, so just the red crosses, not the green example, you end up with a Gaussian distribution that places lots of probability in the central" "region, slightly less probability here, slightly less probability here, slightly less probability here," "and very low probability at the point that is way out here." "And so, if you apply the multivariate Gaussian distribution to this example, it will actually correctly flag that example." "as an anomaly." "Finally it's worth saying a few words about what is the relationship between the multivariate Gaussian distribution model, and the original model, where we were modeling P of" "X as a product of this" "P of X1, P of X2, up to P of Xn." "It turns out that you can prove mathematically, I'm not going to do the proof here, but you can prove mathematically that this relationship, between the multivariate Gaussian model and this original one." "And in particular, it turns out that the original model corresponds to multivariate Gaussians, where the contours of the" "Gaussian are always axis aligned." "So all three of these are examples of" "Gaussian distributions that you can fit using the original model." "It turns out that that corresponds to multivariate Gaussian, where, you know, the ellipsis here, the contours of this distribution--it turns out that this model actually corresponds to a special case of a multivariate Gaussian distribution." "And in particular, this special case is defined by constraining" "the distribution of p of x, the multivariate a Gaussian distribution of p of x, so that the contours of the probability density function, of the probability distribution function, are axis aligned." "And so you can get a p of x with a multivariate Gaussian that looks like this, or like this, or like this." "And you notice, that in all 3 of these examples, these ellipses, or these ovals that I'm drawing, have their axes aligned with the X1 X2 axes." "And what we do not have, is a set of contours that are at an angle, right?" "And this corresponded to examples where sigma is equal to 1 1, 0.8, 0.8." "Let's say, with non-0 elements on the off diagonals." "So, it turns out that it's possible to show mathematically that this model actually is the same as a multivariate Gaussian distribution but with a constraint." "And the constraint is that the covariance matrix sigma must have 0's on the off diagonal elements." "In particular, the covariance matrix sigma, this thing here, it would be sigma squared 1, sigma squared 2, down to sigma squared n, and then everything on the off diagonal entries, all of these elements" "above and below the diagonal of the matrix, all of those are going to be zero." "And in fact if you take these values of sigma, sigma squared 1, sigma squared 2, down to sigma squared n, and plug them into here, and you know, plug them into this covariance matrix, then the" "two models are actually identical." 
"That is, this new model," "using a multivariate Gaussian distribution, corresponds exactly to the old model, if the covariance matrix sigma, has only" "0 elements off the diagonals, and in pictures that corresponds to having Gaussian distributions," "where the contours of this distribution function are axis aligned." "So you aren't allowed to model the correlations between the diffrent features." "So in that sense the original model is actually a special case of this multivariate Gaussian model." "So when would you use each of these two models?" "So when would you the original model and when would you use the multivariate Gaussian model?" "The original model is probably used somewhat more often," "and whereas the multivariate Gaussian distribution is used somewhat less but it has the advantage of being able to capture correlations between features." "So suppose you want to capture anomalies where you have different features say where features x1, x2 take on unusual combinations of values so in the earlier example, we had that example where the anomaly was with the CPU load and the memory use taking on unusual combinations of values, if" "you want to use the original model to capture that, then what you need to do is create an extra feature, such as X3 equals X1/X2, you know equals maybe the CPU load divided by the memory used, or something, and you" "need to create extra features if there's unusual combinations of values where X1 and X2 take on an unusual combination of values even though X1 by itself and X2 by itself" "looks like it's taking a perfectly normal value." "But if you're willing to spend the time to manually create an extra feature like this, then the original model will work fine." "Whereas in contrast, the multivariate Gaussian model can automatically capture correlations between different features." "But the original model has some other more significant advantages, too, and one huge advantage of the original model" "is that it is computationally cheaper, and another view on this is that is scales better to very large values of n and very large numbers of features, and so even if n were ten thousand," "or even if n were equal to a hundred thousand, the original model will usually work just fine." "Whereas in contrast for the multivariate Gaussian model notice here, for example, that we need to compute the inverse of the matrix sigma where sigma is an n by n matrix" "and so computing sigma if sigma is a hundred thousand by a hundred thousand matrix that is going to be very computationally expensive." "And so the multivariate Gaussian model scales less well to large values of N. And finally for the original model, it turns out to work out ok even if you have a relatively small training set this is the small unlabeled" "examples that we use to model p of x of course, and this works fine, even if" "M is, you know, maybe 50, 100, works fine." "Whereas for the multivariate Gaussian, it is sort of a mathematical property of the algorithm that you must have m greater than n, so that the number of examples is greater than the number of features you have." "And there's a mathematical property of the way we estimate the parameters" "that if this is not true, so if m is less than or equal to n, then this matrix isn't even invertible, that is this matrix is singular, and so you can't even use the" "multivariate Gaussian model unless you make some changes to it." 
"But a typical rule of thumb that I use is, I will use the multivariate Gaussian model only if m is much greater than n, so this is sort of the" "narrow mathematical requirement, but in practice, I would use the multivariate Gaussian model, only if m were quite a bit bigger than n." "So if m were greater than or equal to 10 times n, let's say, might be a reasonable rule of thumb, and if" "it doesn't satisfy this, then the multivariate Gaussian model has a lot of parameters, right, so this covariance matrix sigma is an n by n matrix, so it has, you know, roughly n squared parameters, because it's a symmetric matrix," "it's actually closer to n squared over 2 parameters, but this is a lot of parameters, so you need make sure you have a fairly large value for m, make sure you have enough data to fit all these parameters." "And m greater than or equal to 10 n would be a reasonable rule of thumb to make sure that you can estimate this covariance matrix sigma reasonably well." "So in practice the original model shown on the left that is used more often." "And if you suspect that you need to capture correlations between features what people will often do is just manually design extra features like these to capture specific unusual combinations of values." "But in problems where you have a very large training set or m is very large and n is not too large, then the multivariate Gaussian model is well worth considering and may work better as well, and can" "save you from having to spend your time to manually create extra features in case" "the anomalies turn out to be captured by unusual combinations of values of the features." "Finally I just want to briefly mention one somewhat technical property, but if you're fitting multivariate Gaussian model, and if you find that the covariance matrix sigma is singular, or you find it's non-invertible, they're usually 2 cases for this." "One is if it's failing to satisfy this m greater than n condition, and the second case is if you have redundant features." "So by redundant features, I mean, if you have 2 features that are the same." "Somehow you accidentally made two copies of the feature, so your x1 is just equal to x2." "Or if you have redundant features like maybe your features X3 is equal to feature X4, plus feature X5." "Okay, so if you have highly redundant features like these, you know, where if X3 is equal to X4 plus X5, well X3 doesn't contain any extra information, right?" "You just take these 2 other features, and add them together." "And if you have this sort of redundant features, duplicated features, or this sort of features, than sigma may be non-invertible." "And so there's a debugging set-- this should very rarely happen, so you probably won't run into this, it is very unlikely that you have to worry about this-- but in case you implement a" "multivariate Gaussian model you find that sigma is non-invertible." "What I would do is first make sure that M is quite a bit bigger than N, and if it is then, the second thing I do, is just check for redundant features." "And so if there are 2 features that are equal, just get rid of one of them, or if you have redundant if these" ", X3 equals X4 plus X5, just get rid of the redundant feature, and then it should work fine again." "As an aside for those of you who are experts in linear algebra, by redundant features, what I mean is the formal term is features that are linearly dependent." 
"But in practice what that really means is one of these problems tripping up the algorithm if you just make you features non-redundant., that should solve the problem of sigma being non-invertable." "But once again the odds of your running into this at all are pretty low so chances are, you can just apply the multivariate Gaussian model, without having to worry about sigma being non-invertible, so long as m is greater" "than or equal to n." "So that's it for anomaly detection, with the multivariate Gaussian distribution." "And if you apply this method you would be able to have an anomaly detection algorithm that automatically captures positive and negative correlations between your different features and flags an anomaly if it sees is unusual combination of the values of the features." "In the previous video, we talked about the form of the hypothesis for linear regression with multiple features or with multiple variables." "In this video, let's talk about how to fit the parameters of that hypothesis." "In particular let's talk about how to use gradient descent for linear regression with multiple features." "To quickly summarize our notation, this is our formal hypothesis in multivariable linear regression where we've adopted the convention that x0=1." "The parameters of this model are theta0 through theta n, but instead of thinking of this as n separate parameters, which is valid, I'm instead going to think of the parameters as theta where theta here is a n+1-dimensional vector." "So I'm just going to think of the parameters of this model as itself being a vector." "Our cost function is J of theta0 through theta n which is given by this usual sum of square of error term." "But again instead of thinking of J as a function of these n+1 numbers, I'm going to more commonly write J as just a function of the parameter vector theta so that theta here is a vector." "Here's what gradient descent looks like." "We're going to repeatedly update each parameter theta j according to theta j minus alpha times this derivative term." "And once again we just write this as J of theta, so theta j is updated as theta j minus the learning rate alpha times the derivative, a partial derivative of the cost function with respect to the parameter theta j." "Let's see what this looks like when we implement gradient descent and, in particular, let's go see what that partial derivative term looks like." "Here's what we have for gradient descent for the case of when we had N=1 feature." "We had two separate update rules for the parameters theta0 and theta1, and hopefully these look familiar to you." "And this term here was of course the partial derivative of the cost function with respect to the parameter of theta0, and similarly we had a different update rule for the parameter theta1." "There's one little difference which is that when we previously had only one feature, we would call that feature x(i) but now in our new notation we would of course call this x(i)1 to denote our one feature." "So that was for when we had only one feature." "Let's look at the new algorithm for we have more than one feature, where the number of features n may be much larger than one." "We get this update rule for gradient descent and, maybe for those of you that know calculus, if you take the definition of the cost function and take the partial derivative of the cost function J with respect to the parameter" "theta j, you'll find that that partial derivative is exactly that term that" "I've drawn the blue box around." 
"And if you implement this you will get a working implementation of gradient descent for multivariate linear regression." "The last thing I want to do on this slide is give you a sense of why these new and old algorithms are sort of the same thing or why they're both similar algorithms or why they're both gradient descent algorithms." "Let's consider a case where we have two features or maybe more than two features, so we have three update rules for the parameters theta0, theta1, theta2 and maybe other values of theta as well." "If you look at the update rule for theta0, what you find is that this update rule here is the same as the update rule that we had previously for the case of n = 1." "And the reason that they are equivalent is, of course, because in our notational convention we had this x(i)0 = 1 convention, which is why these two term that I've drawn the magenta boxes around are equivalent." "Similarly, if you look the update rule for theta1, you find that this term here is equivalent to the term we previously had, or the equation or the update rule we previously had for theta1, where of course we're just using this new notation x(i)1 to denote" "our first feature, and now that we have more than one feature we can have similar update rules for the other parameters like theta2 and so on." "There's a lot going on on this slide so I definitely encourage you if you need to to pause the video and look at all the math on this slide slowly to make sure you understand everything that's going on here." "But if you implement the algorithm written up here then you have a working implementation of linear regression with multiple features." "In this video and in the video after this one, I wanna tell you about some of the practical tricks for making gradient descent work well." "In this video, I want to tell you about an idea called feature skill." "Here's the idea." "If you have a problem where you have multiple features, if you make sure that the features are on a similar scale, by which I mean make sure that the different features take on similar ranges of values," "then gradient descents can converge more quickly." "Concretely let's say you have a problem with two features where X1 is the size of house and takes on values between say zero to two thousand and two is the number of bedrooms, and maybe that takes on values between one and five." "If you plot the contours of the cos function J of theta," "then the contours may look like this, where, let's see," "J of theta is a function of parameters theta zero, theta one and theta two." "I'm going to ignore theta zero, so let's about theta 0 and pretend as a function of only theta 1 and theta" "2, but if x1 can take on them, you know, much larger range of values and x2 It turns out that the contours of the cause function J of theta" "can take on this very very skewed elliptical shape, except that with the so 2000 to" "5 ratio, it can be even more secure." "So, this is very, very tall and skinny ellipses, or these very tall skinny ovals, can form the contours of the cause function J of theta." "And if you run gradient descents on this cos-function, your gradients may end up taking a long time and can oscillate back and forth and take a long time before it can finally find its way to the global minimum." 
"In fact, you can imagine if these contours are exaggerated even more when you draw incredibly skinny, tall skinny contours," "and it can be even more extreme than, then, gradient descent just have a much harder time taking it's way, meandering around, it can take a long time to find this way to the global minimum." "In these settings, a useful thing to do is to scale the features." "Concretely if you instead define the feature X one to be the size of the house divided by two thousand, and define X two to be maybe the number of bedrooms divided by five, then the count well as of the" "cost function J can become much more, much less skewed so the contours may look more like circles." "And if you run gradient descent on a cost function like this, then gradient descent," "you can show mathematically, you can find a much more direct path to the global minimum rather than taking a much more convoluted path where you're sort of trying to follow a much more complicated trajectory to get to the global minimum." "So, by scaling the features so that there are, the consumer ranges of values." "In this example, we end up with both features, X one and X two, between zero and one." "You can wind up with an implementation of gradient descent." "They can convert much faster." "More generally, when we're performing feature scaling, what we often want to do is get every feature into approximately a -1 to +1 range and concretely, your feature x0 is always equal to 1." "So, that's already in that range, but you may end up dividing other features by different numbers to get them to this range." "The numbers -1 and +1 aren't too important." "So, if you have a feature, x1 that winds up being between zero and three, that's not a problem." "If you end up having a different feature that winds being between -2 and + 0.5, again, this is close enough to minus one and plus one that, you know, that's fine, and that's fine." "It's only if you have a different feature, say X 3 that is between, that" "ranges from -100 tp +100" ", then, this is a very different values than minus 1 and plus 1." "So, this might be a less well-skilled feature and similarly, if your features take on a very, very small range of values so if X 4 takes on values between minus" "0.0001 and positive 0.0001, then again this takes on a much smaller range of values than the minus one to plus one range." "And again I would consider this feature poorly scaled." "So you want the range of values, you know, can be bigger than plus or smaller than plus one, but just not much bigger, like plus" "100 here, or too much smaller like 0.00 one over there." "Different people have different rules of thumb." "But the one that I use is that if a feature takes on the range of values from say minus three the plus" "3 how you should think that should be just fine, but maybe it takes on much larger values than plus 3 or minus 3 unless not to worry and if it takes on values from say minus one-third to one-third." "You know, I think that's fine too or 0 to one-third or minus one-third to 0." "I guess that's typical range of value sector 0 okay." "But it will take on a much tinier range of values like x4 here than gain on mine not to worry." "So, the take-home message is don't worry if your features are not exactly on the same scale or exactly in the same range of values." "But so long as they're all close enough to this gradient descent it should work okay." 
"In addition to dividing by so that the maximum value when performing feature scaling sometimes people will also do what's called mean normalization." "And what I mean by that is that you want to take a feature Xi and replace it with Xi minus new i" "to make your features have approximately 0 mean." "And obviously we want to apply this to the future x zero, because the future x zero is always equal to one, so it cannot have an average value of zero." "But it concretely for other features if the range of sizes of the house takes on values between 0 to 2000 and if you know, the average size of a house is equal to" "1000 then you might use this formula." "Size, set the feature" "X1 to the size minus the average value divided by 2000 and similarly, on average if your houses have one to five bedrooms and if" "on average a house has two bedrooms then you might use this formula to mean normalize your second feature x2." "In both of these cases, you therefore wind up with features x1 and x2." "They can take on values roughly between minus .5 and positive .5." "Exactly not true" " X2 can actually be slightly larger than .5 but, close enough." "And the more general rule is that you might take a feature X1 and replace" "it with X1 minus mu1 over S1 where to define these terms mu1 is the average value of x1" "in the training sets and S1 is the range of values of that feature and by range, I mean let's say the maximum value minus the minimum value or for those of you that understand the deviation of the variable is setting S1" "to be the standard deviation of the variable would be fine, too." "But taking, you know, this max minus min would be fine." "And similarly for the second feature, x2, you replace x2 with this sort of" "subtract the mean of the feature and divide it by the range of values meaning the max minus min." "And this sort of formula will get your features, you know, maybe not exactly, but maybe roughly into these sorts of ranges, and by the way, for those of you that are being super careful technically if" "we're taking the range as max minus min this five here will actually become a four." "So if max is 5 minus 1 then the range of their own values is actually equal to 4, but all of these are approximate and any value that gets the features into anything close to these sorts of ranges will do fine." "And the feature scaling doesn't have to be too exact, in order to get gradient descent to run quite a lot faster." "So, now you know about feature scaling and if you apply this simple trick, it and make gradient descent run much faster and converge in a lot fewer other iterations." "That was feature scaling." "In the next video, I'll tell you about another trick to make gradient descent work well in practice." "In this video, I wanna give you more practical tips for getting gradient descent to work." "The ideas in this video will center around the learning rate alpha." "Concretely, here's the gradient descent update rule and what" "I want to do in this video is tell you about what" "I think of as debugging and some tips for making sure that" "Gradient Descent is working correctly and second, I want to tell you how to choose the rates out for, but this is how" "I go about choosing it." "Here's something that I often do to make sure gradient descent is working correctly." "The job of gradient descent is to find a value of theta for you that, you know, hopefully minimizes the cost function j of theta." "What I often do is therefore pluck the cost function j of theta as gradient descent runs." 
"So, the x-axis here is the number of iteration of gradient descent and as gradient descent runs, you'll hopefully get a plot that maybe looks like this." "Notice that the x-axis is a number of iterations previously we were looking at plots of" "J of theta where the" "X-axis, where the horizontal axis, was the parameter vector theta but this is not where this is." "Concretely, what this point is is I'm going to rank gradient descent for hundred iterations." "And whatever value I get for theta after a hundred of the rations and get, you know, some value of theta after a hundred iterations and I'm going to evaluate the cost function J of theta for the value of theta I get" "after a hundred iterations and this vertical height is the value of J of theta for the value of theta I got after a hundred other ratios of gradient descent and this point here, that corresponds to the value of J of" "theta for the theta that I get after I've run grade and descent for two hundred iterations." "So what this plot is showing, is it's showing the value of your cost function after iteration of gradient descent." "And, if gradient descent is working properly, then J of theta should decrease." "after every iteration." "And one useful thing that this sort of plot can tell you also is that if you look at the specific figure that I've drawn, it looks like by the time you've gotten out to three hundred iterations," "between three and four hundred iterations, in this segment, it looks like J of theta hasn't gone down much more." "So by the time you get to four hundred iterations, it looks like this curve has flattened out here." "And so, way out here at four hundred iterations, it looks like grade and descend has more or less converged because your cost function isn't going down much more." "So looking at this figure can also help you judge whether or not gradient descent has converged." "By the way, the number of iterations that gradient descent takes to converge for a physical application can vary a lot." "So maybe for one application gradient descent may converge after just thirty iterations, for a different application gradient descent made the 3,000 iterations." "For another learning algorithm it may take three million iterations." "It turns out to be very difficult to tell in advance how many iterations gradient descent needs to converge, and is usually by plotting this sort of plot." "Plotting the cause function as we increase the number of iterations." "It's usually by looking at these plots that I tried to tell if gradient descent has converged." "It is also possible to come up with automatic convergence test; namely to have an algorithm to try to tell you if gradient descent has converged and here's maybe a pretty typical example of an automatic convergence test and" "so, you test the clear convergence if your cause function jf theta decreases by less than some small value epsilon, some small value ten to the minus three in one iteration, but I find that usually choosing what this threshold is is pretty difficult." "So, in order to check your gradient descent has converged, I actually tend to look at plots like like this figure on the left rather than rely on an automatic convergence test." "Looking at this sort of figure can also tell you or give you an advanced warning if maybe gradient descent is not working correctly." 
"Concretely, if you plug jf theta as a function of number of iterations, then, if you see a figure like this, where J of theta is actually increasing, then that gives you a clear sign that gradient descent is not working." "And a figure like this usually means that you should be using smaller learning rate alpha." "If J of theta is actually increasing, the most common cause for that is if you're trying to minimize the function that maybe looks like this." "That's if your learning rate is too big then if you start off there, gradient descent may overshoot the minimum, send you there, then if only there's too big, you may overshoot again, it will send you there and" "so on so that what you really wanted was really start here and for to slowly go downhill." "But if the learning is too big then gradient descent can instead keep on over shooting the minimum so that you actually end up getting worse and worse instead of getting the higher values of the cost function j of theta" "so do you end up with a plot like and if you see a plot like this the fix usually is to just use a smaller value of alpha." "Oh, and also of course make sure that your code does not have a bug in it." "But usually to watch it out of the firms is the most common, could be a common problem." "Similarly, sometimes, you may also see j of theta do something like this and it go down for a while then go up then go down for a while then go up." "Go down for a while, it goes up and so on and and to fix for something like this is also to use a smaller value of alpha." "I'm not going to prove it here, but undeniable assumptions about the cost function, which does proof of linear regression." "You can show of mathematicians have shown that if your learning rate offer is small enough then j of theta should decrease on every single iteration." "So, if this doesn't happen, probably means algorithm is too big then you should send a smaller, but of course, you all So you don't want your learning rate to be too small because if you" "do that, if you were to do that, then gradient descent can be slow to converge." "And if alpha were too small, you might end up starting out here, say, and, you know, end up taking just minuscule, minuscule baby steps." "Right?" "And just taking a lot of iterations before you finally get to the minimum." "And so, if alpha is too small, gradient descent can make very slow progress and be slow to converge." "To summarize, if the learning rate is too small, you can have a slow convergence problem, and if the learning rate is too large, j of theta may not decrease on every iteration and may not even converge." "In some cases, if the learning rate is too large, slow convergence is also possible, but the more common problem you see is that just that j of theta may not decrease on every iteration." "And in order to debug all of these things, often plotting that j of theta as a function of the number of iterations can help you figure out what's going on." "Concretely, what I actually do when I run gradient descent is I would try a range of values." "So just try running gradient descent with a range of values for alpha, like 0.001, 0.01, so these are a factor of 10 differences, and for these differences of this of alpha, just plot j of" "theta as a function of number of iterations and then pick the value of alpha that, you know, seems to be causing j of theta to decrease rapidly." "In fact, what I do actually isn't these steps of ten." "So, you know, this is a scale factor of ten if you reach the top." 
"What I'll actually do is try this range of values and so on where this is, you know, 0.001 then increase the linear rate to" "3.4 to get 0.03 and then to step up this is another roughly 3 fold increase point of 0.03 to 0.01s and so these are roughly, you know, trying out gradient descents with each value I try being about" "3X bigger than the previous value." "So what I'll do is a range of values until I've made sure that I've found one value that is too small and made sure" "I found one value that is too large, and then I sort of try to pick the largest possible value or just something slightly smaller than the largest reasonable value that I found." "And when I do that usually it just gives me a good learning rate for my problem." "And if you do this too, hopefully you will be able to choose a good learning rate for your implementation of gradient descent." "You now know about linear regression with multiple variables." "In this video, I wanna tell you a bit about the choice of features that you have and how you can get different learning algorithm, sometimes very powerful ones by choosing appropriate features." "And in particular I also want to tell you about polynomial regression allows you to use the machinery of linear regression to fit very complicated, even very non-linear functions." "Let's take the example of predicting the price of the house." "Suppose you have two features, the frontage of house and the depth of the house." "So, here's the picture of the house we're trying to sell." "So, the frontage is defined as this distance is basically the width or the length of how wide your lot is if this that you own, and the depth of the house is how deep your property is, so" "there's a frontage, there's a depth." "called frontage and depth." "You might build a linear regression model like this where frontage is your first feature x1 and and depth is your second feature x2, but when you're applying linear regression, you don't necessarily have to use just the features x1 and x2 that you're given." "What you can do is actually create new features by yourself." "So, if I want to predict the price of a house, what I might do instead is decide that what really determines the size of the house is the area or the land area that I own." "So, I might create a new feature." "I'm just gonna call this feature x which is frontage, times depth." "This is a multiplication symbol." "It's a frontage x depth because this is the land area that I own and I might then select my hypothesis as that using just one feature which is my land area, right?" "Because the area of a rectangle is you know, the product of the length of the size So, depending on what insight you might have into a particular problem, rather than just taking the features [xx]" "that we happen to have started off with, sometimes by defining new features you might actually get a better model." "Closely related to the idea of choosing your features is this idea called polynomial regression." "Let's say you have a housing price data set that looks like this." "Then there are a few different models you might fit to this." "One thing you could do is fit a quadratic model like this." "It doesn't look like a straight line fits this data very well." "So maybe you want to fit a quadratic model like this where you think the size, where you think the price is a quadratic function and maybe that'll give you, you know, a fit to the data that looks like that." 
"But then you may decide that your quadratic model doesn't make sense because of a quadratic function, eventually this function comes back down and well, we don't think housing prices should go down when the size goes up too high." "So then maybe we might choose a different polynomial model and choose to use instead a cubic function, and where we have now a third-order term and we fit that, maybe we get this sort of model, and maybe the" "green line is a somewhat better fit to the data cause it doesn't eventually come back down." "So how do we actually fit a model like this to our data?" "Using the machinery of multivariant linear regression, we can do this with a pretty simple modification to our algorithm." "The form of the hypothesis we, we know how the fit looks like this, where we say" "H of x is theta zero plus theta one x one plus x two theta X3." "And if we want to fit this cubic model that" "I have boxed in green, what we're saying is that to predict the price of a house, it's theta 0 plus theta" "1 times the size of the house plus theta 2 times the square size of the house." "So this term is equal to that term." "And then plus theta 3 times the cube of the size of the house raises that third term." "In order to map these two definitions to each other, well, the natural way to do that is to set the first feature x one to be the size of the house, and set the second feature x two" "to be the square of the size of the house, and set the third feature x three to be the cube of the size of the house." "And, just by choosing my three features this way and applying the machinery of linear regression, I can fit this model and end up with a cubic fit to my data." "I just want to point out one more thing, which is that if you choose your features like this, then feature scaling becomes increasingly important." "So if the size of the house ranges from one to a thousand, so, you know, from one to a thousand square feet, say, then the size squared of the house will range from one to one" "million, the square of a thousand, and your third feature x cubed, excuse me you, your third feature x three, which is the size cubed of the house, will range from one two ten to" "the nine, and so these three features take on very different ranges of values, and it's important to apply feature scaling if you're using gradient descent to get them into comparable ranges of values." "Finally, here's one last example of how you really have broad choices in the features you use." "Earlier we talked about how a quadratic model like this might not be ideal because, you know, maybe a quadratic model fits the data okay, but the quadratic function goes back down and we really don't want, right," "housing prices that go down, to predict that, as the size of housing freezes." "But rather than going to a cubic model there, you have, maybe, other choices of features and there are many possible choices." "But just to give you another example of a reasonable choice, another reasonable choice might be to say that the price of a house is theta zero plus theta one times the size, and then plus theta two times the square root of the size, right?" "So the square root function is this sort of function, and maybe there will be some value of theta one, theta two, theta three, that will let you take this model and, for the curve that looks" "like that, and, you know, goes up, but sort of flattens out a bit and doesn't ever come back down." 
"And, so, by having insight into, in this case, the shape of a square root function, and, into the shape of the data, by choosing different features, you can sometimes get better models." "In this video, we talked about polynomial regression." "That is, how to fit a polynomial, like a quadratic function, or a cubic function, to your data." "Was also throw out this idea, that you have a choice in what features to use, such as that instead of using the frontish and the depth of the house, maybe, you can multiply them together to get" "a feature that captures the land area of a house." "In case this seems a little bit bewildering, that with all these different feature choices, so how do I decide what features to use." "Later in this class, we'll talk about some algorithms were automatically choosing what features are used, so you can have an algorithm look at the data and automatically choose for you whether you want to fit a quadratic function, or a cubic function, or something else." "But, until we get to those algorithms now I just want you to be aware that you have a choice in what features to use, and by designing different features you can fit more complex functions your data then just fitting a" "straight line to the data and in particular you can put polynomial functions as well and sometimes by appropriate insight into the feature simply get a much better model for your data." "In this video, we'll talk about the normal equation, which for some linear regression problems, will give us a much better way to solve for the optimal value of the parameters theta." "Concretely, so far the algorithm that we've been using for linear regression is gradient descent where in order to minimize the cost function" "J of Theta, we would take this iterative algorithm that takes many steps, multiple iterations of gradient descent to converge to the global minimum." "In contrast, the normal equation would give us a method to solve for theta analytically, so that rather than needing to run this iterative algorithm, we can instead just solve for the optimal value for theta all at one go, so that in" "basically one step you get to the optimal value right there." "It turns out the normal equation that has some advantages and some disadvantages, but before we get to that and talk about when you should use it, let's get some intuition about what this method does." "For this week's planetary example, let's imagine, let's take a very simplified cost function" "J of Theta, that's just the function of a real number Theta." "So, for now, imagine that Theta is just a scalar value or that Theta is just a row value." "It's just a number, rather than a vector." "Imagine that we have a cost function J that's a quadratic function of this real value parameter Theta, so J of Theta looks like that." "Well, how do you minimize a quadratic function?" "For those of you that know a little bit of calculus, you may know that the way to minimize a function is to take derivatives and to set derivatives equal to zero." "So, you take the derivative of J with respect to the parameter of Theta." "You get some formula which I am not going to derive, you set that derivative equal to zero, and this allows you to solve for the value of Theda that minimizes J of Theta." "That was a simpler case of when data was just real number." 
"In the problem that we are interested in, Theta is no longer just a real number, but, instead, is this n+1-dimensional parameter vector, and, a cost function J is a function of this vector" "value or Theta 0 through" "Theta m." "And, a cost function looks like this, some square cost function on the right." "How do we minimize this cost function J?" "Calculus actually tells us that, if you, that one way to do so, is to take the partial derivative of J, with respect to every parameter of Theta J in turn, and then, to set" "all of these to 0." "If you do that, and you solve for the values of" "Theta 0, Theta 1, up to Theta N, then, this would give you that values of Theta to minimize the cost function J. Where, if you actually work through the calculus and work through the solution to the parameters" "Theta 0 through Theta N, the derivation ends up being somewhat involved." "And, what I am going to do in this video, is actually to not go through the derivation, which is kind of long and kind of involved, but what I want to do is just tell you what you need to know" "in order to implement this process so you can solve for the values of the thetas that corresponds to where the partial derivatives is equal to zero." "Or alternatively, or equivalently, the values of Theta is that minimize the cost function J of Theta." "I realize that some of the comments I made that made more sense only to those of you that are normally familiar with calculus." "So, but if you don't know, if you're less familiar with calculus, don't worry about it." "I'm just going to tell you what you need to know in order to implement this algorithm and get it to work." "For the example that I want to use as a running example let's say that" "I have m = 4 training examples." "In order to implement this normal equation at big, what I'm going to do is the following." "I'm going to take my data set, so here are my four training examples." "In this case let's assume that, you know, these four examples is all the data I have." "What I am going to do is take my data set and add an extra column that corresponds to my extra feature, x0, that is always takes on this value of 1." "What I'm going to do is" "I'm then going to construct a matrix called X that's a matrix are basically contains all of the features from my training data, so completely here is my here are all my features and we're going to take all those numbers and" "put them into this matrix "X", okay?" "So just, you know, copy the data over one column at a time and then I am going to do something similar for y's." "I am going to take the values that I'm trying to predict and construct now a vector, like so and call that a vector y." "So X is going to be a m by (n+1) - dimensional matrix, and" "Y is going to be a m-dimensional vector where m is the number of training examples and n is, n is a number of features, n+1, because of this extra feature X0 that I had." "Finally if you take your matrix X and you take your vector Y, and if you just compute this, and set theta to be equal to" "X transpose X inverse times" "X transpose Y, this would give you the value of theta that minimizes your cost function." "There was a lot that happened on the slides and" "I work through it using one specific example of one dataset." "Let me just write this out in a slightly more general form and then let me just, and later on in this video let me explain this equation a little bit more." "It is not yet entirely clear how to do this." 
"In a general case, let us say we have M training examples so X1, Y1 up to" "Xn, Yn and n features." "So, each of the training example x(i) may looks like a vector like this, that is a n+1 dimensional feature vector." "The way I'm going to construct the matrix "X", this is also called the design matrix is as follows." "Each training example gives me a feature vector like this." "say, sort of n+1 dimensional vector." "The way I am going to construct my design matrix x is only construct the matrix like this." "and what I'm going to do is take the first training example, so that's a vector, take its transpose so it ends up being this, you know, long flat thing and make x1 transpose the first row of my design matrix." "Then I am going to take my second training example, x2, take the transpose of that and put that as the second row of x and so on, down until my last training example." "Take the transpose of that, and that's my last row of my matrix X. And, so, that makes my matrix X, an" "M by N +1 dimensional matrix." "As a concrete example, let's say I have only one feature, really, only one feature other than X zero, which is always equal to 1." "So if my feature vectors" "X-i are equal to this 1, which is X-0, then some real feature, like maybe the size of the house, then my design matrix, X, would be equal to this." "For the first row, I'm going to basically take this and take its transpose." "So, I'm going to end up with 1, and then X-1-1." "For the second row, we're going to end up with 1 and then" "X-1-2 and so on down to 1, and then X-1-M." "And thus, this will be a m by 2-dimensional matrix." "So, that's how to construct the matrix X. And, the vector Y--sometimes I might write an arrow on top to denote that it is a vector, but very often I'll just write this as Y, either way." "The vector Y is obtained by taking all all the labels, all the correct prices of houses in my training set, and just stacking them up into an M-dimensional vector, and that's Y. Finally, having constructed the matrix X" "and the vector Y, we then just compute theta as X'(1/X) x X'Y." "I just want to make" "I just want to make sure that this equation makes sense to you and that you know how to implement it." "So, you know, concretely, what is this X'(1/X)?" "Well, X'(1/X) is the inverse of the matrix X'X." "Concretely, if you were to say set A to be equal to X' x" "X, so X' is a matrix, X' x X gives you another matrix, and we call that matrix A. Then, you know, X'(1/X) is just you take this matrix A and you invert it, right!" "This gives, let's say 1/A." "And so that's how you compute this thing." "You compute X'X and then you compute its inverse." "We haven't yet talked about Octave." "We'll do so in the later set of videos, but in the" "Octave programming language or a similar view, and also the matlab programming language is very similar." "The command to compute this quantity," "X transpose X inverse times" "X transpose Y, is as follows." "In Octave X prime is the notation that you use to denote X transpose." "And so, this expression that's boxed in red, that's computing" "X transpose times X." "pinv is a function for computing the inverse of a matrix, so this computes" "X transpose X inverse, and then you multiply that by" "X transpose, and you multiply that by Y. 
"So you end up computing that formula, which I didn't prove; but it is possible to show mathematically, even though I'm not going to do so here, that this formula gives you the optimal value of theta, in the sense that if you set theta equal to this, that's the value of theta that minimizes the cost function J of theta for linear regression." "One last detail: in an earlier video, I talked about feature scaling and the idea of getting features onto similar ranges of values as each other." "If you are using the normal equation method, then feature scaling isn't actually necessary; it's actually okay if, say, some feature x1 ranges from zero to one, some feature x2 ranges from zero to one thousand, and some feature x3 ranges from zero to ten to the minus five." "If you are using the normal equation method, this is okay, and there is no need to do feature scaling; although, of course, if you are using gradient descent, then feature scaling is still important." "Finally, when should you use gradient descent, and when should you use the normal equation method?" "Here are some of their advantages and disadvantages." "Let's say you have m training examples and n features." "One disadvantage of gradient descent is that you need to choose the learning rate alpha, and often this means running it a few times with different learning rates and seeing what works best; so that's extra work and extra hassle." "Another disadvantage of gradient descent is that it needs many more iterations; so, depending on the details, it could be slower, although there's more to the story, as we'll see in a second." "As for the normal equation, you don't need to choose any learning rate alpha, which makes it really convenient and simple to implement: you just run it and it usually just works." "And you don't need to iterate, so you don't need to plot J of theta or check convergence or take any of those extra steps." "So far, the balance seems to favor the normal equation." "Here are some disadvantages of the normal equation, and some advantages of gradient descent." "Gradient descent works pretty well even when you have a very large number of features; even if you have millions of features, you can run gradient descent and it will be reasonably efficient; it will do something reasonable." "In contrast, for the normal equation, in order to solve for the parameters theta, we need to compute this term, X transpose X, inverse." "The matrix X transpose X is an n by n matrix, if you have n features." "Because if you look at the dimensions of X transpose and the dimensions of X, and figure out what the dimension of the product is, the matrix X transpose X is an n by n matrix, where n is the number of features; and for most computed implementations, the cost of inverting a matrix grows roughly as the cube of the dimension of the matrix." "So computing this inverse costs roughly order n cubed time; sometimes it's slightly faster than n cubed, but it's close enough for our purposes." "So if n, the number of features, is very large, then computing this quantity can be slow, and the normal equation method can actually be much slower." "So if n is large, then I might usually use gradient descent, because we don't want to pay this order n cubed cost."
"But, if n is relatively small, then the normal equation might give you a better way to solve the parameters." "What does small and large mean?" "Well, if n is on the order of a hundred, then inverting a hundred-by-hundred matrix is no problem by modern computing standards." "If n is a thousand, I would still use the normal equation method." "Inverting a thousand-by-thousand matrix is actually really fast on a modern computer." "If n is ten thousand, then I might start to wonder." "Inverting a ten-thousand- by-ten-thousand matrix starts to get kind of slow, and I might then start to maybe lean in the direction of gradient descent, but maybe not quite." "n equals ten thousand, you can sort of convert a ten-thousand-by-ten-thousand matrix." "But if it gets much bigger than that, then, I would probably use gradient descent." "So, if n equals ten to the sixth with a million features, then inverting a million-by-million matrix is going to be very expensive, and" "I would definitely favor gradient descent if you have that many features." "So exactly how large set of features has to be before you convert a gradient descent, it's hard to give a strict number." "But, for me, it is usually around ten thousand that I might start to consider switching over to gradient descents or maybe, some other algorithms that we'll talk about later in this class." "To summarize, so long as the number of features is not too large, the normal equation gives us a great alternative method to solve for the parameter theta." "Concretely, so long as the number of features is less than 1000, you know, I would use, I would usually is used in normal equation method rather than, gradient descent." "To preview some ideas that we'll talk about later in this course, as we get to the more complex learning algorithm, for example, when we talk about classification algorithm, like a logistic regression algorithm," "We'll see that those algorithm actually..." "The normal equation method actually do not work for those more sophisticated learning algorithms, and, we will have to resort to gradient descent for those algorithms." "So, gradient descent is a very useful algorithm to know." "The linear regression will have a large number of features and for some of the other algorithms that we'll see in this course, because, for them, the normal equation method just doesn't apply and doesn't work." "But for this specific model of linear regression, the normal equation can give you a alternative" "that can be much faster, than gradient descent." "So, depending on the detail of your algortithm, depending of the detail of the problems and how many features that you have, both of these algorithms are well worth knowing about." "In this video, I want to talk about the normal equation and non-invertibility." "This is a somewhat more advanced concept, but it is something that I've often been asked about." "And so I wanted to talk about it here." "But this is a somewhat more advanced concept, so feel free to consider this optional material" "There's a phenomenon that you may run into that's maybe for some of you useful to understand." "But even if you don't understand it, the normal equation and linear regression, you should really get that to work okay." "Here's the issue:" "For those of you that are maybe somewhat more familar with linear algebra, what some students have asked me is, when computing this theta equals ( Xtranspose X )inverse Xtranspose y what if the matrix Xtranspose X is non-invertible?" 
"So, for those of you that know a bit more linear algebra you may know that only some matrices are invertible and some matrices do not have an inverse we call those non-invertible matrices, singular or degenerate matrices." "The issue or the problem of Xtranpose X being non-invertible should happen pretty rarely." "And in Octave, if you implement this to compute theta, it turns out that this will actually do the right thing." "I'm getting a little bit technical now and I don't want to go into details, but Octave has two functions for inverting matrices:" "One is called pinv(), and the other is called inv()." "The differences between these two are somewhat technical." "One's called the pseudo-inverse, one's called the inverse." "You can show mathemically so as long as you use the pinv() function, then this will actually compute the value of theta that you want, even if Xtranspose X is non-invertible." "The specific details between what is the difference between pinv() and what is inv() that is somewhat advanced numerical computing concepts, that I don't really want to get into." "But I thought in this optional video I try to give you a little bit of intuition about what it means that Xtranspose X to be non-invertible." "For those of you that know a bit more linear algebra and might be interested." "I'm not going to proove this mathematically, but if Xtranspose X is non-invertible, there are usually two most common causes:" "The first cause is if somehow, in your learning problem, you have redundant features, concretely, if you try to predict housing prices and if x1 is the size of a house in square-feet, and x2 is the size of the house in square-meters," "then, you know, 1 meter is equal to 3.28 feet, rounded to two decimals, and so your two features will always satisfy the constraint that x1 equals 3(.28)^2 times x2." "And you can show, for those of you - this is somehwat advanced linear algebra now, but if you're an expert in linear algebra, you can actually show that if your two features are related via a linear equation like this," "then matrix Xtranspose X will be non-invertible." "The second thing that can cause Xtranspose X to be non-invertible is if you're trying to run a learning algorithm with a lot of a features." "Concretely, if m is less than or equal to n." "For example, if you imagine that you have m equals 10 training examples and that you have n equals 100 features, then you're trying to fit a parameter vector theta, which is (n+1)-dimensional," "so it's a 101-dimensional you're trying to fit a 101 parameters from just 10 training examples." "And this turns out to sometimes work, but to not always be a good idea." "Because, as we see later, you might not have enough data if you only have 10 examples to fit 100 or 101 parameters." "We'll see later in this course, why this might be too little data to fit this many parameters." "But commonly, what we do then if m is less than n, is to see if we can either delete some features or to use a technique called regularization, which is something that we will talk about a bit later in this course as well," "that will kind of let you fit a lot of parameters using a lot of features even if you have a relatively small training set." "But this regularization will be a later topic in this course." 
"But to summarize, if ever you find that Xtranspose X is singular or alternatively find is non-invertible, what I would recommend you do is first:" "look at your features and see if you have redundant features like these x1 and x2 being linearly dependent, or being a linear function of each other, like so and if you do have redundant features and if you just delete one of these features " "you really don't need both of these features, so if you just delete one of these features that will solve your non-invertibility problem and, so first think through my features and check if any are redundant and if so, then, you know, keep deleting the redundant features" "until they are no longer redundant." "And if your features are non redundant," "I would check if I might have too many features, and if that's the case I would either delete some features if I can bare to use fewer features, or else I would consider using regularization, which is this topic that we will talk about later." "So, that's it for the normal equation and what it means if the matrix Xtranspose X is non-invertible." "But this is a problem that hopefully you run into pretty rarely." "And if you just implement it in Octave using the pinv() function which is called the pseudo-inverse function so you use a different linear algebra library, that is called pseudo-inverse but that implementation should just do the right thing" "even if Xtranspose X is non-invertible which should happen pretty rarily anyway so this should not be a problem for most implementations of linear regression." "You now know a bunch about machine learning." "In this video, I like to teach you a programing language," "Octave, in which you'll be able to very quickly implement the the learning algorithms we've seen already, and the learning algorithms we'll see later in this course." "In the past, I've tried to teach machine learning using a large variety of different programming languages including C++ Java," "Python, NumPy, and also" "Octave, and what I found was that students were able to learn the most productively learn the most quickly and prototype your algorithms most quickly using a relatively high level language like octave." "In fact, what I often see in Silicon Valley is that if even if you need to build." "If you want to build a large scale deployment of a learning algorithm, what people will often do is prototype and the language is Octave." "Which is a great prototyping language." "So you can sort of get your learning algorithms working quickly." "And then only if you need to a very large scale deployment of it." "Only then spend your time re-implementing the algorithm to C++ Java or some of the language like that." "Because all the lessons we've learned is that a time or develop a time." "That is your time." "The machine learning's time is incredibly valuable." "And if you can get your learning algorithms to work more quickly in Octave." "Then overall you have a huge time savings by first developing the algorithms in" "Octave, and then implementing and maybe C++ Java, only after we have the ideas working." "The most common prototyping language I see people use for machine learning are:" "Octave, MATLAB," "Python, NumPy, and R." "Octave is nice because open sourced." "And MATLAB works well too, but it is expensive for" "to many people." "But if you have access to a copy of MATLAB." "You can also use MATLAB with this class." "If you know Python, NumPy, or if you know R. I do see some people use it." 
"But, what I see is that people usually end up developing somewhat more slowly, and you know, these languages." "Because the Python, NumPy syntax is just slightly clunkier than the Octave syntax." "And so because of that, and because we are releasing starter code in Octave." "I strongly recommend that you not try to do the following exercises in this class in NumPy and R." "But that I do recommend that you instead do the programming exercises for this class in octave instead." "What I'm going to do in this video is go through a list of commands very, very quickly, and its goal is to quickly show you the range of commands and the range of things you can do in Octave." "The course website will have a transcript of everything I do, and so after watching this video you can refer to the transcript posted on the course website when you want find a command." "Concretely, what I recommend you do is first watch the tutorial videos." "And after watching to the end, then install Octave on your computer." "And finally, it goes to the course website, download the transcripts of the things you see in the session, and type in whatever commands seem interesting to you into Octave, so that it's running on your own computer, so" "you can see it run for yourself." "And with that let's get started." "Here's my Windows desktop, and I'm going to start up Octave." "And I'm now in Octave." "And that's my Octave prompt." "Let me first show the elementary operations you can do in Octave." "So you type in 5 + 6." "That gives you the answer of 11." "3 - 2." "5 x 8, 1/2, 2^6 is 64." "So those are the elementary math operations." "You can also do logical operations." "So one equals two." "This evaluates to false." "The percent command here means a comment." "So, one equals two, evaluates to false." "Which is represents by zero." "One not equals to two." "This is true." "So that returns one." "Note that a not equal sign is this tilde equals symbol." "And not bang equals." "Which is what some other programming languages use." "Lets see large operations one and zero use a double ampersand sign to the logical AND." "And that evaluates false." "One or zero is the OR operation." "And that evaluates to true." "And I can XOR one and zero, and that evaluates to one." "This thing over on the left, this Octave 324.x equals 11, this is the default Octave prompt." "It shows the, what, the version in Octave and so on." "If you don't want that prompt, there's a somewhat cryptic command PF quote, greater than, greater than and so on, that you can use to change the prompt." "And I guess this quote a string in the middle." "Your quote, greater than, greater than, space." "That's what I prefer my Octave prompt to look like." "So if I hit enter." "Oops, excuse me." "Like so." "PS1 like so." "Now my Octave prompt has changed to the greater than, greater than sign.Which, you know, looks quite a bit better." "Next let's talk about Octave variables." "I can take the variable" "A and assign it to 3." "And hit enter." "And now A is equal to 3." "You want to assign a variable, but you don't want to print out the result." "If you put a semicolon, the semicolon suppresses the print output." "So to do that, enter, it doesn't print anything." "Whereas A equals 3." "mix it, print it out, where A equals, 3 semicolon doesn't print anything." "I can do string assignment." "B equals high." "Now if I just enter B it prints out the variable B. So B is the string high." "C equals 3 greater than colon 1." "So, now C evaluates the true." 
"If you want to print out or display a variable, here's how you go about it." "Let me set A equals Pi." "And if I want to print" "A I can just type A like so, and it will print it out." "For more complex printing there is also the DISP command which stands for Display." "Display A just prints out A like so." "You can also display strings so:" "DISP, sprintf, two decimals, percent 0.2," "F, comma, A. Like so." "And this will print out the string." "Two decimals, colon, 3.14." "This is kind of an old style C syntax." "For those of you that have programmed C before, this is essentially the syntax you use to print screen." "So the Sprintf generates a string that is less than the 2 decimals, 3.1 plus string." "This percent 0.2 F means substitute A into here, showing the two digits after the decimal points." "And DISP takes the string" "DISP generates it by the Sprintf command." "Sprintf." "The Sprintf command." "And DISP actually displays the string." "And to show you another example, Sprintf six decimals percent 0.6 F comma A." "And, this should print Pi with six decimal places." "Finally, I was saying, a like so, looks like this." "There are useful shortcuts that type type formats long." "It causes strings by default." "Be displayed to a lot more decimal places." "And format short is a command that restores the default of just printing a small number of digits." "Okay, that's how you work with variables." "Now let's look at vectors and matrices." "Let's say I want to assign MAT A to the matrix." "Let me show you an example: 1, 2, semicolon, 3, 4, semicolon, 5, 6." "This generates a three by two matrix A whose first row is 1, 2." "Second row 3, 4." "Third row is 5, 6." "What the semicolon does is essentially say, go to the next row of the matrix." "There are other ways to type this in." "Type A 1, 2 semicolon" "3, 4, semicolon, 5, 6, like so." "And that's another equivalent way of assigning A to be the values of this three by two matrix." "Similarly you can assign vectors." "So V equals 1, 2, 3." "This is actually a row vector." "Or this is a 3 by 1 vector." "Where that is a fat Y vector, excuse me, not, this is" "a 1 by 3 matrix, right." "Nothing by 1." "If I want to assign this to a column vector, what I would do instead is do v 1;2;3." "And this will give me a 3 by 1." "There's a 1 by 3 vector." "So this will be a column vector." "Here's some more useful notation." "V equals 1: 0.1: 2." "What this does is it sets V to the bunch of elements that start from 1." "And increments and steps of 0.1 until you get up to 2." "So if I do this, V is going to be this, you know, row vector." "This is what one by eleven matrix really." "That's 1, 1.1, 1.2, 1.3 and so on until we" "get up to two." "Now, and I can also set V equals one colon six, and that sets V to be these numbers." "1 through 6, okay." "Now here are some other ways to generate matrices." "Ones 2.3 is a command that generates a matrix that is a two by three matrix that is the matrix of all ones." "So if I set that c2 times ones two by three this generates a two by three matrix that is all two's." "You can think of this as a shorter way of writing this and c2,2,2's and you can call them 2,2,2, which would also give you the same result." "Let's say W equals one's, one by three, so this is going to be a row vector or a row of three one's and similarly you can also say w equals zeroes, one by three, and this generates a matrix." "A one by three matrix of all zeros." "Just a couple more ways to generate matrices ." 
"If I do W equals" "Rand one by three, this gives me a one by three matrix of all random numbers." "If I do Rand three by three." "This gives me a three by three matrix of all random numbers drawn from the uniform distribution between zero and one." "So every time I do this, I get a different set of random numbers drawn uniformly between zero and one." "For those of you that know what a Gaussian random variable is or for those of you that know what a normal random variable is, you can also set W equals Rand N, one by three." "And so these are going to be three values drawn from a Gaussian distribution with mean zero and variance or standard deviation equal to one." "And you can set more complex things like W equals minus six, plus the square root ten, times, lets say" "Rand N, one by ten thousand." "And I'm going to put a semicolon at the end because I don't really want this printed out." "This is going to be a what?" "Well, it's going to be a vector of, with a hundred thousand, excuse me, ten thousand elements." "So, well, actually, you know what?" "Let's print it out." "So this will generate a matrix like this." "Right?" "With 10,000 elements." "So that's what W is." "And if I now plot a histogram of W with a hist command, I can now." "And Octave's print hist command, you know, takes a couple seconds to bring this up, but this is a histogram of my random variable for W." "There was minus 6 plus zero ten times this Gaussian random variable." "And I can plot a histogram with more buckets, with more bins, with say, 50 bins." "And this is my histogram of a Gaussian with mean minus 6." "Because I have a minus 6 there plus square root 10 times this." "So the variance of this Gaussian random variable is 10 on the standard deviation is square root of 10, which is about what?" "Three point one." "Finally, one special command for generator matrix, which is the I command." "So I stands for this is maybe a pun on the word identity." "It's server set eye 4." "This is the 4 by 4 identity matrix." "So I equals eye 4." "This gives me a 4 by 4 identity matrix." "And I equals eye 5, eye 6." "That gives me a 6 by 6 identity matrix, i3 is the 3 by 3 identity matrix." "Lastly, to wrap up this video, there's one more useful command." "Which is the help command." "So you can type help i and this brings up the help function for the identity matrix." "Hit Q to quit." "And you can also type help rand." "Brings up documentation for the rand or the random number generation function." "Or even help help, which shows you, you know help on the help function." "So, those are the basic operations in Octave." "And with this you should be able to generate a few matrices, multiply, add things." "And use the basic operations in Octave." "In the next video, I'd like to start talking about more sophisticated commands and how to use data around and start to process data in Octave." "In this second tutorial video on" "Octave, I'd like to start to tell you how to move data around in Octave." "So, if you have data for a machine learning problem, how do you load that data in Octave?" "How do you put it into matrix?" "How do you manipulate these matrices?" "How do you save the results?" "How do you move data around and operate with data?" "Here's my Octave window as before, picking up from where we left off in the last video." "If I type A, that's the matrix so we generate it, right, with this command equals one, two, three, four, five, six, and this is a three by two matrix." 
"The size command in Octave lets you, tells you what is the size of a matrix." "So size A returns three, two." "It turns out that this size command itself is actually returning a one by two matrix." "So you can actually set SZ equals size of A and SZ is now a one by two matrix where the first element of this is three, and the second element of this is two." "So, if you just type size of SZ." "Does SZ is a one by two matrix whose two elements contain the dimensions of the matrix A. You can also type size A one to give you back the first dimension of A, size of the first dimension of A." "So that's the number of rows and size A two to give you back two, which is the number of columns in the matrix A. If you have a vector V, so let's say V equals one, two," "three, four, and you type length V. What this does is it gives you the size of the longest dimension." "So you can also type length A and because" "A is a three by two matrix, the longer dimension is of size three, so this should print out three." "But usually we apply length only to vectors." "So you know, length one, two, three, four, five, rather than apply length to matrices because that's a little more confusing." "Now, let's look at how the low data and find data on the file system." "When we start an Octave we're usually, we're often in a path that is, you know, the location of where the Octave location is." "So the PWD command shows the current directory, or the current path that Octave is in." "So right now we're in this maybe somewhat off scale directory." "The CD command stands for change directory, so I can go to C:/Users/Ang/Desktop, and now I'm in, you know, in my Desktop" "and if I type OS," "OS is, it comes from a Unix or a Linux command." "But, OS will list the directories on my desktop and so these are the files that are on my Desktop right now." "In fact, on my desktop are two files:" "Features X and" "Price Y that's maybe come from a machine learning problem I want to solve." "So, here's my desktop." "Here's Features X, and" "Features X is this window, excuse me, is this file with two columns of data." "This is actually my housing prices data." "So I think, you know, I think I have forty-seven rows in this data set." "And so the first house has size two hundred four square feet, has three bedrooms; second house has sixteen hundred square feet, has three bedrooms; and so on." "And Price Y is this file that has the prices of the data in my training set." "So, Features X and" "Price Y are just text files with my data." "How do I load this data into Octave?" "Well, I just type the command load Features X dot dat and if I do that, I load the Features X and can load Price Y dot dat." "And by the way, there are multiple ways to do this." "This command if you put" "Features X dot dat on that in strings and load it like so." "This is a typo there." "This is an equivalent command." "So you can, this way I'm just putting the file name of the string in the founding in a string and in an" "Octave use single quotes to represent strings, like so." "So that's a string, and we can load the file whose name is given by that string." "Now the WHO command now shows me what variables I have in my Octave workspace." "So Who shows me whether the variables that Octave has in memory currently." "Features X and Price Y are among them, as well as the variables that, you know, we created earlier in this session." "So I can type Features X to display features X. And there's my data." 
"And I can type size features" "X and that's my 47 by two matrix." "And some of these size, press" "Y, that gives me my 47 by one vector." "This is a 47 dimensional vector." "This is all common vector that has all the prices Y in my training set." "Now the who function shows you one of the variables that, in the current workspace." "There's also the who S variable that gives you the detailed view." "And so this also, with an S at the end this also lists my variables except that it now lists the sizes as well." "So A is a three by two matrix and features" "X as a 47 by 2 matrix." "Price Y is a 47 by one matrix." "Meaning this is just a vector." "And it shows, you know, how many bytes of memory it's taking up." "As well as what type of data this is." "Double means double position floating point so that just means that these are real values, the floating point numbers." "Now if you want to get rid of a variable you can use the clear command." "So clear features X and type whose again." "You notice that the features X variable has now disappeared." "And how do we save data?" "Let's see." "Let's take the variable V and say that it's a price Y 1 colon 10." "This sets V to be the first 10 elements of" "vector Y. So let's type who or whose." "Whereas Y was a 47 by 1 vector." "V is now 10 by 1." "B equals price Y, one column ten that sets it to the just the first ten elements of Y. Let's say" "I wanna save this to date to disc the command save, hello.mat" "V. This will save the variable V into a file called hello.mat." "So let's do that." "And now a file has appeared on my Desktop, you know, called Hello.mat." "I happen to have MATLAB installed in this window, which is why, you know, this icon looks like this because Windows is recognized as it's a MATLAB file,but don't worry about it if this file" "looks like it has a different icon on your machine and let's say I clear all my variables." "So, if you type clear without anything then this actually deletes all of the variables in your workspace." "So there's now nothing left in the workspace." "And if I load hello.mat," "I can now load back my variable v, which is the data that I previously saved into the hello.mat file." "So, hello.mat, what we did just now to save hello.mat to view, this save the data in a binary format, a somewhat more compressed binary format." "So if v is a lot of data, this, you know, will be somewhat more compressing." "Will take off less the space." "If you want to save your data in a human readable format then you type save hello.text the variable v and then -ascii." "So, this will save it as a text or as ascii format of text." "And now, once I've done that, I have this file." "Hello.text has just appeared on my desktop, and if I open this up, we see that this is a text file with my data saved away." "So that's how you load and save data." "Now let's talk a bit about how to manipulate data." "Let's set a equals to that matrix of the game so is my three by two matrix." "So as indexing." "So type A 3, 2." "This indexes into the 3, 2 elements of the matrix A. So, this is what, you know, in normally, we will write this as a subscript 3, 2 or A subscript," "you know, 3, 2 and so that's the element and third row and second column of A which is the element of six." "I can also type A to comma colon to fetch everything in the second row." "So, the colon means every element along that row or column." "So, a of 2 comma colon is this second row of a." "Right." 
"And similarly, if I do a colon comma 2 then this means get everything in the second column of A. So, this gives me 2 4 6." "Right this means of" "A. everything, second column." "So, this is my second column A, which is 2 4 6." "Now, you can also use somewhat most of the sophisticated index in the operations." "So So, we just click each of an example." "You do this maybe less often, but let me do this A 1 3 comma colon." "This means get all of the elements of A who's first indexes one or three." "This means I get everything from the first and third rows of" "A and from all columns." "So, this was the matrix A and so A" "1 3 comma colon means get everything from the first row and from the second row and from the third row and the colon means, you know, one both of first and the second columns and so this gives me this 1 2 5 6." "Although, you use the source of more subscript index" "operations maybe somewhat less often." "To show you what else we can do." "Here's the A matrix and this source A colon, to give me the second column." "You can also use this to do assignments." "So I can take the second column of" "A and assign that to 10, 11, 12, and if I do that I'm now, you know, taking the second column of a and I'm assigning this column vector 10, 11, 12 to it." "So, now a is this matrix that's 1, 3, 5." "And the second column has been replaced by 10, 11, 12." "And here's another operation." "Let's set A to be equal to A comma 100, 101," "102 like so and what this will do is depend another column vector" "to the right." "So, now, oops." "I think I made a little mistake." "Should have put semicolons there and now A is equals to this." "Okay?" "I hope that makes sense." "So this 100, 101, 102." "This is a column vector and what we did was we set A, take" "A and set it to the original definition." "And then we put that column vector to the right and so, we ended up taking the matrix A and--which was" "these six elements on the left." "So we took matrix" "A and we appended another column vector to the right;" "which is now why A is a three by three matrix that looks like that." "And finally, one neat trick that I sometimes use if you do just a and just a colon like so." "This is a somewhat special case syntax." "What this means is that put all elements with A into a single column vector and this gives me a 9 by 1 vector." "They adjust the other ones are combined together." "Just a couple more examples." "Let's see." "Let's say I set A to be equal to 123456, okay?" "And let's say" "I set a B to B equal to 11, 12, 13, 14, 15, 16." "I can create a new matrix C as A B." "This just means my" "Matrix A. Here's my Matrix" "B and I've set C to be equal to AB." "What I'm doing is I'm taking these two matrices and just concatenating onto each other." "So the left, matrix A on the left." "And I have the matrix B on the right." "And that's how I formed this matrix C by putting them together." "I can also do C equals" "A semicolon B. The semi colon notation means that" "I go put the next thing at the bottom." "So, I'll do is a equals semicolon B. It also puts the matrices A and B together except that it now puts them on top of each other." "so now I have A on top and B at the bottom and C here is now in 6 by 2 matrix." "So, just say the semicolon thing usually means, you know, go to the next line." "So, C is comprised by a and then go to the bottom of that and then put b in the bottom and by the way, this A B is the same as A, B and so you know, either of these gives you the same result." 
"So, with that, hopefully you now know how to construct matrices and hopefully starts to show you some of the commands that you use to quickly put together matrices and take matrices and, you know, slam them together to form" "bigger matrices, and with just a few lines of code, Octave is very convenient in terms of how quickly we can assemble complex matrices and move data around." "So that's it for moving data around." "In the next video we'll start to talk about how to actually do complex computations on this, on our data." "So, hopefully that gives you a sense of how, with just a few commands, you can very quickly move data around in Octave." "You know, you load and save vectors and matrices, load and save data, put together matrices to create bigger matrices, index into or select specific elements on the matrices." "I know I went through a lot of commands, so I think the best thing for you to do is afterward, to look at the transcript of the things I was typing." "You know, look at it." "Look at the coursework site and download the transcript of the session from there and look through the transcript and type some of those commands into Octave yourself" "and start to play with these commands and get it to work." "And obviously, you know, there's no point at all to try to memorize all these commands." "It's just, but what you should do is, hopefully from this video you have gotten a sense of the sorts of things you can do." "So that when later on when you are trying to program a learning algorithms yourself, if you are trying to find a specific command that maybe you think" "Octave can do because you think you might have seen it here, you should refer to the transcript of the session and look through that in order to find the commands you wanna use." "So, that's it for moving data around and in the next video what I'd like to do is start to tell you how to actually do complex computations on our data, and how to compute on the data, and" "actually start to implement learning algorithms." "Now that you know how to load and save data in Octave, put your data into matrices and so on." "In this video I'd like to show you how to do computational operations on data and later on we'll be using this sorts of computation operations to implement our learning algorithms." "Let's get started." "Here's my Octave window." "Let me just quickly initialize some variables to use for examples and set A to be a 3 by 2 matrix." "and set B to a 3 by 2 matrix and let's set C to a" "2 by 2 matrix, like so." "Now, let's say I want to multiply 2 of my matrices." "So, let's say I wanna compute AxC." "I just type AxC." "So, it's a 3 by 2 matrix times a 2 by 2 matrix." "This gives me this 3 by 2 matrix." "You can also do elements wise operations and do A.xB" "and what this would do is they'll take each elements of A and multiply it by the corresponding elements of B." "So, that's A, that's B, that's A.xB." "So, for example, the first element gives 1 times 11 which gives 11." "The second element gives 2 x 12 which gives 24 and so on." "So it is the element wise multiplication of two matrices, and in general the P rand tends to, it's usually used, to denote element wise operations in octave." "So, here's a matrix" "A and I'll do A dot carry 2." "This gives me the multi, the element wise squaring of" "A, so 1 squared is 1, 2 squared is 4 and so on." "Let's set V to a vector, we'll set V as 123 as a column vector." "You can also do 1." 
"over V to do the element wise reciprocal of" "V so this gives me one over one, one over two and one over three." "This works too for matrices so one dot over A, gives me that element wise inverse of" "A. and once again the P radians gives use a clue that this is an elements wise operation." "To also do things like log" "V This is an element wise logarithm of, the" "V, E to the" "V, is the base E exponentiation of these elements of this is E, this is E squared EQ, this is" "V. And I can also do apps V to take the element wise absolute value of V. So here," "V was all positive, abs, say minus 1 to minus 3, the element wise Absolute value gives me back these non-negative values and negative" "V gives me the minus of V. This is the same as -1xV but usually you just write negative V and so that negative 1xV and what else can you do?" "Here's another neat trick." "So Let's see." "Let's say I want to take V and increment each of these elements by 1." "Well, one way to do it is by constructing a" "3 by 1 vector this all ones and adding that to V. So, they do that." "This increments V by for 123 to 234." "The way I did that was length of V, is three." "So ones, length of" "V by one, this is ones of three by one." "So that's ones, three by one." "On the right and what I did was B plus ones," "V by one, which is adding this vector of all ones to B. And so this increments" "V by one." "And you, another simpler way to do that is to type V+ one, right?" "So that's V and" "V+ one also means to add one element wise to each of my elements of V." "Now, let's talk about more operations." "So, here's my matrix A. If you want to write A transpose." "The way to do that is to write A prime." "That's the apostrophe symbol." "It's the left quote." "So, on your keyboard you probably have a left quote and a right quote." "So this is a at the standard quotation mark is a, what to say, a transpose to excuse me the, you know, a transpose of my major and of course a transpose if I transpose that again then I should" "get back my matrix A. Some more useful functions." "Let's say locate A is 1 15 to 0.5." "So, it's a, you know, 1 by 4 matrix." "Let's say set val equals max of A. This returns the maximum value of A, which in this case is 15 and" "I can do val ind max" "A. And this returns val of int which are the maximum value of A which is 15 as was the index." "So the elements number two of A that 15." "So, in is my index into this." "Just as a warning: if you do max A where A is a matrix." "What this does is this actually does the column wise maximum," "but say a little bit more about this in a second." "So, using this example of the variable lowercase A. If I do A less than three." "This does the element wise operation." "Element wise comparison." "So, the first element" "Of A is less than three equals to one." "Second elements of A is not less than three, so this value is zero, because it is also." "The third and fourth numbers of" "A are the lesson," "I meant less than three, third and fourth elements are less than three." "So this is one, one, so this is just the element wide comparison of all four element variable lower case three and it returns true or false depending on whether or not it's less than three." "Now, if I do find" "A less than three, this would tell me which are the elements of A that the variable A of less than three and in this case the 1st, 3rd and 4th elements are lesson three." "For my next example Oh, let me set eight be code to magic three." "The magic function returns." 
"Let's type help magic." "The magic function returns." "Functions called Returns this matrices called magic squares." "They have this, you know, mathematical property that all of their rows and columns and diagonals sum up to the same thing." "So, you know, it's not actually useful for machine learning as far as I know, but I'm just using this as a convenient way, you know, to generate a 3 by 3 matrix and this magic square screen." "We have the power of 3 at each row, each column and the diagonals all add up to the same thing, so it's kind of a mathematical construct." "I use magic, I use this magic function only when I'm doing demos, or when I'm teaching Octave like this and" "I don't actually use it for any, you know, useful machine learning application." "But, let's see, if I type RC equals find A greater than or equals 7." "This finds all the elements of a that are greater than and equals to 7 and so, RC sense a row and column." "So, the 11 element is greater than 7." "The three two elements is greater than 7 and the two 3 elements is greater than 7." "So let's see, the two, three element for example, is A two, three." "Is seven, is this element out here, and that is indeed greater than or equal seven." "By the way, I actually don't even memorize myself what these find functions do in the all these things do myself and whenever I use a find function, sometimes I forget myself exactly what does, and you know, type help find to look up the document." "Okay, just two more things, if it's okay, to show you." "One is the sum function." "So here's my A and" "I type sum A. This adds up all the elements of A." "And if I want to multiply them together, I type prod A." "Prod sense of product, and it returns the products of these four elements of A." "Floor A rounds down, these elements of A, so zero" "O point five gets rounded down to zero." "And cell, or ceiling A, gets rounded up, so zero point five, rounded up to the nearest integer, so zero point five gets rounded up to one." "You can also." "Let's see." "Let me type rand 3." "This generally sets a 3 by 3 matrix." "If I type max randd 3, rand 3." "What this does is it takes the element wise maximum of" "2 random 3 by 3 matrices." "So, you'll notice all these numbers tend to be a bit on the large side because each of these is actually the max of a randomly, of element Y's max of two randomly generated matrices." "This is my magic number." "This was my magic square 3x3a." "Let's say I type max A and then this will be it." "Open, close, square brackets comma 1." "What this does is this takes the column wise maximum." "So, the maximum of the first column is eight, max of the second column is nine, the max of the third column is seven." "This 1 means to take the max along the first dimension of" "A. In contrast, if" "I were to type max a, this funny notation 2 then this takes the per row maximum." "So, the maximum for the first row is 8, max of second row is 7, max of the third row is 9 and so this allows you to take maxes." "You know, per row or per column." "And if you want to, and remember it defaults to column mark wise elements on this, so if you want to find the maximum element in the entire matrix A, you can type max of max of A, like so, which is nine." "Or you can turn A into a vector and type max of A colon, like so, this treats this as a vector and takes the max element of vector." "Finally, let's set A to be a nine by nine magic square." 
"So remember, the magic square has this property that every column in every row sums the same thing and also the diagonals." "So here is 9X9 magic square." "So let me just sum A one so this does a per column sum." "And so I'm going to take each column of A and add them up and this, you know, lets us verify that indeed for 9 by 9 magic square." "Every column adds up to 369 as of the same thing." "Now, let's do the row wise sum." "So, the sum A comma 2 and this sums up each row of A and each row of A also sums up to 369." "Now let's sum the diagonal elements of A and make sure that they, that that also sums up to the same thing." "So what I'm going to do is, construct a nine by nine identity matrix, that's" "I9, and let me take A and construct, multiply" "A elements wise." "So here's my matrix of A." "I'm gonna do A.xI9 and what this will do is take the element wise product of these" "2 matrices, and so this should wipe out everything except for the diagonal entries and now" "I'm going to sum, sum of" "A of that and this gives me the sum of these diagonal elements, and indeed it is 369." "You can sum up the other diagonal as well." "So this top left to bottom right." "You can sum up the opposite diagonal from bottom left to top right." "The sum, the commands for this is somewhat more cryptic." "You don't really need to know this." "I'm just showing you just in case any of you are curious, but let's see." "Flip UD stands for flip up/down." "If you do that, that turns out to sum up the elements in the opposites of, the other diagonal that also sums up to 369." "Here, let me show you, whereas i9 is this matrix, flip up/down of i9, you know, takes the identity matrix and flips it vertically so you end up with, excuse me, flip UD, end up" "with ones on this opposite diagonal as well." "Just one last command and then that's it, and then that will be it for this video." "Let's say A to be the 3x3 magic square again." "If you want to invert the matrix, you type P inv A, this is typically called a pseudo inference, but it doesn't matter." "Think of it as basically the inverse of A and that's the inverse of A and second set, you know, 10 equals p" "of A and of temp times" "A. This is indeed the identity matrix with essentially ones on the diagonals and zeros on the off-diagonals, up to a numerical round-off." "So, that's it for how to do different computational operations on the data in matrices." "And after running a learning algorithm, often one of the most useful things is to be able to look at your results, or to plot, or visualize your result." "And in the next video I'm going to very quickly show you how, again, with one or two lines of code using Octave you can quickly visualize your data, or plot your data and use that to better" "understand, you know, what your learning algorithms are doing." "When developing learning algorithms, very often a few simple plots can give you a better sense of what the algorithm is doing and just sanity check that everything is going okay and the algorithms doing what is supposed to." "For example, in an earlier video, I talked about how plotting the cost function J of theta can help you make sure that gradient descent is converging." "Often, plus of the data or of all the learning algorithm outputs will also give you ideas for how to improve your learning algorithm." 
"Fortunately, Octave has very simple tools to generate lots of different plots and when" "I use learning algorithms, I find that plotting the data, plotting the learning algorithm and so on are often an important part of how I get ideas for improving the algorithms and in this video," "I'd like to show you some of these Octave tools for plotting and visualizing your data." "Here's my Octave window." "Let's quickly generate some data for us to plot." "So I'm going to set T to be equal to, you know, this array of numbers." "Here's T, set of numbers going from 0 up to .98." "Let's set y1 equals sine of 2 pie 40 and" "if I want to plot the sine function, it's very easy." "I just type plot T comma Y 1 and hit enter." "And up comes this plot where the horizontal axis is the T variable and the vertical axis is y1, which is the sine you saw in the function that we just computed." "Let's set y2 to be equal to the cosine of two pi, four T, like so." "And if I plot" "T comma y2, what octave will I do is I'll take my sine plot and it will replace with this cosine function and now, you know, cosine of xi of 1." "Now, what if I want to have both the sine and the cosine plots on top of each other?" "What I'm going to do is I'm going to type plot t,y1." "So here's my sine function, and then" "I'm going to use the function hold on." "And what hold does it closes octaves to now figures on top of the old one and let me now plot t y2." "I'm going to plot the cosine function in a different color." "So, let me put there r in quotation marks there and instead of replacing the current figure, I'll plot the cosine function on top and the r indicates the what is an event color." "And here additional commands - x label times, to label the X axis, or the horizontal axis." "And Y label values A, to label the vertical axis value," "and I can also" "label my two lines with this command:" "legend sine cosine and this puts this legend up on the upper right showing what the 2 lines are, and finally title" "my plot is the title at the top of this figure." "Lastly, if you want to save this figure, you type print -dpng" "myplot.png. So PNG is a graphics file format, and if you do this it will let you save this as a file." "If I do that, let me actually change directory to, let's see, like that, and then I will print that out." "So this will take a while depending on how your Octave configuration is setup, may take a few seconds, but change directory to my desktop and Octave is now taking a few seconds to save this." "If I now go to my desktop, Let's hide these windows." "Here's myplot.png which Octave has saved, and you know, there's the figure saved as the PNG file." "Octave can save thousand other formats as well." "So, you can type help plot, if you want to see the other file formats, rather than" "PNG, that you can save figures in." "And lastly, if you want to get rid of the plot, the close command causes the figure to go away." "As I figure if I type close, that figure just disappeared from my desktop." "Octave also lets you specify a figure and numbers." "You type figure 1 plots t, y1." "That starts up first figure, and that plots t, y1." "And then if you want a second figure, you specify a different figure number." "So figure two, plot t, y2 like so, and now on my desktop, I actually have 2 figures." "So, figure 1 and figure 2 thus 1 plotting the sine function, 1 plotting the cosine function." "Here's one other neat command that" "I often use, which is the subplot command." "So, we're going to use subplot 1 2 1." 
"What it does it sub-divides the plot into a one-by-two grid with the first 2 parameters are, and it starts to access the first element." "That's what the final parameter 1 is, right?" "So, divide my figure into a one by two grid, and I want to access the first element right now." "And so, if I type that in, this product, this figure, is on the left." "And if I plot t, y1, it now fills up this first element." "And if I I'll do subplot 122." "I'm going to start to access the second element and plot t, y2." "Well, throw in y2 in the right hand side, or in the second element." "And last command, you can also change the axis scales and change axis these to 1.51" "minus 1 1 and this sets the x range and y range for the figure on the right, and concretely, it assess the horizontal" "major values in the figure on the right to make sure 0.5 to 1, and the vertical axis values use the range from minus one to one." "And, you know, you don't need to memorize all these commands." "If you ever need to change the access or you need to know is that, you know, there's an access command and you can already get the details from the usual octave help command." "Finally, just a couple last commands CLF clear is a figure and here's one unique trait." "Let's set a to be equal to a 5 by 5 magic squares a." "So, a is now this 5 by 5 matrix does a neat trick that I sometimes use to visualize the matrix, which is" "I can use image sc of a what this will do is plot a five by five matrix, a five by five grid of color." "where the different colors correspond to the different values in the A matrix." "So concretely, I can also do color bar." "Let me use a more sophisticated command, and image sc" "A color bar color map gray." "This is actually running three commands at a time." "I'm running image sc then running color bar, then running color map gray." "And what this does, is it sets a color map, so a gray color map, and on the right it also puts in this color bar." "And so this color bar shows what the different shades of color correspond to." "Concretely, the upper left element of the A matrix is 17, and so that corresponds to kind of a mint shade of gray." "Whereas in contrast the second element of A--sort of the" "1 2 element of A--is 24." "Right, so it's A 1 2 is 24." "So that corresponds to this square out here, which is nearly a shade of white." "And the small value, say" "A--what is that?" "A" "4 5, you know, is a value 3 over here that corresponds-- you can see on my color bar that it corresponds to a much darker shade in this image." "So here's another example," "I can plot a larger, you know, here's a magic 15 that gives you a 15 by 15 magic square and this gives me a plot of what my 15 by 15 magic squares values looks like." "And finally to wrap up this video, what you've seen me do here is use comma chaining of function calls." "Here's how you actually do this." "If I type A equals 1, B equals 2, C equals" "3, and hit Enter, then this is actually carrying out three commands at the same time." "Or really carrying out three commands, one after another," "and it prints out all three results." "And this is a lot like" "A equals 1, B equals 2, C equals 3, except that if I use semicolons instead of a comma, it doesn't print out anything." "So, this, you know, this thing here we call comma chaining of commands, or comma chaining of function calls." "And, it's just another convenient way in Octave to put multiple commands like image sc color bar, colon map to put multi-commands on the same line." "So, that's it." 
"You now know how to plot different figures and octave, and in next video the next main piece that I want to tell you about is how to write control statements like if, while, for statements and octave as well as hard to define and use functions" "In this next set of videos, I would like to tell you about recommender systems." "There are two reasons, I had two motivations for why I wanted to talk about recommender systems." "The first is just that it is an important application of machine learning." "Over the last few years, occasionally I visit different, you know, technology companies here in Silicon Valley and I often talk to people working on machine learning applications there and so I've asked people what are the most" "important applications of machine learning or what are the machine learning applications that you would most like to get an improvement in the performance of." "And one of the most frequent answers I heard was that there are many groups out in Silicon" "Valley now, trying to build better recommender systems." "So, if you think about what the websites are like Amazon, or what Netflix or what eBay, or what iTunes Genius, made by Apple does, there are many websites or systems that try to recommend new products to use." "So, Amazon recommends new books to you, Netflix try to recommend new movies to you, and so on." "And these sorts of recommender systems, that look at what books you may have purchased in the past, or what movies you have rated in the past, but these are the systems that are responsible" "for today, a substantial fraction of" "Amazon's revenue and for a company like Netflix, the recommendations" "that they make to the users is also responsible for a substantial fraction of the movies watched by their users." "And so an improvement in performance of a recommender system can have a substantial and immediate impact on the bottom line of many of these companies." "Recommender systems is kind of a funny problem, within academic machine learning so that we could go to an academic machine learning conference," "the problem of recommender systems, actually receives relatively little attention, or at least it's sort of a smaller fraction of what goes on within Academia." "But if you look at what's happening, many technology companies, the ability to build these systems seems to be a high priority for many companies." "And that's one of the reasons why I want to talk about them in this class." "The second reason that I want to talk about recommender systems is that as we approach the last few sets of videos" "of this class I wanted to talk about a few of the big ideas in machine learning and share with you, you know, some of the big ideas in machine learning." "And we've already seen in this class that features are important for machine learning, the features you choose will have a big effect on the performance of your learning algorithm." "So there's this big idea in machine learning, which is that for some problems, maybe not all problems, but some problems, there are algorithms that can try to automatically learn a good set of features for you." "So rather than trying to hand design, or hand code the features, which is mostly what we've been doing so far, there are a few settings where you might be able to have an algorithm, just to learn what feature to" "use, and the recommender systems is just one example of that sort of setting." 
"There are many others, but engraved through recommender systems, will be able to go a little bit into this idea of learning the features and you'll be able to see at least one example of this, I think, big idea in machine learning as well." "So, without further ado, let's get started, and talk about the recommender system problem formulation." "As my running example, I'm going to use the modern problem of predicting movie ratings." "So, here's a problem." "Imagine that you're a website or a company that sells or rents out movies, or what have you." "And so, you know, Amazon, and Netflix, and" "I think iTunes are all examples of companies that do this, and let's say you let your users rate different movies, using a 1 to 5 star rating." "So, users may, you know, something one, two, three, four or five stars." "In order to make this example just a little bit nicer, I'm going to allow 0 to" "5 stars as well, because that just makes some of the math come out just nicer." "Although most of these websites use the 1 to 5 star scale." "So here, I have 5 movies." "You know, Love That" "Lasts, Romance Forever, Cute Puppies of" "Love, Nonstop Car Chases, and Swords vs. Karate." "And we have 4 users, which, calling, you know, Alice, Bob, Carol, and Dave, with initials A, B," "C, and D, we'll call them users 1, 2, 3, and 4." "So, let's say Alice really likes Love That Lasts and rates that 5 stars, likes Romance" "Forever, rates it 5 stars." "She did not watch Cute Puppies of Love, and did rate it, so we don't have a rating for that, and Alice really did not like Nonstop Car Chases or" "Swords vs. Karate." "And a different user" "Bob, user two, maybe rated a different set of movies, maybe she likes to Love at Last, did not to watch Romance Forever, just have a rating of 4, a 0, a 0, and maybe our 3rd user," "rates this 0, did not watch that one, 0, 5, 5, and, you know, let's just fill in some of the numbers." "And so just to introduce a bit of notation, this notation that we'll be using throughout, I'm going to use NU to denote the number of users." "So in this example, NU will be equal to 4." "So the u-subscript stands for users and Nm, going to use to denote the number of movies, so here I have five movies so Nm equals equals 5." "And you know for this example, I have for this example, I have loosely" "3 maybe romantic or romantic comedy movies and 2 action movies and you know, if you look at this small example, it looks like Alice and Bob are giving high ratings to these romantic comedies or movies about love, and giving very" "low ratings about the action movies, and for Carol and Dave, it's the opposite, right?" "Carol and Dave, users three and four, really like the action movies and give them high ratings, but don't like the romance and love- type movies as much." "Specifically, in the recommender system problem, we are given the following data." "Our data comprises the following:" "we have these values r(i, j), and r(i, j) is 1 if user" "J has rated movie I." "So our users rate only some of the movies, and so, you know, we don't have ratings for those movies." "And whenever r(i, j) is equal to 1, whenever user j has rated movie i, we also get this number y(i, j), which is the rating given by user j to movie i." "And so, y(i, j) would be a number from zero to five, depending on the star rating, zero to five stars that user gave that particular movie." 
"So, the recommender system problem is given this data that has give these r(i, j)'s and the y(i, j)'s to look through the data and look at all the movie ratings that" "are missing and to try to predict what these values of the question marks should be." "In the particular example, I have a very small number of movies and a very small number of users and so most users have rated most movies but in the realistic settings your users each of your users may have rated" "only a minuscule fraction of your movies but looking at this data, you know, if Alice and Bob both like the romantic movies maybe we think that Alice would have given this a five." "Maybe we think Bob would have given this a 4.5 or some high value, as we think maybe Carol and Dave were doing these very low ratings." "And Dave, well, if Dave really likes action movies, maybe he would have given" "Swords and Karate a 4 rating or maybe a 5 rating, okay?" "And so, our job in developing a recommender system is to come up with a learning algorithm that can automatically go fill in these missing values for us so that we can look at, say, the movies that the user has" "not yet watched, and recommend new movies to that user to watch." "You try to predict what else might be interesting to a user." "So that's the formalism of the recommender system problem." "In the next video we'll start to develop a learning algorithm to address this problem." "In this video, I'd like to tell you how to write control statements for your" "Octave programs, so things like "for", "while" and "if" statements and also how to define and use functions." "Here's my Octave window." "Let me first show you how to use a "for" loop." "I'm going to start by setting v to be a 10 by" "1 vector 0." "Now, here's I write a "for" loop for I equals 1 to 10." "That's for I equals Y colon 10." "And let's see, I'm going to set V of I equals two to the power of I, and finally" "end." "The white space does not matter, so I am putting the spaces just to make it look nicely indented, but you know spacing doesn't matter." "But if I do this, then the result is that V gets set to, you know, two to the power one, two to the power two, and so on." "So this is syntax for I equals one colon 10 that makes I loop through the values one through 10." "And by the way, you can also do this by setting your indices equals one to" "10, and so the indices in the array from one to 10." "You can also write for I equals indices." "And this is actually the same as if I equals one to 10." "You can do, you know, display" "I and this would do the same thing." "So, that is a "for" loop, if you are familiar with "break"" "and "continue", there's "break" and" ""continue" statements, you can also use those inside loops in octave, but first let me show you how a while loop works." "So, here's my vector" "V. Let's write the while loop." "I equals 1, while I is less than or equal to" "5, let's set" "V I equals one hundred and increment I by one, end." "So this says what?" "I starts off equal to one and then I'm going to set V I equals one hundred and increment I by one until I is, you know, greater than five." "And as a result of that, whereas previously V was this powers of two vector." "I've now taken the first five elements of my vector and overwritten them with this value one hundred." "So that's a syntax for a while loop." "Let's do another example." "Y equals one while true and here" "I wanted to show you how to use a break statement." 
"Let's say V I equals 999 and I equals i+1" "if i equals 6 break and end." "And this is also our first use of an if statement, so" "I hope the logic of this makes sense." "Since I equals one and, you know, increment loop." "While repeatedly set V I equals 1 and increment i by 1, and then when 1 i gets up to 6, do a break which breaks here although the while do and so, the effective is should be to take" "the first five elements of this vector V and set them to 999." "And yes, indeed, we're taking" "V and overwritten the first five elements with 999." "So, this is the syntax for "if" statements, and for "while" statement, and notice the end." "We have two ends here." "This ends here ends the if statement and the second end here ends the while statement." "Now let me show you the more general syntax for how to use an if-else statement." "So, let's see, V 1 is equal to 999, let's" "type V1 equals to 2 for this example." "So, let me type if V 1 equals 1 display the value as one." "Here's how you write an else statement, or rather here's an else if:" "V 1 equals 2." "This is, if in case that's true in our example, display the value as 2, else" "display, the value is not one or two." "Okay, so that's a if-else if-else statement it ends." "And of course, here we've just set v 1 equals 2, so hopefully, yup, displays that the value is 2." "And finally, I don't think I talked about this earlier, but if you ever need to exit Octave, you can type the exit command and you hit enter that will cause Octave to quit or the 'q'--quits" "command also works." "Finally, let's talk about functions and how to define them and how to use them." "Here's my desktop, and I have predefined a file or pre-saved on my desktop a file called "squarethisnumber.m"." "This is how you define functions in Octave." "You create a file called, you know, with your function name and then ending in .m, and when Octave finds this file, it knows that this where it should look for the definition of the function "squarethisnumber.m"." "Let's open up this file." "Notice that I'm using the" "Microsoft program Wordpad to open up this file." "I just want to encourage you, if your using Microsoft Windows, to use Wordpad rather than" "Notepad to open up these files, if you have a different text editor that's fine too, but notepad sometimes messes up the spacing." "If you only have Notepad, that should work too, that could work too, but if you have Wordpad as well, I would rather use that or some other text editor, if you have a different text editor for editing your functions." "So, here's how you define the function in Octave." "Let me just zoom in a little bit." "And this file has just three lines in it." "The first line says function Y equals square root number of X, this tells" "Octave that I'm gonna return the value Y, I'm gonna return one value and that the value is going to be saved in the variable Y and moreover, it tells Octave that this function has one argument," "which is the argument X, and the way the function body is defined, if Y equals X squared." "So, let's try to call this function "square", this number" "5, and this actually isn't going to work, and" "Octave says square this number it's undefined." "That's because Octave doesn't know where to find this file." "So as usual, let's use PWD, or not in my directory, so let's see this c:\users\ang\desktop." "That's where my desktop is." "Oops, a little typo there." "Users ANG desktop and if I now type square root number 5, it returns the" "answer 25." 
"As kind of an advanced feature, this is only for those of you that know what the term search path means." "But so if you want to modify the Octave search path and you could, you just think of this next part as advanced" "or optional material." "Only for those who are either familiar with the concepts of search paths and permit languages," "but you can use the term addpath, safety colon," "slash users/ANG/desktop to add that directory to the" "Octave search path so that even if you know, go to some other directory I can still, Octave still knows to look in the users ANG desktop directory for functions so that even though I'm in a different directory now, it still" "knows where to find the square this number function." "Okay?" "But if you're not familiar with the concept of search path, don't worry about it." "Just make sure as you use the CD command to go to the directory of your function before you run it and that actually works just fine." "One concept that Octave has that many other programming languages don't is that it can also let you define functions that return multiple values or multiple arguments." "So here's an example of that." "Define the function called square and cube this number X and what this says is this function returns 2 values, y1 and y2." "When I set down, this follows, y1 is squared, y2 is execute." "And what this does is this really returns 2 numbers." "So, some of you depending on what programming language you use, if you're familiar with, you know, CC++ your offer." "Often, we think of the function as return in just one value." "But just so the syntax in Octave that should return multiple values." "Now back in the Octave window." "If" "I type, you know, a, b equals square and cube this number 5 then a is now equal to" "25 and b is equal to the cube of 5 equal to 125." "So, this is often convenient if you needed to define a function that returns multiple values." "Finally, I'm going to show you just one more sophisticated example of a function." "Let's say I have a data set that looks like this, with data points at 1, 1, 2, 2, 3, 3." "And what I'd like to do is to define an octave function to compute the cost function J of theta for different values of theta." "First let's put the data into octave." "So I set my design matrix to be 1,1,1,2,1,3." "So, this is my design matrix x with x0, the first column being the said term and the second term being you know, my the x-values of my three training examples." "And let me set y to be 1-2-3 as follows, which were the y axis values." "So let's say theta is equal to 0 semicolon 1." "Here at my desktop, I've predefined does cost function j and if I bring up the definition of that function it looks as follows." "So function j equals cost function j equals x y theta, some commons, specifying the inputs and then vary few steps set m to be the number trading examples thus the number of rows in x." "Compute the predictions, predictions equals x times theta and so this is a common that's wrapped around, so this is probably the preceding comment line." "Computer script errors by, you know, taking the difference between your predictions and the y values and taking the element of y squaring and then finally computing the cost function J. And Octave knows that J is a value I" "want to return because J appeared here in the function definition." "Feel free by the way to pause this video if you want to look at this function definition for longer and kind of make sure that you understand the different steps." 
"But when I run it in" "Octave, I run j equals cost function j x y theta." "It computes." "Oops, made a typo there." "It should have been capital X. It computes J equals 0 because" "if my data set was, you know, 123, 123 then setting, theta 0 equals 0, theta 1 equals" "1, this gives me exactly the 45-degree line that fits my data set perfectly." "Whereas in contrast if I set theta equals say 0, 0, then this hypothesis is predicting zeroes on everything the same, theta 0 equals 0, theta 1 equals 0 and" "I compute the cost function then it's 2.333 and that's actually equal to 1 squared," "which is my squared error on the first example, plus 2 squared," "plus 3 squared and then divided by 2m, which is" "2 times number of training examples, which is indeed 2.33 and" "so, that sanity checks that this function here is, you know, computing the correct cost function and these are the couple examples we tried out on our simple training example." "And so that sanity tracks that the cost function J," "as defined here, that it is indeed, you know, seeming to compute the correct cost function, at least on our simple training set" "that we had here with X and Y being this simple training example that we solved." "So, now you know how to right control statements like for loops, while loops and if statements in octave as well as how to define and use functions." "In the next video, I'm going to just very quickly step you through the logistics of working on and submitting problem sets for this class and how to use our submission system." "And finally, after that, in the final octave tutorial video," "I wanna tell you about vectorization, which is an idea for how to make your octave programs run much fast." "In this video, I'd like to tell you about the idea of vectorization." "So, whether you're using Octave or a similar language like MATLAB or whether you're using Python and NumPy or Java CC++." "All of these languages have either built into them or have readily and easily accessible, different numerical linear algebra libraries." "They're usually very well written, highly optimized, often so that developed by people that, you know, have PhDs in numerical computing or they are really specializing numerical computing." "And when you're implementing machine learning algorithms, if you're able to take advantage of these linear algebra libraries or these numerical linear algebra libraries and mix the routine calls to them rather than sort of right call yourself to do things that these libraries could be doing." "If you do that then often you get that "first is more efficient"." "So, just run more quickly and take better advantage of any parallel hardware your computer may have and so on." "And second, it also means that you end up with less code that you need to write." "So have a simpler implementation that is, therefore, maybe also more likely to be bug free." "And as a concrete example." "Rather than writing code yourself to multiply matrices, if you let Octave do it by typing a times b, that will use a very efficient routine to multiply the 2 matrices." "And there's a bunch of examples like these where you use appropriate vectorized implementations." "You get much simpler code, and much more efficient code." "Let's look at some examples." "Here's a usual hypothesis of linear regression and if you want to compute H of" "X, notice that there is a sum on the right." "And so one thing you could do is compute the sum from J equals 0 to J equals N yourself." 
"Another way to think of this is to think of h of x as theta transpose x and what you can do is think of this as you know, computing this in a product between 2 vectors where theta is, you know, your" "vector say theta 0, theta 1, theta 2 if you have 2 features." "If n equals 2 and if you think of x as this vector, x0, x1, x2" "and these 2 views can give you 2 different implementations." "Here's what I mean." "Here's an unvectorized implementation for how to compute h of x and by unvectorized I mean, without vectorization." "We might first initialize, you know, prediction to be 0.0." "This is going to eventually, the prediction is going to be h of x and then" "I'm going to have a for loop for j equals one through n+1 prediction gets incremented by theta j times xj." "So, it's kind of this expression over here." "By the way, I should mention in these vectors right over here, I had these vectors being 0 index." "So, I had theta 0 theta 1, theta 2, but because MATLAB is one index, theta 0 in MATLAB, we might end up representing as theta" "1 and this second element ends up as theta" "2 and this third element may end up as theta" "3 just because vectors in" "MATLAB are indexed starting from 1 even though our real theta and x here starting, indexing from 0, which is why here I have a for loop j goes from 1 through n+1 rather than j go through" "0 up to n, right?" "But so, this is an unvectorized implementation in that we have a for loop that summing up the n elements of the sum." "In contrast, here's how you write a vectorized implementation which is that you would think of x and theta as vectors, and you just set prediction equals theta transpose times x." "You're just computing like so." "Instead of writing all these lines of code with the for loop, you instead have one line of code and what this line of code on the right will do is it use" "Octaves highly optimized numerical linear algebra routines to compute this inner product between the two vectors, theta and X. And not only is the vectorized implementation simpler, it will also run more efficiently." "So, that was Octave, but issue of vectorization applies to other programming languages as well." "Let's look at an example in C++." "Here's what an unvectorized implementation might look like." "We again initialize prediction, you know, to 0.0 and then we now have a full loop for J0 up to" "n." "Prediction + equals theta j times x j where again, you have this x + for loop that you write yourself." "In contrast, using a good numerical linear algebra library in" "C++, you could use write the function like or rather." "In contrast, using a good numerical linear algebra library in" "C++, you can instead write code that might look like this." "So, depending on the details of your numerical linear algebra library, you might be able to have an object that is a C++ object which is vector theta and a C++ object which is a vector X," "and you just take theta dot transpose times x where this times becomes C++ to overload the operator so that you can just multiply these two vectors in C++." "And depending on, you know, the details of your numerical and linear algebra library, you might end up using a slightly different and syntax, but by relying on a library to do this in a product." "You can get a much simpler piece of code and a much more efficient one." "Let's now look at a more sophisticated example." 
"Just to remind you here's our update rule for gradient descent for linear regression and so, we update theta j using this rule for all values of J equals 0, 1, 2, and so on." "And if I just write out these equations for theta 0 Theta one, theta two." "Assuming we have two features." "So N equals 2." "Then these are the updates we perform to theta zero, theta one, theta two." "where you might remember my saying in an earlier video that these should be simultaneous updates." "So let's see if we can come up with a vectorized implementation of this." "Here are my same 3 equations written on a slightly smaller font and you can imagine that 1 wait to implement this three lines of code is to have a for loop that says, you know, for j equals 0," "1 through 2 the update theta J or something like that." "But instead, let's come up with a vectorized implementation and see if we can have a simpler way." "So, basically compress these three lines of code or a for loop that, you know, effectively does these 3 sets, 1 set at a time." "Let's see who can these 3 steps and compress them into" "1 line of vectorized code." "Here's the idea." "What I'm going to do is I'm going to think of theta as a vector and I'm going to update theta as theta" "minus alpha times some other vector, delta, where" "delta is going to be equal to 1 over m, sum from I equals one through m and then this term on the right, okay?" "So, let me explain what's going on here." "Here, I'm going to treat theta as a vector" "so, there's an N+1 dimensional vector." "I'm saying that theta gets, you know, updated as--that's the vector, our N+1." "Alpha is a real number and delta here is a vector." "So, this subtraction operation, that's a vector subtraction." "Okay?" "Because alpha times delta is a vector and so" "I'm saying if theta gets, you know, this vector, alpha times delta subtracted from it." "So, what is the vector delta?" "Well, this vector delta looks like this." "And what this meant to be is really meant to be" "this thing over here." "Concretely, delta will be a N+1 dimensional vector and the very first element of the vector delta is going to be equal to that." "So, if we have the delta, you know, if we index it from 0--this is delta 0, delta 1, delta 2." "What I want is that delta 0 is equal to, you know, this first box also green up above and indeed, you might be able to convince yourself that delta" "0 is this 1 of m, sum of, you know, h of x. xi minus yi times xi0." "So, let's just make sure that we're on the same page about how delta really is computed." "Delta is one of m times the sum over here" "and, you know, what is this sum?" "Well, this term over here, that's a real number." "And the second term over here, xi." "This term over there is a vector, right?" "Because xi might be a vector." "That would be xi0, xi1, xi2 right?" "And what is the summation?" "Well, what does summation say is that this term" "over here." "This is equal to h+x1-y1 times x1 + h of" "x2-y2 times x2" "+ you know, and so on." "Okay?" "Because this is a summation of the I. So, as I ranges from I1 through m, you get these different terms and you're summing up these terms." "And the meaning of each of these terms is a lot like" " if you remember actually from the earlier quiz in this, if you solve this equation." "We said that in order to vectorize this code, we will instead set u2v+5w." "So, we're saying that the vector u is equal to 2 times the vector v plus 5 times the vector w." 
"So, just an example of how to add different vectors and this summation is the same thing." "It's a saying that this summation over here is just some real number right?" "That's kind of like the number 2 and some other number times the vector x1." "This is like 2 times v instead with some other number times x1" "and then plus, you know, instead of 5xw, we instead have some other real number plus some other vector and then you add on other vectors, you know, plus ... plus the other vectors, which is why" "overall, this thing over here, that whole quantity, that delta is just some vector, and concretely, the" "3 elements of delta correspond if n2, the 3 elements of delta correspond exactly to this thing to the second thing and this third thing, which is why when you update theta, according to theta minus alpha delta," "we end up having exactly the same simultaneous updates as the update rules that we have on top." "So, I know that there was a lot that happened on the slides, but again, feel free to pause the video and" "I either encourage you to step through the difference." "If you're unsure of what just happen," "I encourage you to step through the slide to make sure you understand why is it that this update here with this definition of delta, right?" "Why is it that that equal to this update on top and it's still not clear when insight is that, you know, this thing over here." "That's exactly the vector x and so, we're just taking, you know, all" "3 of these computations and compressing them into one step with the this vector delta, which is why we can come up with a vectorized implementation of this step of linear regression this way." "So I hope this step makes sense, and do look at the video and make sure and see if you can understand it." "In case you don't understand The equivalence of this math if you implement this, this turns out to be the right answer anyway, so even if you didn't quite understand the equivalence, if you just implement it this way," "you'll be able to get linear regressions to work." "So, if you're able to figure out why these 2 steps are equivalent then hopefully that would give you a better understanding of vectorization as well, and finally, if you're implementing linear regression using more than one or two features." "So, sometimes we use linear regression with tens or hundreds thousands of features, but if you use the vectorized implementation of linear regression, usually that will run much faster than if you had say your old for loop that was you" "know, updating theta 0 then theta 1 then theta 2 yourself." "So, using a vectorized implementation, you should be able to get a much more efficient implementation of linear regression." "And when you vectorize later algorithms that we'll see in this class is a good trick whether an octave or some of the language, the C++" "Java for getting your code to run more efficiently." "In this video, I want to just quickly step you through the logistics of how to work on homeworks in this class and how to use the submission system which will let you verify right away that you got the right answer for your machine learning program exercise." "Here's my Octave window and let's first go to my desktop." "I saved the files for my first exercise, some of the files on my desktop:" "in this directory, 'ml-class-ex1'." "And we provide a number files and ask you to edit some of them." "So the first file should meet the details in the pdf file for this programming exercise." 
"But one of the files we ask you to edit is this file called warmUpExercise.m, where the exercise is really just to make sure that you're familiar with the submission system." "And all you need to do is return the 5x5 identity matrix." "So the solution to this exercise I just showed you is to write A = eye(5)." "So that modifies this function to generate the 5x5 identity matrix." "And this function warmUpExercise() now returns the 5x5 identity matrix." "And I'm just going to save it." "So I've done the first part of this homework." "Going back to my Octave window, let's now go to my directory, 'C:\Users\ang\Desktop\ml-class-ex1'." "And if I want to make sure that I've implemented this, type 'warmUpExercise()' like so." "And yup, it returns the 5x5 identity matrix that we just wrote the code to create." "And I can now submit the code as follows." "I'm going to type 'submit()' in this directory and I'm ready to submit part 1 so I'm going to enter choice '1'." "So it asks me for my email address." "I'm going go to the course website." "This is an internal testing site, so your version of the website may look a little bit different." "But that's my email address and this is my submission password, and I'm just going to type them in here." "So I have ang@cs.stanford.edu and my submission password is 9yC75USsGf." "I'm going to hit enter; it connects to the server and submits it, and right away it tells you "Congratulations!" "You have successfully completed Homework 1 Part 1"." "And this gives you a verification that you got this part right." "And if you don't submit the right answer, then it will give you a message indicating that you haven't quite gotten it right yet." "And you can use this submission password and you can generate new passwords; it doesn't matter." "But you can also use your regular website login password, but because this password here is typed in clear text on your monitor, we gave you this extra submission password in case you don't want to type in your website's normal password onto a window" "that, depending on your operating system, may or may not appear as text when you type it into the Octave submission script." "So, that's how you submit the homeworks after you've done it." "Good luck, and, when you get around to homeworks, I hope you get all of them right." "And finally, in the next and final Octave tutorial video, I want to tell you about vectorization, which is a way to get your Octave code to run much more efficiently." "In this and the next few videos, I want to start to talk about classification problems, where the variable y that you want to predict is discreet valued." "We'll develop an algorithm called logistic regression, which is one of the most popular and most widely used learning algorithms today." "Here are some examples of classification problems." "Earlier, we talked about emails, spam classification as an example of a classification problem." "Another example would be classifying online transactions." "So, if you have a website that sells stuff and if you want to know if a physical transaction is fraudulent or not, whether someone has, you know, is using a stolen credit card or has stolen the user's password." "That's another classification problem, and earlier we also talked about the example of classifying tumors" "as a cancerous malignant or as benign tumors." 
"In all of these problems, the variable that we're trying to predict is a variable" "Y that we can think of as taking on two values, either zero or one, either a spam or not spam, fraudulent or not fraudulent, malignant or benign." "Another name for the class that we denote with 0 is the negative class, and another name for the class that we denote with 1 is the positive class." "So 0 may denote the benign tumor and 1 positive class may denote a malignant tumor." "The assignment of the 2 classes, you know, spam, no spam, and so on - the assignment of the 2 classes to positive and negative, to 0 and 1 is somewhat arbitrary and it doesn't really matter." "But often there is this intuition that the negative class is conveying the absence of something, like the absence of a malignant tumor, whereas one, the positive class, is conveying the presence of something that we may be looking for." "But the definition of which is negative and which is positive is somewhat arbitrary and it doesn't matter that much." "For now, we're going to start with classification problems with just two classes; zero and one." "Later on, we'll talk about multi-class problems as well, whether variable" "Y may take on say, for value zero, one, two and three." "This is called a multi-class classification problem, but for the next few videos, let's start with the two class or the binary classification problem." "and we'll worry about the multi-class setting later." "So, how do we develop a classification algorithm?" "Here's an example of a training set for a classification" "task for classifying a tumor as malignant or benign and notice that malignancy takes on only two values zero or no or one or one or yes." "So, one thing we could do given this training set is to apply the algorithm that we already know, linear regression to this data set and just try to fit the straight line to the data." "So, if you take this training set and fill a straight line to it, maybe you get hypothesis that looks like that." "Alright, so that's my hypothesis, h of x equals theta transpose x." "If you want to make predictions, one thing you could try doing is then threshold the classifier outputs at 0.5." "That is at the vertical access value 0.5." "And if the hypothesis outputs a value that's greater than equal to 0.5 you predict y equals one." "If it's less than 0.5, you predict y equals zero." "Let's see what happens when we do that." "So, let's take 0.5, and so, you know, that's where the threshold is." "And thus, using linear regression this way." "Everything to the right of this point, we will end up predicting as the positive class because of the output values are greater than 0.5 on the vertical axis and everything to the left of that point we will end" "up predicting as a negative value." "In this particular example, it looks like linear regression is actually doing something reasonable even though this is a classification task we're interested in." "But now let's try changing problem a bit." "Let me extend out the horizontal axis of orbit and let's say we got one more training example way out there on the right." "Notice that that additional training example, this one out here, it doesn't actually change anything, right?" "Looking at the training set, it is pretty clear what a good hypothesis is." 
"Well, everything to the right of somewhere around here to the right of this we should predict as positive, and everything to the left we should probably predict as negative because from this training set it looks like all the tumors larger than, you" "know, a certain value around here are malignant, and all the tumors smaller than that are not malignant, at least for this training set." "But once we've added that extra example out here, if you now run linear regression, you instead get a straight line fit to the data." "That might maybe look like this, and if you now threshold this hypothesis" "at 0.5, you end up with a threshold that's around here so that everything to the right of this point you predict as positive, and everything to the left of that point you predict as negative." "And this seems a pretty bad thing for linear regression to have done, right?" "Because, you know, these are our positive examples, these are our negative examples." "It's pretty clear, we should really be separating the two classes somewhere around there, but somehow by adding one example way out here to the right, this example really isn't giving us any new information." "I mean, it should be no surprise to the learning out of that the example way out here turns out to be malignant." "But somehow adding that example out there caused linear regression to change in straight line fit to the data from this" "magenta line out here to this blue line over here, and caused it to give us a worse hypothesis." "So, applying linear regression to a classification problem usually isn't, often isn't a great idea." "In the first instance, in the first example before I added this extra training example," "previously linear regression was just getting lucky and it got us a hypothesis that, you know, worked well for that particular example, but usually apply linear regression to a data set, you know, you might get lucky but" "often it isn't a good idea, so I wouldn't use linear regression for classification problems." "Here is one other funny thing about what would happen if we were to use linear regression for a classification problem." "For classification, we know that" "Y is either zero or one, but if you are using linear regression, well the hypothesis" "can output values much larger than one or less than zero, even if all of good the training examples have labels" "Y equals zero or one, and it seems kind of strange that even though we know that the label should be zero one, it seems kind of strange if the algorithm can offer values much larger than one or much smaller than zero." "So what we'll do in the next few videos is develop an algorithm called logistic regression which has the property that the output, the predictions of logistic regression are always between zero and one, and doesn't become" "bigger than one or become less than zero and by the way, logistic regression is and we will use it as a classification algorithm in some, maybe sometimes confusing that the term regression appears in his name, even though logistic regression" "is actually a classification algorithm." "But that's just the name it was given for historical reasons so don't be confused by that." "Logistic Regression is actually a classification algorithm that we apply to settings where the label Y is discreet valued." "The 1001." "So hopefully you now know why if you have a causation problem using linear regression isn't a good idea ." "In the next video we'll start working out the details of the logistic regression algorithm." 
"Let's start talking about logistic regression." "In this video, I'd like to show you the hypothesis representation, that is, what is the function we're going to use to represent our hypothesis where we have a classification problem." "Earlier, we said that we would like our classifier to" "output values that are between zero and one." "So, we like to come up with a hypothesis that satisfies this property, that these predictions are maybe between zero and one." "When we were using linear regression, this was the form of a hypothesis, where H of X is theta transpose X. For logistic regression, I'm going to modify this a little bit, and make the hypothesis" "G of theta transpose X, where I'm going to define the function G as follows:" "G of Z if Z is a real number is equal to one over one plus" "E to the negative Z. This called the sigmoid function" "or the logistic function." "And the term logistic function, that's what give rise to the name logistic progression." "And, by the way, the terms sigmoid function and logistic function are basically synonyms and mean the same thing." "So the two terms are basically interchangeable and either term can be used to refer to this function" "G. And if we take these two equations, and put them together, then here's just an alternative way of writing out the form of my hypothesis." "I'm saying that H of x is one over one plus" "E to the negative theta transpose" "X, and all I've done is" "I've taken the variable" "Z, Z here's a real number and plugged in theta transpose X, so" "I end up with, you know, theta transpose" "X, in place of Z there." "Lastly, let me show you where the sigmoid function looks like." "We're going to plot it on this figure here." "The sigmoid function, G of" "Z, also called the logistic function, looks like this." "It starts off near zero and then rises until it processes" "0.5 at the origin and then it flattens out again like so." "So that's what the sigmoid function looks like." "And you notice that the sigmoid function, well, it asymptotes at one, and asymptotes at zero as" "Z against the horizontal axis is Z. As Z goes to minus infinity, G of" "Z approaches zero and as" "G of Z approaches infinity, G of Z approaches 1, and so because G of" "Z offers values that are between 0 and 1 we" "also have that H of" "X must be between 0 and 1." "Finally, given this hypothesis representation, what we need to do, as before, is fit the parameters theta to our data." "So given a training set, we need to pick a value for the parameters theta and this hypothesis will then let us make predictions." "We'll talk about a learning algorithm later for fitting the parameters theta." "But first let's talk a bit about the interpretation of this model." "Here's how I'm going to interpret the output of my hypothesis H of" "X. When my hypothesis outputs some number, I am going to treat that number as the estimated probability that Y is equal to one on a new input example X. Here is what I mean." "Here is an example." "Let's say we're using the tumor classification example." "So we may have a feature vector X, which is this x01 as always and then our one feature is the size of the tumor." "Suppose I have a patient come in and, you know they have some tumor size and I feed their feature vector X into my hypothesis and suppose my hypothesis outputs the number 0.7." "I'm going to interpret my hypothesis as follows." "I'm going to say that this hypothesis is telling me that for a patient with features X, the probability that Y equals one is 0 .7." 
"In other words, I'm going to tell my patient that the tumor, sadly, has a 70% chance or a 0.7 chance of being malignant." "To write this out slightly more formally or to write this out in math, I'm going to interpret my hypothesis output as P of y equals 1, given X" "parametrized by theta." "So, for those of you that are familiar with probability, this equation might make sense, if you're a little less familiar with probability, you know, here's how I read this expression, this is the probability that y is" "equals to one, given x instead of given that my patient has, you know, features X." "Given my patient has a particular tumor size represented by my features X, and this probability is parametrized by theta." "So I'm basically going to count on my hypothesis to give me estimates of the probability" "that Y is equal to 1." "Now since this is a classification task, we know that Y must be either zero or one, right?" "Those are the only two values that Y could possibly take on, either in the training set or for new patients that may walk into my office or into the doctor's office in the future." "So given H of X, we can therefore compute the probability" "that Y is equal to zero as well." "Concretely, because Y must be either zero or one, we know that the probability of Y equals zero, plus the probability of Y equals one, must add up to one." "This first equation looks a little bit more complicated but it's basically saying that probability of" "Y equals zero for a particular patient with features x, and you know, given our parameter's data, plus the probability of Y equals one for that same patient which features x and you parameters theta must add" "up to one, if this equation looks a little bit complicated feel free to mentally imagine it without that X and theta." "And this is just saying that the probability of Y equals zero plus the probability of Y equals one must be equal to one." "And we know this to be true because Y has to be either zero or one." "And so the chance of Y being zero plus the chance that Y is one, you know, those two must add up to one." "And so if you just take this term and move it to the right-hand side, then you end up with this equation that says probability Y equals zero is one minus probability y equals and thus if our hypothesis if H of X" "gives us that term you can therefore quite simply compute the probability, or compute the estimated probability that Y is equal to zero as well." "So you now know what the hypothesis representation is for logistic regression and we're seeing what the mathematical formula is defining the hypothesis for logistic regression." "In the next video, I'd like to try to give you better intuition about what the hypothesis function looks like." "And I want to tell you something called the decision boundary and we'll look at some visualizations together to try to get a better sense of what this hypothesis function of logistic regression really looks like." "In the last video, we talked about the hypothesis representation for logistic progression." "What I'd like to do now is tell you about something called the decision boundary, and this will give us a better sense of what the logistic regression hypothesis function is computing." "To recap, this is what we wrote out last time, where we said that the hypothesis is represented as" "H of X equals G of theta transpose X, where G is this function called the sigmoid function, which looks like this." "So, it slowly increases from zero to one, asymptoting at one." 
"What I want to do now is try to understand better when this hypothesis will make predictions that Y is equal to one versus when it might make predictions that Y is equal to zero and understand better what the hypothesis function" "looks like, particularly when we have more than one feature." "Concretely, this hypothesis is out putting estimates of the probability that Y is equal to one given X is prime." "So if we wanted to predict is Y equal to one or is Y equal to zero here's something we might do." "When ever the hypothesis its that the problem with y being one is greater than or equal to 0.5 so this means that it is more likely to be y equals one than y equals zero then let's predict Y equals one." "And otherwise, if the probability of, the estimated probability of" "Y being one is less than 0.5, then let's predict Y equals zero." "And I chose a greater than or equal to here and less than here." "If H of X is equal to 0.5 exactly, then we could predict positive or negative vector but a put a great deal on to here so we default maybe to predicting a positive if your vector is 0.5 but that's" "a detail that really doesn't matter that much." "What I want to do is understand better when it is exactly that H of" "X will be greater or equal to 0.5, so that we end up predicting Y is equal to one." "If we look at this plot of the sigmoid function, we'll notice that the sigmoid function, G" "of Z, is greater than or equal to 0.5 whenever Z is" "greater than or equal to zero." "So is in this half of the figure that, G takes on values that are 0.5 and higher." "This is not clear, that's the 0.5." "So when Z is positive, G of Z, the sigmoid function, is greater than or equal to 0.5." "Since the hypothesis for logistic regression is H of" "X equals G of theta transpose X. This is therefore going to be greater than or equal to 0.5 whenever theta transpose" "X is greater than or equal to zero." "So what was shown, right, because here theta transpose X takes the row of Z." "So what we're shown is that our hypothesis is going to predict Y equals one whenever theta transpose X is greater than or equal to 0." "Let's now consider the other case of when a hypothesis will predict Y is equal to 0." "Well, by similar argument, H of X is going to be less than 0.5 whenever G of Z is less than" "0.5 because the range of values of Z that calls Z to take on values less that 0.5, well that's when Z is negative." "So when G of Z is less than 0.5." "Our hypothesis will predict that Y is equal to zero, and by similar argument to what we had earlier, H of" "X is equal G of theta transpose X. And so, we'll predict Y equals zero whenever this quantity theta transpose X is less than zero." "To summarize what we just worked out, we saw that if we decide to predict whether" "Y is equal to one or" "Y is equal to zero, depending on whether the estimated probability is greater than or equal 0.5, or whether it's less than 0.5, then that's the same as saying that will predict Y equals 1 whenever theta transpose axis greater" "than or equal to 0, and we will predict Y is equal to zero whenever theta transpose X is less than zero." "Let's use this to better understand how the hypothesis of logistic regression makes those predictions." "Now, let's suppose we have a training set like that shown on the slide, and suppose our hypothesis is H of" "X equals G of theta zero, plus theta one X1 plus theta two X2." "We haven't talked yet about how to fit the parameters of this model." "We'll talk about that in the next video." 
"But suppose that variable procedure to be specified, we end up choosing the following values for the parameters." "Let's say we choose theta zero equals three, theta one equals one, theta two equals one." "So this means that my parameter vector is going to be theta equals minus 311." "So, we're given this choice of my hypothesis parameters," "let's try to figure out where a hypothesis will end up predicting y equals 1 and where it will end up predicting y equals 0." "Using the formulas that we worked on the previous slide, we know that Y equals 1 is more likely, that is the probability that Y equals" "1 is greater than 0.5 or greater than or equal to 0.5." "Whenever theta transpose x is greater than zero." "And this formula that I just underlined minus three plus X1 plus X2 is, of course, theta transpose" "X when theta is equal to this value of the parameters that we just chose." "So, for any example, for any example with features X1 and X2 that satisfy this equation that minus 3 plus X1 plus X2" "is greater than or equal to 0." "Our hypothesis will think that Y equals 1 is more likely, or will predict that Y is equal to one." "We can also take minus three and bring this to the right and rewrite this as X1 plus X2 is greater than or equal to three." "And so, equivalently, we found that this hypothesis will predict" "Y equals one whenever X1 plus X2 is greater than or equal to three." "Let's see what that means on the figure." "If I write down the equation," "X1 plus X2 equals three, this defines the equation of a straight line." "And if I draw what that straight line looks like, it gives me the following line which passes through 3 and 3 on the X1 and the X2 axis." "So the part of the input space, the part of the" "X1, X2 plane that corresponds to when X1 plus X2 is greater than or equal to three." "That's going to be this very top plane." "That is everything to the up, and everything to the upper right portion of this magenta line that I just drew." "And so, the region where our hypothesis will predict Y equals 1 is this region, you know, is really this huge region, this half-space over to the upper right." "And let me just write that down." "I'm gonna call this the Y equals one region, and in contrast the region where" "X1 plus X2 is less than three, that's when we will predict that Y," "Y is equal to zero, and that corresponds to this region." "You know, itt's really a half-plane, but that region on the left is the region where our hypothesis predict Y equals 0." "I want to give this line, this magenta line that I drew a name." "This line there is called the decision boundary." "And concretely, this straight line" "X1 plus X equals 3." "That corresponds to the set of points." "So that corresponds to the region where H of X is equal to 0.5 exactly and the decision boundary, that is this straight line, that's the line that separates the region where the hypothesis predicts Y equals one from the region" "where the hypothesis predicts that Y is equal to 0." "And just to be clear." "The decision boundary is a property of the hypothesis" "including the parameters theta 0, theta 1, theta 2." "And in the figure I drew a training set." "I drew a data set in order to help the visualization." "But even if we take away the data set, you know decision boundary and a region where we predict Y equals 1 versus Y equals zero." "That's a property of the hypothesis and of the parameters of the hypothesis, and not a property of the data set." 
"Later on, of course, we'll talk about how to fit the parameters and there we'll end up using the training set, or using our data, to determine the value of the parameters." "But once we have particular values for the parameters: theta 0, theta 1, theta 2." "Then that completely defines the decision boundary and we don't actually need to plot a training set in order to plot the decision boundary." "Let's now look at a more complex example where, as usual, I have crosses to denote my positive examples and" "O's to denote my negative examples." "Given a training set like this, how can I get logistic regression to fit this sort of data?" "Earlier, when we were talking about polynomial regression or when we're linear regression, we talked about how we can add extra higher order polynomial terms to the features." "And we can do the same for logistic regression." "Concretely, let's say my hypothesis looks like this." "Where I've added two extra features, X1 squared and X2 squared, to my features." "So that I now have 5 parameters, theta 0 through theta 4." "As before, we'll defer to the next video our discussion on how to automatically choose values for the parameters theta 0 through theta 4." "But let's say that very procedure to be specified," "I end up choosing theta 0 equals minus 1, theta 1 equals 0, theta 2 equals 0, theta 3 equals" "1, and theta 4 equals 1." "What this means is that with this particular choice of parameters, my parameter vector theta looks like minus 1, 0, 0, 1, 1." "Following our earlier discussion, this means that my hypothesis will predict that Y is equal to 1 whenever minus 1 plus X1 squared plus X2 squared is greater than or equal to 0." "This is whenever theta transpose times my theta transpose my features is greater than or equal to 0." "And if I take minus 1 and just bring this to the right, I'm saying that my hypothesis will predict that" "Y is equal to 1 whenever X1 squared plus" "X2 squared is greater than or equal to 1." "So, what does decision boundary look like?" "Well, if you were to plot the curve for X1 squared plus" "X2 squared equals 1." "Some of you will that is the equation for a circle of radius" "1 centered around the origin." "So, that is my decision boundary." "And everything outside the circle I'm going to predict as Y equals 1." "So out here is, you know, my" "Y equals 1 region." "I'm going to predict Y equals 1 out here." "And inside the circle is where" "I'll predict Y is equal to 0." "So, by adding these more complex or these polynomial terms to my features as well." "I can get more complex decision boundaries that don't just try to separate the positive and negative examples of a straight line." "I can get in this example a decision boundary that's a circle." "Once again the decision boundary is a property not of the training set, but of the hypothesis and of the parameters." "So long as we've given my parameter vector theta, that defines the decision boundary which is the circle." "But the training set is not what we use to define decision boundary." "The training set may be used to fit the parameters theta." "We'll talk about how to do that later." "But once you have the parameters theta, that is what defines the decision boundary." "Let me put back the training set just for visualization." "And finally, let's look at a more complex example." "So can we come up with even more complex decision boundaries and this?" "If I have even higher order polynomial terms, so things like X1 squared, X1 squared X2, X1 squared" "X2 squared, and so on." 
"If I have much higher order polynomials, then it's possible to show that you can get even more complex decision boundaries and logistic regression can be used to find the zero boundaries that may, for example, be" "an ellipse like that, or maybe with a different setting of the parameters, maybe you can get instead a different decision boundary that may even look like, you know, some funny shape like that." "Or for even more complex examples you can also get decision boundaries" "that can look like, you know, more complex shapes like that." "Where everything in here you predict Y equals 1, and everything outside you predict Y equals 0." "So these higher order polynomial features you can get very complex decision boundaries." "So with these visualizations, I hope that gives you a what's the range of hypothesis functions we can represent using the representation that we have for logistic regression." "Now that we know what H of X can represent." "What I'd like to do next in the following video is talk about how to automatically choose the parameters theta." "So that given a training set we can automatically fit the parameters to our data." "In this video we'll talk about how to fit the parameters theta for logistic regression." "In particular, I'd like to define the optimization objective or the cost function that we'll use to fit the parameters." "Here's to supervised learning problem of fitting a logistic regression model." "We have a training set of M training examples." "And as usual each of our examples is represented by feature vector that's N plus 1 dimensional." "And as usual we have" "X 0 equals 1." "Our first feature, or our 0 feature is always equal to 1, and because this is a classification problem, our training set has the property that every label Y, is either 0 or 1." "This is a hypothesis and the parameters of the hypothesis is this theta over here." "And the question I want to talk about is given this training set how do we choose, or how do we fit the parameters theta?" "Back when we were developing the linear regression model, we use the following cost function." "I've written this slightly differently, where instead of 1/2m, I've taken the 1/2 and put it inside the summation instead." "Now, I want to use an alternative way of writing out this cost function which is that instead of writing out this squared and return here, let's write here, cost of" "H of X comma" "Y, and I'm going to define that term cost" "of H of X comma Y to be equal to this." "It's just equal to just one half of the square root error." "So now, we can see more clearly that the cost function is a sum over my training set, or is 1/m times the sum over my training set of this cost term here." "And to simplify this equation a little bit more, it's gonna be convenient to get rid of those superscripts." "So just define cost of" "H of X comma Y to be equal to 1/2 of this square root error and the interpretation of this cost function is that this is the cost I want my learning algorithm to, you know, have to pay, if it" "outputs that value it this prediction is H of" "X, and the actual label was Y. So just cross off those superscripts." "All right." "And no surprise for linear regression the cost for you to define is that." "Well the cost for this is, that is 1/2 times the square difference between what are predicted and the actual value that we observe for Y. Now, this cost function worked fine for linear regression, but here we're interested in logistic regression." 
"If we could minimize this cost function that is plugged into J here." "That will work okay." "But it turns out that if we use this particular cost function this would be a non-convex function of the parameters theta." "Here's what I mean by non-convex." "We have some cost function J of theta, and for logistic regression this function H here" "has a non linearity, right?" "It says, you know, 1 over 1 plus E to the negative theta transfers" "X. So it's a pretty complicated nonlinear function." "And if you take the sigmoid function and plug it in here and then take this cost function and plug it in there, and then plot what J of theta looks like, you find that" "J of theta can look like a function just like this." "You know with many local optima and the formal term for this is that this a non convex function." "And you can kind of tell." "If you were to run gradient descent on this sort of function, it is not guaranteed to converge to the global minimum." "Whereas in contrast, what we would like is to have a cost function J of theta that is convex, that is a single bow-shaped function that looks like this, so that if you run gradient descent, we" "would be guaranteed that gradient descent, you know, would converge to the global minimum." "And the problem of using the the square cost function is that because of this very non linear sigmoid function that appears in the middle here, J of theta ends up being a non convex function if you were to define it as the square cost function." "So what we'd would like to do is to instead come up with a different cost function that is convex and so that we can apply a great algorithm like gradient descent and be guaranteed to find a global minimum." "Here's a cost function that we're going to use for logistic regression." "We're going to say the cost or the penalty that the algorithm pays if it outputs a value H of X." "So, this is some number like 0.7 where it predicts a value H of X. And the actual cost label turns out to be Y. The cost is going to be minus log" "H of X if Y is equal 1." "And minus log, 1 minus" "H of X if Y is equal to 0." "This looks like a pretty complicated function." "But let's plot function to gain some intuition about what it's doing." "Let's start up with the case of Y equals 1." "If Y is equal equal to 1, then the cost function" "is -log H of X, and if we plot that, so let's say that the horizontal axis is H of X." "So we know that a hypothesis is going to output a value between" "0 and 1." "Right?" "So H of X that varies between 0 and 1." "If you plot what this cost function looks like." "You find that it looks like this." "One way to see why the plot like this it is because" "if you were to plot log Z with Z on the horizontal axis." "Then that looks like that." "And it's approach is minus infinity." "So this is what the log function looks like." "And so this is 0, this is 1." "Here Z is of course playing the row of" "H of X. And so minus log Z will look like this." "Right just flipping the sign." "Minus log Z. And we're interested only in the range of when this function goes between 0 and 1." "So, get rid of that." "And so, we're just left with, you know, this part of the curve." "And that's what this curve on the left looks like." "Now this cost function has a few interesting and desirable properties." "First you notice that if" "Y is equal to 1 and H of X is equal 1, in other words, if the hypothesis exactly, you know, predicts" "H equals 1, and Y is exactly equal to what I predicted." "Then the cost is equal 0." "Right?" 
"That corresponds to, the curve doesn't actually flatten out." "The curve is still going." "First, notice that if H of X equals 1, if the hypothesis predicts that Y is equal to 1." "And if indeed Y is equal to 1 then the cost is equal to 0." "That corresponds to this point down here." "Right?" "If H of X is equal to 1, and we're only concerned the case that Y equals 1 here." "But if H of X is equal to 1." "Then the cost is down here is equal to 0." "And that is what we like it to be." "Because, you know, if we correctly predict the output Y then the cost is 0." "But now, notice that H of X approaches 0." "So, that's H. As the output of the hypothesis approaches 0 the cost blows up, and it goes to infinity." "And what this does is it captures the intuition that if a hypothesis, you know, outputs 0." "That's like saying, our hypothesis is saying, the chance of Y equals 1 is equal to 0." "It's kind of like our going to our medical patient and saying," ""The probability that you have a malignant tumor, the probability that Y equals 1 is zero."" "So, it's like absolutely impossible that your tumor is malignant." "But if it turns out that the tumor, the patient's tumor, actually is malignant." "So if Y is equal to 1 even after we told them you know, the probability of it happening is 0." "It's absolutely impossible for it to be malignant." "But if we told them this with that level of certainty, and we turn out to be wrong, then we penalize the learning algorithm by a very, very large cost, and that's captured by having this" "cost goes infinity if Y equals 1 and H of X approaches 0." "This might consider of" "Y1, let's look at what the cost function looks like for Y0." "If Y is equal to 0, then the cost looks like this expression over here." "And if you plot the function minus log 1 minus Z what you get is the cost function actually looks like this." "So, it goes from 0 to 1." "Something like that." "And so if you plot the cost function for the case of y equals zero, you find that it looks like this and what this curve does is it now blows up, and it goes to plus infinity as H of X goes to 1." "Because it's saying that if Y turns out to be equal to 0, but we predicted that you know, Y is equal to 1 with almost certainty with probability 1, then we end up paying a very large cost." "Let's plot the cost function for the case of Y equals 0." "So if Y equals 0 that's going to be our cost function." "If you look at this expression, and if you plot, you know, minus log 1 minus Z, if you figure out what that looks like, you get a figure that looks like this." "Where, which goes from 0 to 1 with the Z axis on the horizontal axis." "So If you take this cost function and plot it for the case of Y equals 0, what you get is that the cost function looks like this." "And what this cost function does is that it blows up or it goes to a positive infinity as each" "H of X goes to one and this captures the intuition that if a hypothesis predicted that, you know, H of" "X is equal to 1 with certainty, with like probability 1, it's absolutely got to be Y equals 1." "But if Y turned out to be equal to 0 then it makes sense to make the hypothesis, or make the learning algorithm pay a very large cost." "And conversely, if H of X is equal to" "0 and Y equals zero, then the hypothesis nailed it." "The predicted Y is equal to zero and it turns out Y is equal to zero so at this point the cost function is going to be 0." "In this video, we have defined the cost function for a single training example." 
"The topic of convexity analysis is beyond the scope of this course." "But it is possible to show that with our particular choice of cost function this would give us a convex optimization problem" "as cost function, overall cost function" "J of theta will be convex and local optima free." "In the next video we're going to take these ideas of the cost function for a single training example and develop that further and define the cost function for the entire training set, and we'll also figure out a simpler way to" "write it than we have been using so far." "And based on that will work out gradient descent, and that will give us our logistic regression algorithm." "In this video, we'll figure out a slightly simpler way to write the cost function than we have been using so far." "And we'll also figure out how to apply gradient descent to fit the parameters of logistic regression." "So by the end of this video you know how to implement a fully working version of logistic regression." "Here's our cost function for logistic regression." "Our overall cost function is 1 over M times sum of the training set of the cost of making different predictions on the different examples of labels Y I. And this is a cost for a single example that we worked out earlier." "And I just want to remind you that for classification problems in our training and in fact even for examples known in our training set, Y is always equal to 0 or 1." "Right?" "That's sort of part of the mathematical definition of Y." "Because Y is either 0 or 1." "We'll be able to come up with a simpler way to write this cost function." "And in particular, rather than writing out this cost function on two separate lines with two separate cases for Y equals 1 and Y equals" "0, I am going to show you a way take these two lines and compress them into one equation." "And this will make it more convenient to write out the cost function and derive gradient descent." "Concretely, we can write out the cost function as follows." "We'll say the cost of H of X comma Y. I'm going to write this as minus Y times log H of" "X minus 1 minus Y times log 1 minus H of X." "And I'll show you in a second that this expression, or this equation is an equivalent way or more compact way of writing out this definition of the cost function that we had up here." "Let's see why that's the case." "We know that there are only 2 possible cases." "Y must be 0 or 1." "So let's suppose Y equals 1." "If Y is equal to 1 then this equation is saying that the cost" "is equal to." "Well if Y is equal to one, then this thing here is equal to one." "And one minus Y is going to be equal to zero, right?" "So if Y is equal to one, then one minus Y is one minus one, which is therefore zero." "So the second term gets multiplied by zero and goes away, and we're left with only this first term which is Y times log, minus Y times log H of X. Y is" "1 so that's equal to minus log H of X." "And this equation is exactly what we have up here for if Y is equal to one." "The other case is if" "Y is equal to 0." "And if that is the case then, writing of the cost function is saying that if Y is equal to zero, then this term here, will be equal to zero." "Whereas 1 minus Y, if" "Y equals zero, would be equal to 1, because 1 minus" "Y becomes 1 minus 0, which is just equal to 1." "And so the cost function simplifies to just this last term here." "Right?" "Because the first term over here gets multiplied by zero, and so it disappears." 
"So we're just left with this last term, which is minus log, 1 minus H of" "X. And you can verify that this term here is just exactly what we had for when Y is equal to 0." "So this shows that this definition for the cost is just a more compact way of taking both of these expressions, the cases Y equals 1 and" "Y equals 0, and writing them in one, in a more convenient form with just one line." "We can, therefore, write all of our cost function for logistic regression as follows." "It is this one of m of the sum of this cost functions, and plugging in the definition for the cost that we worked out earlier, we end up with this." "And we just brought the minus sign outside." "And why do we choose this particular cost function?" "When it looks like there could be other cost functions that we could have chosen." "Although I won't have time to go into great detail of this in this course, this cost function can be derived from statistics using the principle of maximum likelihood estimation, which is an idea statistics for how to efficiently find parameters data for different models." "And it also has a nice property that it is convex." "So this is the cost function that, you know, essentially everyone uses when putting Logistic Regression models." "If we don't understand the terms" "I just say and you don't know what the principle of maximum likelihood estimation is, don't worry about." "There's just a deeper rational and justification behind this particular cost function then I have time to go into in this class." "Given this cost function, in order to fit the parameters," "what we're going to do then is try to find the parameters theta that minimizes J of theta." "So if we, you know, try to minimize this." "This would give us some set of parameters theta." "Finally, if we're given a new example with some set of features X. We can then take the thetas that we fit our training set and output our prediction as this, and just to remind you the output" "of my hypothesis, I am going to interpret as the probability that Y is equal to 1." "And this is given the implement X and parameters by theta." "But think of this as just my hypothesis is estimating the probability that Y is equal to 1." "So all that remains to be done is figure out how to actually minimize J of theta as a function of theta so we can actually fit the parameters to our training set." "The way we're going to minimize the cost function is using gradient descent." "Here's our cost function." "And if we want to minimize it as a function of theta." "Here's our usual template for gradient descent." "Where we repeatedly update each parameter by taking updating it as itself minus a learning rate alpha times this derivative term." "If you know some calculus feel free to take this term and try to compute a derivative yourself and see if you can simplify it to the same answer that I get." "But even if you don't know calculus don't worry about it." "If you actually compute this, what you get is this equation." "And just write it out here." "The sum from I equals 1 through M of the, essentially the error, times" "X I J. So if you take this partial derivative term and plug it back in here, we can then write out our gradient descent algorithm as follows." "And all I've done is I took the derivative term from the previous line and plugged it in there." 
"So if you have N features, you would have, you know, a parameter vector theta, which parameters theta zero, theta one, theta two, down to theta" "N and you will use this update to simultaneously update all of your values of theta." "Now if you take this update rule and compare it to what we were doing for linear regression, you might be surprised to realize that, well, this equation was exactly what we had for linear regression." "In fact, if you look at the earlier videos and look at the update rule, the gradient descent rule for linear regression, it looked exactly like what I drew here inside the blue box." "So are linear regression and logistic regression different algorithms or not?" "Well, this is resolved by observing that for logistic regression, what has changed is that the definition for this hypothesis has changed." "So whereas for linear regression we had H of X equals theta transpose X, now the definition of H of" "X has changed and is instead now 1 over 1 plus e to the negative theta transpose X." "So even though the update rule looks cosmetically identical, because the definition of the hypothesis has changed, this is actually not the same thing as gradient descent for linear regression." "In an earlier video, when we were talking about gradient descent for linear regression, we had talked about how to monitor gradient descent to make sure that it is converging." "I usually apply that same method to logistic regression too to monitor gradient descent to make sure it's conversion correctly." "And hopefully you can figure out how to apply that technique to logistic regression yourself." "When implementing logistic regression with gradient descent, we have all of these different parameter values, you know, theta" "0 down to theta N that we need to update using this expression." "And one thing we could do is have a for loop." "So for I equals 0 to" "N of i equals 1 to N plus 1." "So update each of these parameter values in turn." "But of course, rather than using a folder, ideally we would also use a vectorized implementation." "And so that a vectorized implementation can update, you know, all of these N plus" "1 parameters all in one fell swoop." "And to check your own understanding, you might see if you can figure out how to do the vectorized implementation of this algorithm as well." "So now you know how to implement gradient descents for logistic aggression." "There was one last idea that we had talked about earlier for which was feature scaling." "We saw how feature scaling can help gradient descents converge faster for linear regression." "The idea of feature scaling also applies to gradient descent for logistic regression." "And if you have features that are on very different scales." "Then applying feature scaling can also make it, gradient descent, run faster for logistic regression." "So, that's it." "You now know how to implement logistic regression, and this is a very powerful and probably even most widely used classification algorithm in the world." "And you now know how we get to work with yourself." "In the last video, we talked about gradient descent for minimizing the cost function J of theta for logistic regression." "In this video, I'd like to tell you about some advanced optimization algorithms and some advanced optimization concepts." "Using some of these ideas, we'll be able to get logistic regression" "to run much more quickly than it's possible with gradient descent." 
"And this will also let the algorithms scale much better to very large machine learning problems, such as if we had a very large number of features." "Here's an alternative view of what gradient descent is doing." "We have some cost function J and we want to minimize it." "So what we need to is, we need to write code that can take as input the parameters theta and they can compute two things:" "J of theta and these partial derivative terms for, you know, J equals 0, 1 up to N. Given code that can do these two things, what gradient descent does is it repeatedly performs the following update." "Right?" "So given the code that we wrote to compute these partial derivatives, gradient descent plugs in here and uses that to update our parameters theta." "So another way of thinking about gradient descent is that we need to supply code to compute J of theta and these derivatives, and then these get plugged into gradient descents, which can then try to minimize the function for us." "For gradient descent, I guess technically you don't actually need code to compute the cost function J of theta." "You only need code to compute the derivative terms." "But if you think of your code as also monitoring convergence of some such, we'll just think of ourselves as providing code to compute both the cost function and the derivative terms." "So, having written code to compute these two things, one algorithm we can use is gradient descent." "But gradient descent isn't the only algorithm we can use." "And there are other algorithms, more advanced, more sophisticated ones, that, if we only provide them a way to compute these two things, then these are different approaches to optimize the cost function for us." "So conjugate gradient BFGS and" "L-BFGS are examples of more sophisticated optimization algorithms that need a way to compute J of theta, and need a way to compute the derivatives, and can then use more sophisticated strategies than gradient descent to minimize the cost function." "The details of exactly what these three algorithms is well beyond the scope of this course." "And in fact you often end up spending, you know, many days, or a small number of weeks studying these algorithms." "If you take a class and advance the numerical computing." "But let me just tell you about some of their properties." "These three algorithms have a number of advantages." "One is that, with any of this algorithms you usually do not need to manually pick the learning rate alpha." "So one way to think of these algorithms is that given is the way to compute the derivative and a cost function." "You can think of these algorithms as having a clever inter-loop." "And, in fact, they have a clever inter-loop called a line search algorithm that automatically tries out different values for the learning rate alpha and automatically picks a good learning rate alpha so that it can even pick a different learning rate for every iteration." "And so then you don't need to choose it yourself." "These algorithms actually do more sophisticated things than just pick a good learning rate, and so they often end up converging much faster than gradient descent." "These algorithms actually do more sophisticated things than just pick a good learning rate, and so they often end up converging much faster than gradient descent, but detailed discussion of exactly what they do is beyond the scope of this course." 
"In fact, I actually used to have used these algorithms for a long time, like maybe over a decade, quite frequently, and it was only, you know, a few years ago that I actually figured out for myself the details" "of what conjugate gradient, BFGS and O-BFGS do." "So it is actually entirely possible to use these algorithms successfully and apply to lots of different learning problems without actually understanding the inter-loop of what these algorithms do." "If these algorithms have a disadvantage," "I'd say that the main disadvantage is that they're quite a lot more complex than gradient descent." "And in particular, you probably should not implement these algorithms" " conjugate gradient, L-BGFS, BFGS - yourself unless you're an expert in numerical computing." "Instead, just as I wouldn't recommend that you write your own code to compute square roots of numbers or to compute inverses of matrices, for these algorithms also what I would recommend you do is just use a software library." "So, you know, to take a square root what all of us do is use some function that someone else has written to compute the square roots of our numbers." "And fortunately, Octave and the closely related language MATLAB" " we'll be using that " "Octave has a very good." "Has a pretty reasonable library implementing some of these advanced optimization algorithms." "And so if you just use the built-in library, you know, you get pretty good results." "I should say that there is a difference between good and bad implementations of these algorithms." "And so, if you're using a different language for your machine learning application, if you're using" "C, C++, Java, and so on, you might want to try out a couple of different libraries to make sure that you find a good library for implementing these algorithms." "Because there is a difference in performance between a good implementation of, you know, contour gradient or" "LPFGS versus less good implementation of contour gradient or LPFGS." "So now let's explain how to use these algorithms, I'm going to do so with an example." "Let's say that you have a problem with two parameters" "equals theta zero and theta one." "And let's say your cost function is J of theta equals theta one minus five squared, plus theta two minus five squared." "So with this cost function." "You know the value for theta 1 and theta 2." "If you want to minimize J of theta as a function of theta." "The value that minimizes it is going to be theta 1 equals 5, theta 2 equals equals five." "Now, again, I know some of you know more calculus than others, but the derivatives of the cost function J turn out to be these two expressions." "I've done the calculus." "So if you want to apply one of the advanced optimization algorithms to minimize cost function J." "So, you know, if we didn't know the minimum was at" "5, 5, but if you want to have a cost function 5 the minimum numerically using something like gradient descent but preferably more advanced than gradient descent, what you would do is implement an octave function like this, so we" "implement a cost function, cost function theta function like that, and what this does is that it returns two arguments, the first J-val, is how" "we would compute the cost function" "J. And so this says J-val equals, you know, theta one minus five squared plus theta two minus five squared." "So it's just computing this cost function over here." "And the second argument that this function returns is gradient." 
"So gradient is going to be a two by one vector," "and the two elements of the gradient vector correspond to the two partial derivative terms over here." "Having implemented this cost function, you would, you can then" "call the advanced optimization function called the fminunc" " it stands for function minimization unconstrained in Octave" "and the way you call this is as follows." "You set a few options." "This is a options as a data structure that stores the options you want." "So grant up on, this sets the gradient objective parameter to on." "It just means you are indeed going to provide a gradient to this algorithm." "I'm going to set the maximum number of iterations to, let's say, one hundred." "We're going give it an initial guess for theta." "There's a 2 by 1 vector." "And then this command calls fminunc." "This at symbol presents a pointer to the cost function" "that we just defined up there." "And if you call this, this will compute, you know, will use one of the more advanced optimization algorithms." "And if you want to think it as just like gradient descent." "But automatically choosing the learning rate alpha for so you don't have to do so yourself." "But it will then attempt to use the sort of advanced optimization algorithms." "Like gradient descent on steroids." "To try to find the optimal value of theta for you." "Let me actually show you what this looks like in Octave." "So I've written this cost function of theta function exactly as we had it on the previous line." "It computes J-val which is the cost function." "And it computes the gradient with the two elements being the partial derivatives of the cost function with respect to, you know, the two parameters, theta one and theta two." "Now let's switch to my Octave window." "I'm gonna type in those commands I had just now." "So, options equals optimset." "This is the notation for setting my parameters on my options, for my optimization algorithm." "Grant option on, maxIter, 100 so that says 100 iterations, and I am going to provide the gradient to my algorithm." "Let's say initial theta equals zero's two by one." "So that's my initial guess for theta." "And now I have of theta, function val exit flag" "equals fminunc constraint." "A pointer to the cost function." "and provide my initial guess." "And the options like so." "And if I hit enter this will run the optimization algorithm." "And it returns pretty quickly." "This funny formatting that's because my line, you know, my" "code wrapped around." "So, this funny thing is just because my command line had wrapped around." "But what this says is that numerically renders, you know, think of it as gradient descent on steroids, they found the optimal value of a theta is theta 1 equals 5, theta 2 equals 5, exactly as we're hoping for." "The function value at the optimum is essentially 10 to the minus 30." "So that's essentially zero, which is also what we're hoping for." "And the exit flag is 1, and this shows what the convergence status of this." "And if you want you can do help fminunc to read the documentation for how to interpret the exit flag." "But the exit flag let's you verify whether or not this algorithm thing has converged." "So that's how you run these algorithms in Octave." "I should mention, by the way, that for the Octave implementation, this value of theta, your parameter vector of theta, must be in rd for d greater than or equal to 2." "So if theta is just a real number." 
"So, if it is not at least a two-dimensional vector or some higher than two-dimensional vector, this fminunc may not work, so and if in case you have a one-dimensional function that you use to optimize, you can look" "in the octave documentation for fminunc for additional details." "So, that's how we optimize our trial example of this simple quick driving cost function." "However, we apply this to let's just say progression." "In logistic regression we have a parameter vector theta, and" "I'm going to use a mix of octave notation and sort of math notation." "But I hope this explanation will be clear, but our parameter vector theta comprises these parameters theta 0 through theta n because octave indexes," "vectors using indexing from 1, you know, theta 0 is actually written theta 1 in octave, theta 1 is gonna be written." "So, if theta 2 in octave and that's gonna be a written theta n+1, right?" "And that's because Octave indexes is vectors starting from index of 1 and so the index of 0." "So what we need to do then is write a cost function that captures the cost function for logistic regression." "Concretely, the cost function needs to return J-val, which is, you know, J-val as you need some codes to compute J of theta and we also need to give it the gradient." "So, gradient 1 is going to be some code to compute the partial derivative in respect to theta 0, the next partial derivative respect to theta 1 and so on." "Once again, this is gradient" "1, gradient 2 and so on, rather than gradient 0, gradient" "1 because octave indexes is vectors starting from one rather than from zero." "But the main concept I hope you take away from this slide is, that what you need to do, is write a function that returns" "the cost function and returns the gradient." "And so in order to apply this to logistic regression or even to linear regression, if you want to use these optimization algorithms for linear regression." "What you need to do is plug in the appropriate code to compute these things over here." "So, now you know how to use these advanced optimization algorithms." "Because, using, because for these algorithms, you're using a sophisticated optimization library, it makes the just a little bit more opaque and so just maybe a little bit harder to debug." "But because these algorithms often run much faster than gradient descent, often quite typically whenever" "I have a large machine learning problem, I will use these algorithms instead of using gradient descent." "And with these ideas, hopefully, you'll be able to get logistic progression and also linear regression to work on much larger problems." "So, that's it for advanced optimization concepts." "And in the next and final video on Logistic Regression," "I want to tell you how to take the logistic regression algorithm that you already know about and make it work also on multi-class classification problems." "In this video we'll talk about how to get logistic regression to work for multi-class classification problems, and in particular I want to tell you about an algorithm called one-versus-all classification." "What's a multi-class classification problem?" "Here are some examples." "Let's say you want a learning algorithm to automatically put your email into different folders or to automatically tag your emails." "So, you might have different folders or different tags for work email, email from your friends, email from your family and emails about your hobby." 
"And so, here, we have a classification problem with 4 classes, which we might assign the numbers, the classes y1, y2, y3 and y4 to, Another example for a medical diagnosis: if a patient" "comes into your office with maybe a stuffy nose, the possible diagnoses could be that they're not ill, maybe that's y1; or they have a cold, 2; or they have the flu." "And the third and final example, if you are using machine learning to classify the weather, you know, maybe you want to decide that the weather is sunny, cloudy, rainy or snow, or if there's gonna be snow." "And so, in all of these examples, Y can take on a small number of discreet values, maybe 1 to" "3, 1 to 4 and so on, and these are multi-class classification problems." "And by the way, it doesn't really matter whether we index as" "0123 or as 1234, I tend to index that my classes starting from 1 rather than starting from 0." "But either way, where often, it really doesn't matter." "Whereas previously, for a binary classification problem, our data sets look like this." "For a multi-class classification problem, our data sets may look like this, where here, I'm using three different symbols to represent our three classes." "So, the question is:" "Given the data set with three classes where this is a the example of one class, that's the example of the different class, and, that's the example of yet, the third class." "How do we get a learning algorithm to work for the setting?" "We already know how to do binary classification, using logistic regression, we know how the, you know, maybe, for the straight line, to separate the positive and negative classes." "Using an idea called one versus all classification, we can then take this, and, make it work for multi-class classification, as well." "Here's how one versus all classification works." "And, this is also sometimes called "one versus rest."" "Let's say, we have a training set, like that shown on the left, where we have 3 classes." "So, if y1, we denote that with a triangle if y2 the square and, if y3 then, the cross." "What we're going to do is, take a training set, and, turn this into three separate binary classification problems." "So, I'll turn this into three separate two class classification problems." "So let's start with Class 1, which is a triangle." "We are going to essentially create a new, sort of fake training set." "where classes 2 and 3 get assigned to the negative class and class 1 gets assigned to the positive class when we create a new training set if that's showing on the right and we're going to fit a classifier, which I'm" "going to call h subscript theta superscript 1 of x where here, the triangles are the positive examples and the circles are the negative examples." "So, think of the triangles be assigned the value of 1 and the circles the sum, the value of zero." "And we're just going to train a standard logistic regression crossfire" "and maybe that will give us a position boundary." "OK?" "The superscript 1 here is the class one." "So, we're doing this for the triangle first class." "Next, we do the same thing for class 2." "Going to take the squares and assign the squares as the positive class and assign every thing else the triangles and the crosses as the negative class." "and then we fit a second logistic regression classifier." "I'm gonna call this H of X superscript 2, where the superscript 2 denotes that we're now doing this: treating the square class as the positive class and maybe we get the classifier like that." 
"And finally, we do the same thing for the third class and fit a third classifier H superscript 3 of X and maybe this will give us a decision boundary or give us a classifier that separates the positive and negative examples like that." "So, to summarize, what we've done is we fit 3 classifiers." "So, for I equals 1 2 3 we'll fit a classifier" "H superscript I subscript theta of X, thus trying to estimate what is the probability that y is equal to class I given X and prioritize by theta." "Right?" "So, in the first instance, for this first one up here, this classifier with learning to by the triangle." "So it's thinking of the triangles as a positive class." "So, X superscript one is essentially trying to estimate what is the probability that the Y is equal to one, given" "X and parametrized by theta." "And similarly, this is treating, you know, the square class as a positive took pause so its trying to estimate the probability that y2 and so on." "So we now have 3 classifiers each of which was trained is one of the three crosses." "Just to summarize, what we've done is we've, we want to train a logistic regression classifier, H superscript I of X, for each plus i that predicts it's probably a y i." "Finally, to make a prediction when we give it a new input x and we want to make a prediction, we do is we just run Let's say three what run our 3 of our classifiers on the input" "x and we then pick the class i that maximizes the three." "So, we just you know, basically pick the classifier, pick whichever one of the three classifiers is most confident, or most enthusistically" "says that it thinks it has a right class." "So, whichever value of i gives us the highest probability, we then predict y to be that value." "So, that's it for multi-class classification and one-versus-all method." "And with this little method you can now take the logistic regression classifier and make it work on multi-class classification problems as well." "In the last video, we talked about the recommender system problem, where for example, you may have a set of movies and you may have a set of users, each of whom has rated some subset of the movies," "rated the movies 1 to 5 stars or 0 to 5 stars, and what I would like to do, is look at these users and predict how they would have rated other movies that they have not yet rated." "In this video, I would like to talk about our first approach to building a recommender system, this approach is called content based recommendations." "Here's our data set from before, and just to remind you of a bit of notation, I was using" "Nu to denote the number of users, and so that's equal to 4, and Nm to denote the number of movies, I have five movies." "So, how do I predict what these missing values would be?" "Let's suppose that for each of these movies, I have a set of features for them." "In particular, lets say that for each of the movies I have two features," "which I'm going to denote X1 and" "X2, where X1 measures the degree to which a movie is a romantic movie and X2 measures the degree to which a movie is an action movie." "So if you take a movie" "Love at last, you know, 0.9 rating on the romance scale, it is a highly romantic movie but zero on the action scale, so almost no action in that movie." "Romance forever was 1.0, lot of romance and 0.01 action," "I don't know maybe there's a minor car crash in that movie or something, so little bit of action." "Skipping one let's do" "Swords vs,. 
"In the last video, we talked about the recommender system problem, where, for example, you may have a set of movies and a set of users, each of whom has rated some subset of the movies, rating them 1 to 5 stars or 0 to 5 stars, and what we would like to do is look at these users and predict how they would have rated other movies that they have not yet rated." "In this video, I would like to talk about our first approach to building a recommender system; this approach is called content-based recommendations." "Here's our data set from before, and just to remind you of a bit of notation, I was using Nu to denote the number of users, so that's equal to 4, and Nm to denote the number of movies; I have five movies." "So, how do I predict what these missing values would be?" "Let's suppose that for each of these movies, I have a set of features for them." "In particular, let's say that for each of the movies I have two features, which I'm going to denote X1 and X2, where X1 measures the degree to which a movie is a romantic movie, and X2 measures the degree to which a movie is an action movie." "So if you take the movie Love at last, it has a 0.9 rating on the romance scale, so it is a highly romantic movie, but zero on the action scale, so almost no action in that movie." "Romance forever is a 1.0, a lot of romance, and 0.01 action; I don't know, maybe there's a minor car crash in that movie or something, so a little bit of action." "Skipping one, let's do Swords vs. karate: maybe that has a zero romance rating, no romance at all, but plenty of action." "Nonstop car chases: maybe again there is a tiny bit of romance in that movie, but it's mainly action." "And Cute puppies of love is again mainly a romance movie, with no action at all." "So if we have features like these, then each movie can be represented with a feature vector." "Let's take movie 1, and call these movies 1, 2, 3, 4, and 5." "For my first movie, Love at last, I have my two features, 0.9 and 0; these are features X1 and X2, and let's add an extra feature as usual, which is my intercept feature X0, which is equal to 1." "And so, putting these together, I would then have a feature vector X superscript 1, where the superscript 1 denotes that it's the feature vector for my first movie: the first element is the intercept term 1, and then my two features, 0.9 and 0, like so." "So, for Love at last I would have a feature vector X1; for the movie Romance forever we have a separate feature vector X2; and so on, and for Swords vs. karate I would have a different feature vector X superscript 5." "Also, consistent with the notation we were using earlier, we're going to set N to be the number of features, not counting this X0 intercept term; so N is equal to two, because we have two features, X1 and X2, capturing the degree of romance and the degree of action in each movie." "Now, in order to make predictions, here is one thing we could do: we could treat predicting the ratings of each user as a separate linear regression problem." "So, specifically, let's say that for each user j we are going to learn a parameter vector theta j, which would be in R3 in this case; more generally, theta j would be in R(n+1), where N is the number of features not counting the intercept term, and we're going to predict user j's rating of movie i as just the inner product between the parameter vector theta j and the features X i." "So, let's take a specific example; let's take user one, so that would be Alice, and associated with Alice will be some parameter vector theta 1; our second user, Bob, will be associated with a different parameter vector theta 2; Carol will be associated with a different parameter vector theta 3; and Dave with a different parameter vector theta 4." "So let's say we want to make a prediction for what Alice will think of the movie Cute puppies of love." "Well, that movie is going to have some feature vector X3, where X3 is equal to 1, which is the intercept term, then 0.99, then 0." "And let's say, for this example, that we have somehow already gotten a parameter vector theta 1 for Alice; we will say later exactly how we come up with this parameter vector, but let's just say for now that some unspecified learning algorithm has learned the parameter vector theta 1, and it is equal to 0, 5, 0." "And so our prediction for this entry is going to be theta 1, that is, Alice's parameter vector, transpose X3, that is, the feature vector for the Cute puppies of love movie, number 3." "And the inner product between these two vectors is going to be 5 times 0.99, which is equal to 4.95." "So my prediction for this value over here is going to be 4.95, and maybe that seems like a reasonable value, if indeed this is my parameter vector theta 1."
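"The arithmetic just described is a one-line inner product; as a sketch in Octave:"

    theta1 = [0; 5; 0];         % Alice's (assumed) parameter vector
    x3 = [1; 0.99; 0];          % feature vector for Cute puppies of love
    prediction = theta1' * x3   % = 0*1 + 5*0.99 + 0*0 = 4.95 stars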
"So all we doing here is we are applying a different copy of essentially linear regression for each user and we are saying that what Alice does, is" "Alice has seem some parameter vector theta 1 that she uses," "that we use to predict her ratings as a function of how romantic and how action packed the movie is and Bob, and Carol, and" "Dave each of them have a different linear function of the romantic-ness and action-ness or the degree of romance and the degree of action" "in a movie, and that that is how we're going to predict their star ratings." "More formally here is how we can write down the problem." "Our notation is that RIJ is equal to one, if user J has rated movie I, and YIJ is the rating" "of that movie if that rating exists." "That is if that user has actually rated that movie." "And on the previous slide we also defined theta J which is a parameter for each user XI which is a feature vector for specific movie and for each user and each movie you would predict that rating, as follows." "So let me introduce, just temporarily, introduce one extra bit of notation mj, we are gonna use mj to denote the number of users rated by movie j, we're gonna need this notation only for this slide." "Now, in order to learn the parameter vector for theta j, well, how can we do so?" "This is basically a linear regression problem." "So what we can do, is just choose a parameter vector, theta j, so the predicted value here are as close as possible to the values that we observed in our training set, the values we observed in our data." "So, let's write that down." "In order to learn the parameter vector theta j, let's minimize over my parameter vector theta j, of sum" "and I want to sum over all movies that user j has rated--so we write this as sum over all values of i that is a colon rij equals 1." "So the way to read this summation index is this is summation over all the values of i, so that r i j is equal to 1." "So this is going to be summing over all the movies that user j has rated." "And then I am going to compute theta j" "transpose xi so that's the prediction of user j's rating on movie i, minus y i j, so that's the actual observed rating squared," "and then, let me just divide by the number of movies that user J, has actually rated, so just divide by 1 over 2MJ." "And so this is just like the least squares regression, it's just like linear regression where we want to choose the parameter vector theta J, to minimize this type of squared error term." "And if you want to, you can also add in a regularization term so plus lambda over 2m, and" "this is really 2MJ because, this as if we have MJ examples right?" "Because if user J has rated that many movies, it's sort of like we have that many data points with which to fit the parameters theta J. And then let me add in my usual regularization term here of" "theta J K squared." "As usual this sum is from" "K equals 1 through N so here theta J is going to be an N plus" "1 dimensional vector where, in our early example, n was equal to two, but more generally, n is the number of features we have per movie." "And so as usual we don't regularize over theta 0." "We don't regularize over the biased term because the sum is from K1 through N. If you minimize this as a function of theta J you get a good solution, you get a pretty good estimate of a parameter vector theta j" "with which to make the predictions for user J's movie ratings." "For recommender systems, we're going to change this notation a little bit." 
"So to simplify the subsequent math," "I'm actually going to get rid of this term MJ." "So that's just a constant right so I can delete it without changing the value of theta J that" "I get out of this optimization, so if you imagine taking this whole equation, taking this whole expression and multiplying it by" "MJ and get rid of that constant, and when" "I minimize this I should still get the same value of theta J as before." "So, just to repeat what we wrote on the previous slide, here is our optimization objective:" "In order to learn theta J, which is a parameter for user J, we're going to minimize over theta j this optimization objectives." "So this is our usual squared error term and then this is our regularization term." "Now of course in building a recommender system we don't just want to learn parameters for a single user, we want to learn parameters for all of our users, I have n subscript u users, so I want to" "learn all of these parameters and so what I'm going to do is take this minimization, take this optimization objective and just add an extra summation there." "So, you know, this expression here with the one half on top again, so it's exactly the same as what we have on top except that now, instead of just doing this for a specific user theta" "J, I'm going to sum my objective over all of my users and then minimize this overall optimization objective." "Minimize this overall cost function." "And when I minimize this as a function of theta 1, theta 2, up to theta NU, I will get a separate parameter vector each user and" "I can then use that to make predictions for all of my users for all of my N subscript u users." "So putting everything together, this was our optimization objective on top and to give this thing a name, I'll just call this" "J of theta 1, dot, dot, dot theta NU." "So J as usual is my optimization objective which I'm trying to minimize." "Next, in order to actually do the minimization, if you were to derive the gradient descent updates, these are the equations you would get," "so you would take theta" "JK and subtract from it alpha, which is the learning rate, times these terms here on the right." "So we have slightly different cases so when K equals 0 and when K is not equal to 0, because our regularization term here regularizes only the values of theta JK for" "K not equal to zero." "So we don't regularize theta 0 so the slightly different updates for k equals zero, and k not equal to 0." "And this term, over here, for example is just a partial derivative with respect to your parameter, that of your" "optimization objective, right?" "And so, this is just gradient descent and I've already computed the derivatives and plugged them into here." "If these gradient descent updates look a lot like what we had for linear regression, that's because these are essentially the same as linear regression." "The only minor difference is that for linear regression we have these 1 over M terms" " it's really 1 over MJ - but because earlier when we were deriving the optimization objective we got rid of this, that's why we don't have this 1 over M term." "But otherwise it's really sum over my training examples of, you know, the error times" "XK plus that regularization term contributes to the derivative." 
"So if you are using gradient descent, here is how you can minimize the cost function j, to learn all the parameters, and using these formulas for the derivatives, if you want, you can also plug them" "into a more advanced optimization algorithm like cluster gradient or" "LBFGS or what have you, and use that to try to minimize the cost function J as well." "So hopefully you now know how you can apply essentially a variation on linear regression in order to predict different movie ratings by different users." "This particular algorithm is called a content based recommendations, or content based approach because we assume that we have available to us, features for the different movies." "So we have features that capture what is the content of these movies." "How romantic is this movie?" "How much action is in this movie?" "And we are really using features of the content of the movies to make our predictions." "But for many movies we don't actually have such features, or it may be very difficult to get such features for all of our movies, for all of whatever items we are trying to sell." "So in the next video, we'll start to talk about an approach to recommender systems that isn't content based and does not assume that we have someone else giving us all of these features, for all of the movies in our data set." "By now, you've seen a couple different learning algorithms, linear regression and logistic regression." "They work well for many problems, but when you apply them to certain machine learning applications, they can run into a problem called overfitting that can cause them to perform very poorly." "What I'd like to do in this video is explain to you what is this overfitting problem, and in the next few videos after this, we'll talk about a technique called regularization, that will allow us to ameliorate or to" "reduce this overfitting problem and get these learning algorithms to maybe work much better." "So what is overfitting?" "Let's keep using our running example of predicting housing prices with linear regression where we want to predict the price as a function of the size of the house." "One thing we could do is fit a linear function to this data, and if we do that, maybe we get that sort of straight line fit to the data." "But this isn't a very good model." "Looking at the data, it seems pretty clear that as the size of the housing increases, the housing prices plateau, or kind of flattens out as we move to the right and so this algorithm does not fit the training and we" "call this problem underfitting, and another term for this is that this algorithm has high bias." "Both of these roughly mean that it's just not even fitting the training data very well." "The term is kind of a historical or technical one, but the idea is that if a fitting a straight line to the data, then, it's as if the algorithm has a very strong preconception, or a" "very strong bias that housing prices are going to vary linearly with their size and despite the data to the contrary." "Despite the evidence of the contrary is preconceptions still are bias, still closes it to fit a straight line and this ends up being a poor fit to the data." "Now, in the middle, we could fit a quadratic functions enter and, with this data set, we fit the quadratic function, maybe, we get that kind of curve and, that works pretty well." "And, at the other extreme, would be if we were to fit, say a fourth other polynomial to the data." 
"So, here we have five parameters, theta zero through theta four, and, with that, we can actually fill a curve that process through all five of our training examples." "You might get a curve that looks like this." "That, on the one hand, seems to do a very good job fitting the training set and, that is processed through all of my data, at least." "But, this is still a very wiggly curve, right?" "So, it's going up and down all over the place, and, we don't actually think that's such a good model for predicting housing prices." "So, this problem we call overfitting, and, another term for this is that this algorithm has high variance.." "The term high variance is another historical or technical one." "But, the intuition is that, if we're fitting such a high order polynomial, then, the hypothesis can fit, you know, it's almost as if it can fit almost any function and this face of possible hypothesis" "is just too large, it's too variable." "And we don't have enough data to constrain it to give us a good hypothesis so that's called overfitting." "And in the middle, there isn't really a name but I'm just going to write, you know, just right." "Where a second degree polynomial, quadratic function seems to be just right for fitting this data." "To recap a bit the problem of over fitting comes when if we have too many features, then to learn hypothesis may fit the training side very well." "So, your cost function may actually be very close to zero or may be even zero exactly, but you may then end up with a curve like this that, you know tries too hard to fit the training set, so that it" "even fails to generalize to new examples and fails to predict prices on new examples as well, and here the term generalized refers to how well a hypothesis applies even to new examples." "That is to data to houses that it has not seen in the data center." "On this slide, we looked at over fitting for the case of linear regression." "A similar thing can apply to logistic regression as well." "Here is a logistic regression example with two features X1 and x2." "One thing we could do, is fit logistic regression with just a simple hypothesis like this, where, as usual, G is my sigmoid function." "And if you do that, you end up with a hypothesis, trying to use, maybe, just a straight line to separate the positive and the negative examples." "And this doesn't look like a very good fit to the hypothesis." "So, once again, this is an example of underfitting or of the hypothesis having high bias." "In contrast, if you were to add to your features these quadratic terms, then, you could get a decision boundary that might look more like this." "And, you know, that's a pretty good fit to the data." "Probably, about as good as we could get, on this training set." "And, finally, at the other extreme, if you were to fit a very high-order polynomial, if you were to generate lots of high-order polynomial terms of speeches, then, logistical regression may contort itself, may try really" "hard to find a decision boundary that fits your training data or go to great lengths to contort itself, to fit every single training example well." "And, you know, if the features X1 and" "X2 offer predicting, maybe, the cancer to the, you know, cancer is a malignant, benign breast tumors." "This doesn't, this really doesn't look like a very good hypothesis, for making predictions." 
"And so, once again, this is an instance of overfitting and, of a hypothesis having high variance and not really, and, being unlikely to generalize well to new examples." "Later, in this course, when we talk about debugging and diagnosing things that can go wrong with learning algorithms, we'll give you specific tools to recognize when overfitting and, also, when underfitting may be occurring." "But, for now, lets talk about the problem of, if we think overfitting is occurring, what can we do to address it?" "In the previous examples, we had one or two dimensional data so, we could just plot the hypothesis and see what was going on and select the appropriate degree polynomial." "So, earlier for the housing prices example, we could just plot the hypothesis and, you know, maybe see that it was fitting the sort of very wiggly function that goes all over the place to predict housing prices." "And we could then use figures like these to select an appropriate degree polynomial." "So plotting the hypothesis, could be one way to try to decide what degree polynomial to use." "But that doesn't always work." "And, in fact more often we may have learning problems that where we just have a lot of features." "And there is not just a matter of selecting what degree polynomial." "And, in fact, when we have so many features, it also becomes much harder to plot the data and it becomes much harder to visualize it, to decide what features to keep or not." "So concretely, if we're trying predict housing prices sometimes we can just have a lot of different features." "And all of these features seem, you know, maybe they seem kind of useful." "But, if we have a lot of features, and, very little training data, then, over fitting can become a problem." "In order to address over fitting, there are two main options for things that we can do." "The first option is, to try to reduce the number of features." "Concretely, one thing we could do is manually look through the list of features, and, use that to try to decide which are the more important features, and, therefore, which are the features we should" "keep, and, which are the features we should throw out." "Later in this course, where also talk about model selection algorithms." "Which are algorithms for automatically deciding which features to keep and, which features to throw out." "This idea of reducing the number of features can work well, and, can reduce over fitting." "And, when we talk about model selection, we'll go into this in much greater depth." "But, the disadvantage is that, by throwing away some of the features, is also throwing away some of the information you have about the problem." "For example, maybe, all of those features are actually useful for predicting the price of a house, so, maybe, we don't actually want to throw some of our information or throw some of our features away." "The second option, which we'll talk about in the next few videos, is regularization." "Here, we're going to keep all the features, but we're going to reduce the magnitude" "or the values of the parameters theta J. And, this method works well, we'll see, when we have a lot of features, each of which contributes a little bit to predicting the value of Y, like we" "saw in the housing price prediction example." "Where we could have a lot of features, each of which are, you know, somewhat useful, so, maybe, we don't want to throw them away." "So, this subscribes the idea of regularization at a very high level." 
"And, I realize that, all of these details probably don't make sense to you yet." "But, in the next video, we'll start to formulate exactly how to apply regularization and, exactly what regularization means." "And, then we'll start to figure out, how to use this, to make how learning algorithms work well and avoid overfitting." "In this video, I'd like to convey to you, the main intuitions behind how regularization works." "And, we'll also write down the cost function that we'll use, when we were using regularization." "With the hand drawn examples that we have on these slides, I think I'll be able to convey part of the intuition." "But, an even better way to see for yourself, how regularization works, is if you implement it, and, see it work for yourself." "And, if you do the appropriate exercises after this, you get the chance to self see regularization in action for yourself." "So, here is the intuition." "In the previous video, we saw that, if we were to fit a quadratic function to this data, it gives us a pretty good fit to the data." "Whereas, if we were to fit an overly high order degree polynomial, we end up with a curve that may fit the training set very well, but, really not be a, but overfit the data poorly, and, not generalize well." "Consider the following, suppose we were to penalize, and, make the parameters theta 3 and theta 4 really small." "Here's what I mean, here is our optimization" "objective, or here is our optimization problem, where we minimize our usual squared error cause function." "Let's say I take this objective and modify it and add to it, plus 1000 theta" "3 squared, plus 1000 theta 4 squared." "1000 I am just writing down as some huge number." "Now, if we were to minimize this function, the only way to make this new cost function small is if theta 3 and data" "4 are small, right?" "Because otherwise, if you have a thousand times theta 3, this new cost functions gonna be big." "So when we minimize this new function we are going to end up with theta 3 close to 0 and theta" "4 close to 0, and as if we're getting rid of these two terms over there." "And if we do that, well then, if theta 3 and theta 4 close to 0 then we are being left with a quadratic function, and, so, we end up with a fit to the data, that's, you know, quadratic" "function plus maybe, tiny contributions from small terms, theta 3, theta 4, that they may be very close to 0." "And, so, we end up with essentially, a quadratic function, which is good." "Because this is a much better hypothesis." "In this particular example, we looked at the effect of penalizing two of the parameter values being large." "More generally, here is the idea behind regularization." "The idea is that, if we have small values for the parameters, then, having small values for the parameters," "will somehow, will usually correspond to having a simpler hypothesis." "So, for our last example, we penalize just theta 3 and theta 4 and when both of these were close to zero, we wound up with a much simpler hypothesis that was essentially a quadratic function." "But more broadly, if we penalize all the parameters usually that, we can think of that, as trying to give us a simpler hypothesis as well because when, you know, these parameters are as close as you in this" "example, that gave us a quadratic function." "But more generally, it is possible to show that having smaller values of the parameters" "corresponds to usually smoother functions as well for the simpler." "And which are therefore, also, less prone to overfitting." 
"I realize that the reasoning for why having all the parameters be small." "Why that corresponds to a simpler hypothesis;" "I realize that reasoning may not be entirely clear to you right now." "And it is kind of hard to explain unless you implement yourself and see it for yourself." "But I hope that the example of having theta 3 and theta" "4 be small and how that gave us a simpler hypothesis, I hope that helps explain why, at least give some intuition as to why this might be true." "Lets look at the specific example." "For housing price prediction we may have our hundred features that we talked about where may be x1 is the size, x2 is the number of bedrooms, x3 is the number of floors and so on." "And we may we may have a hundred features." "And unlike the polynomial example, we don't know, right, we don't know that theta 3, theta 4, are the high order polynomial terms." "So, if we have just a bag, if we have just a set of a hundred features, it's hard to pick in advance which are the ones that are less likely to be relevant." "So we have a hundred or a hundred one parameters." "And we don't know which ones to pick, we don't know which parameters to try to pick, to try to shrink." "So, in regularization, what we're going to do, is take our cost function, here's my cost function for linear regression." "And what I'm going to do is, modify this cost function to shrink all of my parameters, because, you know," "I don't know which one or two to try to shrink." "So I am going to modify my cost function to add a term at the end." "Like so we have square brackets here as well." "When I add an extra regularization term at the end to shrink every single parameter and so this term we tend to shrink all of my parameters theta 1, theta 2, theta 3 up to theta 100." "By the way, by convention the summation here starts from one so I am not actually going penalize theta zero being large." "That sort of the convention that, the sum I equals one through" "N, rather than I equals zero through N. But in practice, it makes very little difference, and, whether you include, you know, theta zero or not, in practice, make very little difference to the results." "But by convention, usually, we regularize only theta through theta" "100." "Writing down our regularized optimization objective, our regularized cost function again." "Here it is." "Here's J of theta where, this term on the right is a regularization term and lambda here is called the regularization parameter and what lambda does, is it controls a trade off between two different goals." "The first goal, capture it by the first goal objective, is that we would like to train, is that we would like to fit the training data well." "We would like to fit the training set well." "And the second goal is, we want to keep the parameters small, and that's captured by the second term, by the regularization objective." "And by the regularization term." "And what lambda, the regularization parameter does is the controls the trade of between these two goals, between the goal of fitting the training set well and the goal of keeping the parameter plan small and therefore keeping the hypothesis relatively" "simple to avoid overfitting." "For our housing price prediction example, whereas, previously, if we had fit a very high order polynomial, we may have wound up with a very, sort of wiggly or curvy function like this." 
"If you still fit a high order polynomial with all the polynomial features in there, but instead, you just make sure, to use this sole of regularized objective, then what you can get out is in fact a curve that isn't" "quite a quadratic function, but is much smoother and much simpler and maybe a curve like the magenta line that, you know, gives a much better hypothesis for this data." "Once again, I realize it can be a bit difficult to see why strengthening the parameters can have this effect, but if you implement yourselves with regularization" "you will be able to see this effect firsthand." "In regularized linear regression, if the regularization parameter monitor is set to be very large, then what will happen is we will end up penalizing the parameters theta 1, theta" "2, theta 3, theta 4 very highly." "That is, if our hypothesis is this is one down at the bottom." "And if we end up penalizing theta 1, theta 2, theta" "3, theta 4 very heavily, then we end up with all of these parameters close to zero, right?" "Theta 1 will be close to zero; theta 2 will be close to zero." "Theta three and theta four will end up being close to zero." "And if we do that, it's as if we're getting rid of these terms in the hypothesis so that we're just left with a hypothesis" "that will say that." "It says that, well, housing prices are equal to theta zero," "and that is akin to fitting a flat horizontal straight line to the data." "And this is an example of underfitting, and in particular this hypothesis, this straight line it just fails to fit the training set well." "It's just a fat straight line, it doesn't go, you know, go near." "It doesn't go anywhere near most of the training examples." "And another way of saying this is that this hypothesis has too strong a preconception or too high bias that housing prices are just equal to theta zero, and despite the clear data to the contrary, you know chooses to fit a sort" "of, flat line, just a flat horizontal line." "I didn't draw that very well." "This just a horizontal flat line to the data." "So for regularization to work well, some care should be taken, to choose a good choice for the regularization parameter lambda as well." "And when we talk about multi-selection later in this course, we'll talk about a way, a variety of ways for automatically choosing the regularization parameter lambda as well." "So, that's the idea of the high regularization and the cost function reviews in order to use regularization In the next two videos, lets take these ideas and apply them to linear regression and to logistic regression, so that" "we can then get them to avoid overfitting." "For linear regression, we had previously worked out two learning algorithms, one based on gradient descent and one based on the normal equation." "In this video we will take those two algorithms and generalize them to the case of regularized linear regression." "Here's the optimization objective, that we came up with last time for regularized linear regression." "This first part is our usual, objective for linear regression, and we now have this additional regularization term, where londer is our regularization parameter, and we like to find parameters theta, that minimizes this cost function," "this regularized cost function, J of theta." "Previously, we were using gradient descent for the original" "cost function, without the regularization term, and we had the following algorithm for regular linear regression, without regularization." 
"We will repeatedly update the parameters theta J as follows for J equals 1,2 up through n." "Let me take this and just write the case for theta zero separately." "So, you know, I'm just gonna write the update for theta zero separately, then for the update for the parameters" "1, 2, 3, and so on up to n." "So, I haven't changed anything yet, right?" "This is just writing the update for theta zero separately from the updates from theta 1, theta" "2, theta 3, up to theta n." "And the reason I want to do this is you may remember that for our regularized linear regression," "we penalize the parameters theta 1, theta 2, and so on up to theta n, but we don't penalize theta zero." "So when we modify this algorithm for regularized linear regression, we're going to end up treating theta zero slightly differently." "Concretely, if we want to take this algorithm and modify it to use the regularized objective, all we need to do is take this term at the bottom and modify as follows." "We're gonna take this term and add minus londer M," "times theta J. And if you implement this, then you have gradient descent for trying to minimize the regularized cost function J of F theta, and concretely," "I'm not gonna do the calculus to prove it, but concretely if you look at this term, this term that's written is square brackets." "If you know calculus, it's possible to prove that that term is the partial derivative, with respect of" "J of theta, using the new definition of J of theta with the regularization term." "And somebody on this term up on top, which I guess I am drawing the salient box that's still the partial derivative respect of theta zero for J of theta." "If you look at the update for theta J, it's possible to show's something pretty interesting, concretely theta J gets updated as theta J, minus alpha times, and then you have this other term here that depends on theta J" "." "So if you group all the terms together that depending on theta J. We can show that this update can be written equivalently as follows and all I did was have, you know, theta J here is theta J times" "1 and this term is londer over M. There's also an alpha here, so you end up with alpha londer over m, multiply them to theta J and this term here, one minus alpha times londer M, is" "a pretty interesting term, it has a pretty interesting effect." "Concretely, this term one minus alpha times londer over" "M, is going to be a number that's, you know, usually a number that's a loop and less than 1, right?" "Because of alpha times londer over M is going to be positive and usually, if you're learning rate is small and M is large." "That's usually pretty small." "So this term here, it's going to be a number, it's usually, you know, a little bit less than one." "So think of it as a number like 0.99, let's say" "and so, the effect of our updates of theta J is we're going to say that theta J gets replaced by thetata J times 0.99." "Alright so theta J times 0.99 has the effect of shrinking theta J a little bit towards 0." "So this makes theta J a bit smaller." "More formally, this you know, this square norm of theta J is smaller and then after that the second term here, that's actually exactly the same as the original gradient descent updated that we had." "Before we added all this regularization stuff." 
"So, hopefully this gradient descent, hopefully this update makes sense, when we're using regularized linear regression what we're doing is on every regularization were multiplying data" "J by a number that is a little bit less than one, so we're shrinking the parameter a little bit, and then we're performing a, you know, similar update as before." "Of course that's just the intuition behind what this particular update is doing." "Mathematically, what it's doing is exactly gradient descent on the cost function J of theta that we defined on the previous slide that uses the regularization term." "Gradient descent was just one our two algorithms for" "fitting a linear regression model." "The second algorithm was the one based on the normal equation where, what we did was we created the design matrix "x" where each row corresponded to a separate training example." "And we created a vector" "Y, so this is a vector that is an" "M dimensional vector and that contain the labels from a training set." "So whereas X is an" "M by N plus 1 dimensional matrix." "Y is an M dimensional vector and in order to minimize the cost function change we found that of one way to do is to set theta to be equal to this." "We have X transpose X, inverse X transpose Y." "I am leaving room here, to fill in stuff of course." "And what this value for theta does, is this minimizes the cost function J of theta when we were not using regularization." "Now that we are using regularization, if you were to derive what the minimum is, and just to give you a sense of how to derive the minimum, the way you derive it is you know, take partial derivatives in respect" "to each parameter, set this to zero, and then do a bunch of math, and you can then show that is a formula like this that minimizes the cost function." "And concretely, if you are using regularization then this formula changes as follows." "Inside this parenthesis, you end up with a matrix like this." "Zero, one, one, one and so on, one until the bottom." "So this thing over here is a matrix, who's upper leftmost entry is zero." "There's ones on the diagonals and then the zeros everywhere else on this matrix." "Because I am drawing this a little bit sloppy." "But as a concrete example if N equals 2, then this matrix is going to be a three by three matrix." "More generally, this matrix is a N plus one by N plus one dimensional matrix." "So then N equals two then that matrix becomes something that looks like this." "Zero, and then ones on the diagonals, and then zeros on the rest of the diagonals." "And once again, you know, I'm not going to those this derivation." "Which is frankly somewhat long and involved." "But it is possible to prove that if you are using the new definition of" "J of theta, with the regularization objective." "Then this new formula for theta is the one that will give you the global minimum of J of theta." "So finally, I want to just quickly describe the issue of non-invertibility." "This is relatively advanced material." "So you should consider this as optional and feel free to skip it or if you listen to it and you know, possibly it don't really make sense, don't worry about it either." "But earlier when I talked the normal equation method." "We also had an optional video on the non-invertability issue." "So this is another optional part, that is sort of add on earlier optional video on non-invertibility." "Now considering setting where M the number of examples is less than or equal to N the number features." 
"If you have fewer examples then features then this matrix" "X transpose X will be non-invertible or singular, or the other term for this is the matrix will be degenerate and if you implement this in Octave anyway, and you use the" "P in function to take the psuedo inverse." "It will kind of do the right thing that is not clear that it will give you a very good hypothesis even though numerically the octave" "P in function will give you a result that kind of makes sense." "But, if you were doing this in a different language." "And if you were taking just the regular inverse" "which an octave is denoted with the function INV." "We're trying to take the regular inverse of X transpose X, then in this setting you find that X transpose X is singular, is non-invertible and if you're doing this in a different programming language and using some" "linear algebra library try to take the inverse of this matrix." "It just might not work because that matrix is non-invertible or singular." "Fortunately, regularization also takes care of this for us, and concretely, so long as the regularization parameter is strictly greater than zero." "It is actually possible to prove that this matrix X transpose X plus parameter time, you know,s this funny matrix here, is possible to prove that this matrix will not be singular and that this matrix will be invertible." "So using regularization also takes care of any non-invertibility issues of the X transpose X matrix as well." "So, you now know how to implement regularize linear regression." "Using this, you'll be able to avoid overfitting, even if you have lots of features in a relatively small training set." "And this should let you get linear regression to work much better for many problems." "In the next video, we'll take this regularization idea and apply it to logistic regression." "So that you'll be able to get logistic impression to avoid overfitting and perform much better as well." "For logistic regression, we previously talked about two types of optimization algorithms." "We talked about how to use gradient descent to optimize as cost function J of theta." "And we also talked about advanced optimization methods." "Ones that require that you provide a way to compute your cost function J of theta and that you provide a way to compute the derivatives." "In this video, we'll show how you can adapt both of those techniques, both gradient descent and the more advanced optimization techniques in order to have them work for regularized logistic regression." "So, here's the idea." "We saw earlier that Logistic" "Regression can also be prone to overfitting if you fit it with a very, sort of, high order polynomial features like this." "Where G is the sigmoid function and in particular you end up with a hypothesis, you know, whose decision bound to be just sort of an overly complex and extremely contortive function that really isn't such a great" "hypothesis for this training set, and more generally if you have logistic regression with a lot of features." "Not necessarily polynomial ones, but just with a lot of features you can end up with overfitting." "This was our cost function for logistic regression." "And if we want to modify it to use regularization, all we need to do is add to it the following term plus londer over 2M, sum from J equals 1, and as usual sum from J equals 1." "Rather than the sum from J equals 0, of theta J squared." "And this has to effect therefore, of penalizing the parameters theta 1 theta" "2 and so on up to theta N from being too large." 
"And if you do this, then it will the have the effect that even though you're fitting a very high order polynomial with a lot of parameters." "So long as you apply regularization and keep the parameters small you're more likely to get a decision boundary." "You know, that maybe looks more like this." "It looks more reasonable for separating the positive and the negative examples." "So, when using regularization even when you have a lot of features, the regularization can help take care of the overfitting problem." "How do we actually implement this?" "Well, for the original gradient descent algorithm, this was the update we had." "We will repeatedly perform the following update to theta J. This slide looks a lot like the previous one for linear regression." "But what I'm going to do is write the update for theta 0 separately." "So, the first line is for update for theta 0 and a second line is now my update for theta 1 up to theta N." "Because I'm going to treat theta 0 separately." "And in order to modify this algorithm, to use" "a regularized cos function, all I need to do is pretty similar to what we did for linear regression is actually to just modify this second update rule as follows." "And, once again, this, you know, cosmetically looks identical what we had for linear regression." "But of course is not the same algorithm as we had, because now the hypothesis is defined using this." "So this is not the same algorithm as regularized linear regression." "Because the hypothesis is different." "Even though this update that I wrote down." "It actually looks cosmetically the same as what we had earlier." "We're working out gradient descent for regularized linear regression." "And of course, just to wrap up this discussion, this term here in the square brackets, so this term here, this term is, of course, the new partial derivative for respect of theta J of the new cost function J of theta." "Where J of theta here is the cost function we defined on a previous slide that does use regularization." "So, that's gradient descent for regularized linear regression." "Let's talk about how to get regularized linear regression to work using the more advanced optimization methods." "And just to remind you for those methods what we needed to do was to define the function that's called the cost function, that takes us input the parameter vector theta and once again in the equations we've been writing here we used 0 index vectors." "So we had theta 0 up to theta N. But because Octave indexes the vectors starting from 1." "Theta 0 is written in Octave as theta 1." "Theta 1 is written in" "Octave as theta 2, and so on down to theta" "N plus 1." "And what we needed to do was provide a function." "Let's provide a function called cost function that we would then pass in to what we have, what we saw earlier." "We will use the fminunc and then you know at cost function," "and so on, right." "But the F min, u and c was the F min unconstrained and this will work with fminunc was what will take the cost function and minimize it for us." "So the two main things that the cost function needed to return were first J-val." "And for that, we need to write code to compute the cost function J of theta." "Now, when we're using regularized logistic regression, of course the cost function j of theta changes and, in particular," "now a cost function needs to include this additional regularization term at the end as well." "So, when you compute j of theta be sure to include that term at the end." 
"And then, the other thing that this cost function thing needs to derive with a gradient." "So gradient one needs to be set to the partial derivative of J of theta with respect to theta zero, gradient two needs to be set to that, and so on." "Once again, the index is off by one." "Right, because of the indexing from one Octave users." "And looking at these terms." "This term over here." "We actually worked this out on a previous slide is actually equal to this." "It doesn't change." "Because the derivative for theta zero doesn't change." "Compared to the version without regularization." "And the other terms do change." "And in particular the derivative respect to theta one." "We worked this out on the previous slide as well." "Is equal to, you know, the original term and then minus londer M times theta 1." "Just so we make sure we pass this correctly." "And we can add parentheses here." "Right, so the summation doesn't extend." "And similarly, you know, this other term here looks like this, with this additional term that we had on the previous slide, that corresponds to the gradient from their regularization objective." "So if you implement this cost function and pass this into fminunc or to one of those advanced optimization techniques, that will minimize the new regularized cost function J of theta." "And the parameters you get out will be the ones that correspond to logistic regression with regularization." "So, now you know how to implement regularized logistic regression." "When I walk around Silicon Valley," "I live here in Silicon Valley, there are a lot of engineers that are frankly, making a ton of money for their companies using machine learning algorithms." "And I know we've only been, you know, studying this stuff for a little while." "But if you understand linear regression, the advanced optimization algorithms and regularization, by now, frankly, you probably know quite a lot more machine learning than many, certainly now, but you probably know quite a lot more machine learning right now" "than frankly, many of the" "Silicon Valley engineers out there having very successful careers." "You know, making tons of money for the companies." "Or building products using machine learning algorithms." "So, congratulations." "You've actually come a long ways." "And you can actually, you actually know enough to apply this stuff and get to work for many problems." "So congratulations for that." "But of course, there's still a lot more that we want to teach you, and in the next set of videos after this, we'll start to talk about a very powerful cause of non-linear classifier." "So whereas linear regression, logistic regression, you know, you can form polynomial terms, but it turns out that there are much more powerful nonlinear quantifiers that can then sort of polynomial regression." "And in the next set of videos after this one, I'll start telling you about them." "So that you have even more powerful learning algorithms than you have now to apply to different problems." "In this and in the next set of videos, I'd like to tell you about a learning algorithm called a Neural Network." "We're going to first talk about the representation and then in the next set of videos talk about learning algorithms for it." "Neutral networks is actually a pretty old idea, but had fallen out of favor for a while." "But today, it is the state of the art technique for many different machine learning problems." "So why do we need yet another learning algorithm?" 
"We already have linear regression and we have logistic regression, so why do we need, you know, neural networks?" "In order to motivate the discussion of neural networks, let me start by showing you a few examples of machine learning problems where we need to learn complex non-linear hypotheses." "Consider a supervised learning classification problem where you have a training set like this." "If you want to apply logistic regression to this problem, one thing you could do is apply logistic regression with a lot of nonlinear features like that." "So here, g as usual is the sigmoid function, and we can include lots of polynomial terms like these." "And, if you include enough polynomial terms then, you know, maybe you can get a hypotheses" "that separates the positive and negative examples." "This particular method works well when you have only, say, two features - x1 and x2" " because you can then include all those polynomial terms of x1 and x2." "But for many interesting machine learning problems would have a lot more features than just two." "We've been talking for a while about housing prediction, and suppose you have a housing classification" "problem rather than a regression problem, like maybe if you have different features of a house, and you want to predict what are the odds that your house will be sold within the next six months, so that will be a classification problem." "And as we saw we can come up with quite a lot of features, maybe a hundred different features of different houses." "For a problem like this, if you were to include all the quadratic terms, all of these, even all of the quadratic that is the second or the polynomial terms, there would be a lot of them." "There would be terms like x1 squared, x1x2, x1x3, you know, x1x4" "up to x1x100 and then you have x2 squared, x2x3" "and so on." "And if you include just the second order terms, that is, the terms that are a product of, you know, two of these terms, x1 times x1 and so on, then, for the case of n equals" "100, you end up with about five thousand features." "And, asymptotically, the number of quadratic features grows roughly as order n squared, where n is the number of the original features, like x1 through x100 that we had." "And its actually closer to n squared over two." "So including all the quadratic features doesn't seem like it's maybe a good idea, because that is a lot of features and you might up overfitting the training set, and it can also be computationally expensive, you know, to" "be working with that many features." "One thing you could do is include only a subset of these, so if you include only the features x1 squared, x2 squared, x3 squared, up to maybe x100 squared, then the number of features is much smaller." "Here you have only 100 such quadratic features, but this is not enough features and certainly won't let you fit the data set like that on the upper left." "In fact, if you include only these quadratic features together with the original x1, and so on, up to x100 features, then you can actually fit very interesting hypotheses." "So, you can fit things like, you know, access a line of the ellipses like these, but" "you certainly cannot fit a more complex data set like that shown here." "So 5000 features seems like a lot, if you were to include the cubic, or third order known of each others, the x1, x2, x3." "You know, x1 squared, x2, x10 and x11, x17 and so on." "You can imagine there are gonna be a lot of these features." 
"In fact, they are going to be order and cube such features and if any is 100 you can compute that, you end up with on the order of about 170,000 such cubic features and so including these higher auto-polynomial features when" "your original feature set end is large this really dramatically blows up your feature space and this doesn't seem like a good way to come up with additional features with which to build none many classifiers when n is large." "For many machine learning problems, n will be pretty large." "Here's an example." "Let's consider the problem of computer vision." "And suppose you want to use machine learning to train a classifier to examine an image and tell us whether or not the image is a car." "Many people wonder why computer vision could be difficult." "I mean when you and I look at this picture it is so obvious what this is." "You wonder how is it that a learning algorithm could possibly fail to know what this picture is." "To understand why computer vision is hard let's zoom into a small part of the image like that area where the little red rectangle is." "It turns out that where you and I see a car, the computer sees that." "What it sees is this matrix, or this grid, of pixel intensity values that tells us the brightness of each pixel in the image." "So the computer vision problem is to look at this matrix of pixel intensity values, and tell us that these numbers represent the door handle of a car." "Concretely, when we use machine learning to build a car detector, what we do is we come up with a label training set, with, let's say, a few label examples of cars and a few" "label examples of things that are not cars, then we give our training set to the learning algorithm trained a classifier and then, you know, we may test it and show the new image and ask, "What is this new thing?"." "And hopefully it will recognize that that is a car." "To understand why we need nonlinear hypotheses, let's take a look at some of the images of cars and maybe non-cars that we might feed to our learning algorithm." "Let's pick a couple of pixel locations in our images, so that's pixel one location and pixel two location, and let's plot this car, you know, at the location, at a certain point, depending on the intensities" "of pixel one and pixel two." "And let's do this with a few other images." "So let's take a different example of the car and you know, look at the same two pixel locations" "and that image has a different intensity for pixel one and a different intensity for pixel two." "So, it ends up at a different location on the figure." "And then let's plot some negative examples as well." "That's a non-car, that's a non-car ." "And if we do this for more and more examples using the pluses to denote cars and minuses to denote non-cars, what we'll find is that the cars and non-cars end up lying in different regions of the space, and what we" "need therefore is some sort of non-linear hypotheses to try to separate out the two classes." "What is the dimension of the feature space?" "Suppose we were to use just 50 by 50 pixel images." "Now that suppose our images were pretty small ones, just 50 pixels on the side." 
"Then we would have 2500 pixels, and so the dimension of our feature size will be N equals 2500 where our feature vector x is a list of all the pixel testings, you know, the pixel brightness of pixel" "one, the brightness of pixel two, and so on down to the pixel brightness of the last pixel where, you know, in a typical computer representation, each of these may be values between say" "0 to 255 if it gives us the grayscale value." "So we have n equals 2500, and that's if we were using grayscale images." "If we were using RGB images with separate red, green and blue values, we would have n equals 7500." "So, if we were to try to learn a nonlinear hypothesis by including all the quadratic features, that is all the terms of the form, you know," "Xi times Xj, while with the 2500 pixels we would end up with a total of three million features." "And that's just too large to be reasonable; the computation would be very expensive to find and to represent all of these three million features per training example." "So, simple logistic regression together with adding in maybe the quadratic or the cubic features" " that's just not a good way to learn complex nonlinear hypotheses when n is large because you just end up with too many features." "In the next few videos, I would like to tell you about Neural" "Networks, which turns out to be a much better way to learn complex hypotheses, complex nonlinear hypotheses even when your input feature space, even when n is large." "And along the way I'll also get to show you a couple of fun videos of historically important applications" "of Neural networks as well that I hope those videos that we'll see later will be fun for you to watch as well." "Neural Networks are a pretty old algorithm that was originally motivated" "by the goal of having machines that can mimic the brain." "Now in this class, of course" "I'm teaching Neural Networks to you because they work really well for different machine learning problems and not, certainly not because they're logically motivated." "In this video, I'd like to give you some of the background on Neural Networks." "So that we can get a sense of what we can expect them to do." "Both in the sense of applying them to modern day machinery problems, as well as for those of you that might be interested in maybe the big AI dream of someday building truly intelligent machines." "Also, how Neural Networks might pertain to that." "The origins of Neural Networks was as algorithms that try to mimic the brain and those a sense that if we want to build learning systems while why not mimic perhaps the most amazing learning machine we know about, which is perhaps the" "brain." "Neural Networks came to be very widely used throughout the 1980's and 1990's and for various reasons as popularity diminished in the late" "90's." "But more recently," "Neural Networks have had a major recent resurgence." "One of the reasons for this resurgence is that Neural Networks are computationally some what more expensive algorithm and so, it was only, you know, maybe somewhat more recently that computers became fast enough to really run large scale" "Neural Networks and because of that as well as a few other technical reasons which we'll talk about later, modern Neural" "Networks today are the state of the art technique for many applications." "So, when you think about mimicking the brain while one of the human brain does tell me same things, right?" "The brain can learn to see process images than to hear, learn to process our sense of touch." 
"We can, you know, learn to do math, learn to do calculus, and the brain does so many different and amazing things." "It seems like if you want to mimic the brain it seems like you have to write lots of different pieces of software to mimic all of these different fascinating, amazing things that the brain tell us, but does" "is this fascinating hypothesis that the way the brain does all of these different things is not worth like a thousand different programs, but instead, the way the brain does it is worth just a single learning algorithm." "This is just a hypothesis but let me share with you some of the evidence for this." "This part of the brain, that little red part of the brain, is your auditory cortex and the way you're understanding my voice now is your ear is taking the sound signal and routing the sound signal to your auditory" "cortex and that's what's allowing you to understand my words." "Neuroscientists have done the following fascinating experiments where you cut the wire from the ears to the auditory cortex and you re-wire," "in this case an animal's brain, so that the signal from the eyes to the optic nerve eventually gets routed to the auditory cortex." "If you do this it turns out, the auditory cortex will learn" "to see." "And this is in every single sense of the word see as we know it." "So, if you do this to the animals, the animals can perform visual discrimination task and as they can look at images and make appropriate decisions based on the images and they're doing it with that piece of brain tissue." "Here's another example." "That red piece of brain tissue is your somatosensory cortex." "That's how you process your sense of touch." "If you do a similar re-wiring process then the somatosensory cortex will learn to see." "Because of this and other similar experiments, these are called neuro-rewiring experiments." "There's this sense that if the same piece of physical brain tissue can process sight or sound or touch then maybe there is one learning algorithm that can process sight or sound or touch." "And instead of needing to implement a thousand different programs or a thousand different algorithms to do, you know, the thousand wonderful things that the brain does, maybe what we need to do is figure out some approximation or to whatever" "the brain's learning algorithm is and implement that and that the brain learned by itself how to process these different types of data." "To a surprisingly large extent, it seems as if we can plug in almost any sensor to almost any part of the brain and so, within the reason, the brain will learn to deal with it." "Here are a few more examples." "On the upper left is an example of learning to see with your tongue." "The way it works is--this is actually a system called BrainPort undergoing" "FDA trials now to help blind people see--but the way it works is, you strap a grayscale camera to your forehead, facing forward, that takes the low resolution grayscale image of what's in front of you and you then run a wire" "to an array of electrodes that you place on your tongue so that each pixel gets mapped to a location on your tongue where maybe a high voltage corresponds to a dark pixel and a low voltage corresponds to a bright" "pixel and, even as it does today, with this sort of system you and I will be able to learn to see, you know, in tens of minutes with our tongues." "Here's a second example of human echo location or human sonar." "So there are two ways you can do this." "You can either snap your fingers, or click your tongue." 
"I can't do that very well." "But there are blind people today that are actually being trained in schools to do this and learn to interpret the pattern of sounds bouncing off your environment - that's sonar." "So, if after you search on YouTube, there are actually videos of this amazing kid who tragically because of cancer had his eyeballs removed, so this is a kid with no eyeballs." "But by snapping his fingers, he can walk around and never hit anything." "He can ride a skateboard." "He can shoot a basketball into a hoop and this is a kid with no eyeballs." "Third example is the" "Haptic Belt where if you have a strap around your waist, ring up buzzers and always have the northmost one buzzing." "You can give a human a direction sense similar to maybe how birds can, you know, sense where north is." "And, some of the bizarre example, but if you plug a third eye into a frog, the frog will learn to use that eye as well." "So, it's pretty amazing to what extent is as if you can plug in almost any sensor to the brain and the brain's learning algorithm will just figure out how to learn from that data and deal with that data." "And there's a sense that if we can figure out what the brain's learning algorithm is, and, you know, implement it or implement some approximation to that algorithm on a computer, maybe that would be our best shot" "at, you know, making real progress towards the AI, the artificial intelligence dream of someday building truly intelligent machines." "Now, of course, I'm not teaching Neural Networks, you know, just because they might give us a window into this far-off" "AI dream, even though I'm personally, that's one of the things that I personally work on in my research life." "But the main reason I'm teaching Neural Networks in this class is because it's actually a very effective state of the art technique for modern day machine learning applications." "So, in the next few videos, we'll start diving into the technical details of Neural" "Networks so that you can apply them to modern-day machine learning applications and get them to work well on problems." "But for me, you know, one of the reasons the excite me is that maybe they give us this window into what we might do if we're also thinking of what algorithms might someday be able to learn in a manner similar to humankind." "In this video, I want to start telling you about how we represent Neural Networks, in other words how we represent our hypotheses or how we represent our model when using your Neural Networks." "Neural Networks were developed at simulating neurons or networks of neurons in the brain." "So, to explain the hypotheses representation." "Let's start by looking at what a single neuron in the brain looks like." "Your brain and mine is jam-packed full of neurons like these and neurons are cells in the brain and the two things to draw attention to are that first that the neuron has a cell body like so and moreover, the" "neuron has a number of input wires and these are called the dendrites who think of them as input wires and these receive inputs from other locations and the neuron also has an output wire called the axon." "And this output wire is what it uses to send signal to other neurons or to send messages to other neurons." "So, at a simplistic level, what a neuron is is a computational unit that gets a number of inputs through its input wires, does some computation." "and then it sends outputs, via its axon to other nodes or other neurons in the brain." "Here's an illustration of a group of neurons." 
"The way that neurons communicate with each other is with little pulses of electricities." "They're also called spikes, but they're just means of little pulse of electricity." "So, here's one neuron and what it does is if it wants to send a message, what it does is it sends the little pulse of electricity via its axon to some difference neuron and here this axon." "There is this open wire." "Connects to the input wire or connects to the dendrite of this second neuron over here, which then accepts this incoming message does some computation and may in turn decide to send out its O messages on its axon to other neurons." "And this is the process by which all human thought happens as these neurons doing computations and passing messages to other neurons as a result of what other inputs they've got." "And by the way, this is how our senses and our muscles work as well." "If you want to move one of your muscles, the way that works is that a neuron may send these pulses of electricities" "to your muscle and that causes your muscles to contract and your eyes - if some sensor like your eye wants to send a message to your brain, what it does is it sends its pulses of electricity to a neuron in your brain like so." "In a neural network, or rather in an artificial neural network that we implement in a computer, we're going to use a very simple model of what a neuron does." "We're going to model a neuron as just a logistic unit." "So, when I draw a yellow circle like that, you should hink of that as playing a role analogous to maybe the body of a neuron, and we then feed the neuron a few inputs via its dendrites or" "its input wires and the neuron does some computation and output some value on this output wire or in a biological neuron that sorts the axon and whenever I draw a diagram like this, what this means is that this represents" "a computations of, you know, h of x equals 1 over 1 + e to the negative theta transpose x where, as usual, x and theta are our parameter vectors like so." "So, this is a very simple maybe a vastly over simplified model of the computation that the neuron does where it gets the number of inputs, x1, x2, x3 and it outputs some value computed like so." "When I draw a neural network, usually I draw only the input nose x1, x2, x3," "sometimes when it's useful to do so." "I draw an extra node for x zero." "This x zero node is sometimes called the bias unit or the bias neuron but because x0 is already equal to 1." "Sometimes, I draw with it, sometimes" "I won't just depending on whether there's more the rotationally convenient for that example." "Finally, one last bit of terminology when we talk about neural networks, sometimes we'll say that this is a neuron - an artificial neuron with a sigmoid or a logistic activation function." "So this activation function in the neuronetro terminology, this is just another term for that function for that non-linearity g of z, equals 1 over 1 plus e to the negative z." "And whereas so far" "I've been calling theta the parameters of the model are mostly continued to use that terminology to conjugate to the parameters, but the neural networks." "In the neural networks literature and sometimes you might hear people talk about weights of a model and weights just means exactly the same thing as parameters of the model." "But almost to use the terminology parameters in these videos, but sometimes you may hear others use the weights terminology." "So, this little diagram represents a single neuron." 
"What a neural network is" "Is just a proof of these different neurons strung together." "Concretely, here we have input units x1, x2, and x3 and once again, sometimes can draw this extra node x0 or sometimes not." "So, I just draw that in here." "And here we have three neurons, which I have written, you know, a(2)1, a(2)2 and a(2)3 around top bottles indices later and once again," "we can if we want adding this a0 and add an extra bias unit there." "It always outputs the value of 1." "Then finally we have this third node at the final layer, and it's this third node that opens the value that the hypotheses h of x computes." "To introduce a bit more terminology in a neural network, the first layer, this is also called the input layer because this is where we input our features, x1 x2 x3." "The final layer is also called the output layer because that layer has the neuron - this one over here - that outputs the final value computed by a hypotheses and then layer two in between, this is called the hidden layer." "The term hidden layer isn't a great terminology, but the intuition is that, you know, in supervised learning where you get to see the inputs, and you get to see the correct outputs." "Whereas the hidden layer are values you don't get to observe in the training set." "If it's not x and it's not y and so we call those hidden." "and later on we'll see neural networks with more than one hidden layer, but in this example we have one input layer, layer 1; one hidden layer, layer 2; and one output layer, layer 3." "But basically anything that isn't an input layer and isn't a output layer is called a hidden layer." "So, I want to be really clear about what this neural network is doing." "Let's step through the computational steps that are embodied by this, represented by this diagram." "To explain the specific computations represented by a neural network, here's a little bit more notation." "I'm going to use a superscript j subscript i to denote the activation of neuron i or of unit i in layer j." "So concretely, this a superscript 2 subscript 1 does the activation of the first unit in layer 2, in our hidden layer." "And by activation, I just mean, you know, the value that is computed by and that is output by a specific." "In addition, our neural network is parametrized by these matrices, theta superscript j where our theta j is going to be a matrix of waves controlling the function mapping from one layer, maybe the first layer to the second layer or from the second layer to the third layer." "So, here are the computations that are represented by this diagram." "This first hidden unit here, has its value computed as follows: is a a(2)1 is equal to the sigmoid function, or the sigmoid activation function also called the logistic activation function," "applied to this sort of linear combination of its inputs." "And then this second hidden unit has this activation value computed as sigmoid of this." "And similarly, for this third hidden unit, it's computed by that formula." "So here, we have three input units and the three hidden units." "And so the dimension of theta one which the matrix of parameters governing our mapping from all three input units, about three hidden units theta 1 is going to" "be a 3, theta 1 is going to" "be a 3 by 4 dimensional matrix and more generally," "if a network has Sj units and their j and Sj + 1 units in their j + 1 then the matrix theta j which governs the function mapping from layer j to layer j +" "1 that we'll have to mention" "Sj + 1 by Sj + 1." 
"Just be clear about this notation, right?" "This is S subscript j" "+ 1 and that's S subscript j, and then this whole thing, plus 1." "Of this whole thing, that's j + 1, okay?" "So that's S subscript j plus 1 plus, by So, that's S subscript j plus" "1 by Sj" "+ 1 where as plus 1 is not part of the subscript." "So, we talked about what the three hidden units do to compute their values." "Finally, this last, the spinal in optimal layer, we have one more units which computes h of x and that's equal, can also be written as a(3)1 and that's equal to this." "And you notice that I've written this with a superscript" "2 here because theta superscript 2 is the matrix of parameters, or the matrix of weights that controls the function that maps from the hidden units, that is the layer 2 units, to the 1 layer 3 unit that is the output" "unit." "To summarize, what we've done is shown how a picture like this over here defines an artificial neural network which defines a function h that maps your x's input values to hopefully to some space and provisions y?" "And these hypotheses after parametrized by parameters that I am denoting with a capital theta so that as we be vary theta we get different hypotheses." "So we get different functions mapping say from x to y." "So this gives us a mathematical definition of how to represent the hypotheses in the neural network." "In the next few videos, what" "I'd like to do is give you more intuition about what these hypotheses representations do, as well as go through a few examples and talk about how to compute them efficiently." "In the last video, we gave a mathematical definition of how to represent or how to compute the hypotheses used by Neural Network." "In this video, I like show you how to actually carry out that computation efficiently, and that is show you a vector rise implementation." "And second, and more importantly, I want to start giving you intuition about why these neural network representations might be a good idea and how they can help us to learn complex nonlinear hypotheses." "Consider this neural network." "Previously we said that the sequence of steps that we need in order to compute the output of a hypotheses is these equations given on the left where we compute the activation values of the three hidden uses and then we use those to compute the" "final output of our hypotheses h of x." "Now, I'm going to define a few extra terms." "So, this term that I'm underlining here, I'm going to define that to be z superscript 2 subscript 1." "So that we have that a(2)1, which is this term is equal to g of z to 1." "And by the way, these superscript 2, you know, what that means is that the z2 and this a2 as well, the superscript" "2 in parentheses means that these are values associated with layer" "2, that is with the hidden layer in the neural network." "Now this term here" "I'm going to similarly define as z(2)2." "And finally, this last term here that I'm underlining," "let me define that as z(2)3." "So that similarly we have a(2)3 equals g of" "z(2)3." "So these z values are just a linear combination, a weighted linear combination, of the input values x0, x1, x2, x3 that go into a particular neuron." "Now if you look at this block of numbers," "you may notice that that block of numbers corresponds suspiciously similar" "to the matrix vector operation, matrix vector multiplication of x1 times the vector x." "Using this observation we're going to be able to vectorize this computation of the neural network." 
"Concretely, let's define the feature vector x as usual to be the vector of x0, x1, x2, x3 where x0 as usual is always equal" "1 and that defines z2 to be the vector of these z-values, you know, of z(2)1 z(2)2, z(2)3." "And notice that, there, z2 this is a three dimensional vector." "We can now vectorize the computation of a(2)1, a(2)2, a(2)3 as follows." "We can just write this in two steps." "We can compute z2 as theta 1 times x and that would give us this vector z2;" "and then a2 is g of z2 and just to be clear z2 here, This is a three-dimensional vector and a2 is also a three-dimensional vector and thus this activation g." "This applies the sigmoid function element-wise to each of the z2's elements." "And by the way, to make our notation a little more consistent with what we'll do later, in this input layer we have the inputs x, but we can also thing it is as in activations of the first layers." "So, if I defined a1 to be equal to x." "So, the a1 is vector, I can now take this x here and replace this with z2 equals theta1 times a1 just by defining a1 to be activations in my input layer." "Now, with what I've written so far I've now gotten myself the values for a1, a2, a3, and really" "I should put the superscripts there as well." "But I need one more value, which is I also want this a(0)2 and that corresponds to a bias unit in the hidden layer that goes to the output there." "Of course, there was a bias unit here too that, you know, it just didn't draw under here but to take care of this extra bias unit, what we're going to do is add an extra a0 superscript 2," "that's equal to one, and after taking this step we now have that a2 is going to be a four dimensional feature vector because we just added this extra, you know, a0 which is equal to" "1 corresponding to the bias unit in the hidden layer." "And finally, to compute the actual value output of our hypotheses, we then simply need to compute" "z3." "So z3 is equal to this term here that I'm just underlining." "This inner term there is z3." "And z3 is stated 2 times a2 and finally my hypotheses output h of x which is a3 that is the activation of my one and only unit in the output layer." "So, that's just the real number." "You can write it as a3 or as a(3)1 and that's g of z3." "This process of computing h of x is also called forward propagation" "and is called that because we start of with the activations of the input-units and then we sort of forward-propagate that to the hidden layer and compute the activations of the hidden layer and then we sort of forward propagate that" "and compute the activations of the output layer, but this process of computing the activations from the input then the hidden then the output layer, and that's also called forward propagation" "and what we just did is we just worked out a vector wise implementation of this procedure." "So, if you implement it using these equations that we have on the right, these would give you an efficient way or both of the efficient way of computing h of x." "This forward propagation view also helps us to understand what" "Neural Networks might be doing and why they might help us to learn interesting nonlinear hypotheses." "Consider the following neural network and let's say I cover up the left path of this picture for now." "If you look at what's left in this picture." "This looks a lot like logistic regression where what we're doing is we're using that note, that's just the logistic regression unit and we're using that to make a prediction h of x." 
"And concretely, what the hypotheses is outputting is h of x is going to be equal to g which is my sigmoid activation function times theta 0 times a0 is equal to 1 plus theta 1" "plus theta 2 times a2 plus theta" "3 times a3 whether values a1, a2, a3 are those given by these three given units." "Now, to be actually consistent to my early notation." "Actually, we need to, you know, fill in these superscript 2's here everywhere" "and I also have these indices 1 there because I have only one output unit, but if you focus on the blue parts of the notation." "This is, you know, this looks awfully like the standard logistic regression model, except that" "I now have a capital theta instead of lower case theta." "And what this is doing is just logistic regression." "But where the features fed into logistic regression are these values computed by the hidden layer." "Just to say that again, what this neural network is doing is just like logistic regression, except that rather than using the original features x1, x2, x3," "is using these new features a1, a2, a3." "Again, we'll put the superscripts there, you know, to be consistent with the notation." "And the cool thing about this, is that the features a1, a2, a3, they themselves are learned as functions of the input." "Concretely, the function mapping from layer 1 to layer 2, that is determined by some other set of parameters, theta 1." "So it's as if the neural network, instead of being constrained to feed the features x1, x2, x3 to logistic regression." "It gets to learn its own features, a1, a2, a3, to feed into the logistic regression and as you can imagine depending on what parameters it chooses for theta 1." "You can learn some pretty interesting and complex features and therefore you can end up with a better hypotheses than if you were constrained to use the raw features x1, x2 or x3 or if you will constrain to say choose the" "polynomial terms, you know, x1, x2, x3, and so on." "But instead, this algorithm has the flexibility to try to learn whatever features at once, using these a1, a2, a3 in order to feed into this last unit that's essentially" "a logistic regression here." "I realized this example is described as a somewhat high level and so" "I'm not sure if this intuition of the neural network, you know, having more complex features will quite make sense yet, but if it doesn't yet in the next two videos I'm going to go through a specific example" "of how a neural network can use this hidden there to compute more complex features to feed into this final output layer and how that can learn more complex hypotheses." "So, in case what I'm saying here doesn't quite make sense, stick with me for the next two videos and hopefully out there working through those examples this explanation will make a little bit more sense." "But just the point O. You can have neural networks with other types of diagrams as well, and the way that neural networks are connected, that's called the architecture." "So the term architecture refers to how the different neurons are connected to each other." 
"This is an example of a different neural network architecture" "and once again you may be able to get this intuition of how the second layer, here we have three heading units that are computing some complex function maybe of the input layer, and then the third layer can take the" "second layer's features and compute even more complex features in layer three so that by the time you get to the output layer, layer four, you can have even more complex features of what you are able to compute in" "layer three and so get very interesting nonlinear hypotheses." "By the way, in a network like this, layer one, this is called an input layer." "Layer four is still our output layer, and this network has two hidden layers." "So anything that's not an input layer or an output layer is called a hidden layer." "So, hopefully from this video you've gotten a sense of how the feed forward propagation step in a neural network works where you start from the activations of the input layer and forward propagate that to the first hidden layer, then the second" "hidden layer, and then finally the output layer." "And you also saw how we can vectorize that computation." "In the next, I realized that some of the intuitions in this video of how, you know, other certain layers are computing complex features of the early layers." "I realized some of that intuition may be still slightly abstract and kind of a high level." "And so what I would like to do in the two videos is work through a detailed example of how a neural network can be used to compute nonlinear functions of the input and hope that will give you a good sense of the sorts of" "complex nonlinear hypotheses we can get out of Neural Networks." "In this and the next video I want to work through a detailed example, showing how a neural network can compute a complex nonlinear function of the input and hopefully, this will give you a good sense of why" "Neural Networks can be used to learn complex, nonlinear hypotheses." "Consider the following problem where we have input features x1 and x2 that are binary values, so either zero or one." "So x1 and x2 can each take on only one of two possible values." "In this example, I've drawn only two positive examples and two negative examples, but you can think of this as a simplified version of a more complex learning problem where we may have a bunch of positive examples in the upper" "right and the lower left and a bunch of negative examples to notify the circles, and what we'd like to do is learn a nonlinear, you know," "decision boundary that we need to separate the positive and the negative examples." "So how can a neural network do this and rather than use an example on the right." "I'm going to use this, maybe easier to examine example on the left." "Concretely, what this is is really computing the target label y equals x1 XOR x2." "Or this is actually the x1 XNOR x2 function where XNOR is the alternative notation for "not x1 or x2"." "So x1, XOR or x2 - that's true only if exactly one of x1 or x2 is equal to 1." "It turns out that the specific example I'm going to use works out a little bit better if we use the XNOR example, instead." "These two are the same, of course." "It means not x1 XOR x2, and so we're going to have positive examples if either both are true or both are false and we'll have that's y equals 1, y equals 1 and we're going to have y equals 0 if" "only one of them is true and we want to figure out if we can get a neural network to fit to this sort of training set." 
"In order to build up to a network that fits the" "XNOR example, we're going to start to a slightly simpler one and show a network that fits the AND function." "Concretely, lets say we have inputs x1 and x2 that are again binary." "So, it's either zero or one." "And let's say our target labels y are you know, is equal to x1 and x2." "This is a logical AND." "So can we get a one unit network to compute this logical AND function?" "In order to do so, I'm going to actually draw in the bias unit as well, the plus one unit." "Now, let me just assign some values to the weights or the parameters of this network." "I am going to write down the parameters on this diagram." "Write minus 30 here plus 20 and plus" "20 and what this means is that I'm assigning a value of minus thirty to the value associated with x0." "This is plus 1 going to this unit and a parameter value of plus 20 that multiplies in x1 in a value of plus 20 for the parameter that multiplies into x2." "So, concretely, this is saying that my hypotheses h of x is equal to g of -30 + 20x1 + 20x2." "So sometimes it's just convenient to draw these weights and draw these parameters up here, you know, in the diagram of the neural network." "And of course this minus 30 this is actually theta 1" "of 1,0." "This is theta 1 of 1,1 and that's theta" "1 of 1,2 but it's just easier think about it as, you know, associating these parameters with the edges of the network." "Let's look at what this little single neuron network will compute." "Just to remind you, the sigmoid activation function g of z looks like this." "It starts from 0, rises smoothly, crosses 0.5, and then it asymptotes at one." "And to give you some landmarks, if the horizontal axis value z is equal to 4.6, then" "the sigmoid function is equal to 0.99." "This is very close to 1 and kind of symmetrically if it is negative 4.6, then the sigmoid function there is equal to 0.01 which is very close to 0." "Let's look at the four possible input values for x1 and x2 and look at whether the hypothesis will open in that case." "If x1 and x2 are both equal to 0 - if you look at this, if x1 and x2 are both equal to 0 then the hypotheses of point g of -30." "So, it's like very far to the left of this diagram." "This will be very close to 0." "If x1 equals 0 and x2 equals 1 then this formula here evaluates to g, thus the sigmoid function applied to -10 and again, that's, you know, to the far left of this plot and so," "that's again very close to 0." "This is also g of -10." "That is if x1 is equal to 1 and x(2)0, this is -30 plus 20, which is -10." "And finally if x1 equals 1, x2 equals" "1, then you have g of" "30 +20 +20, so that's g of" "+10, which is therefore very close to 1." "And if you look in this column, this is exactly the logical "and" function." "So, this is computing h of x is, you know, approximately x1 and x2." "In other words, it outputs 1 if and only if x1 and x2 are" "both equal to 1." "So by writing out our little truth table like this, we manage to figure out what's the logical function that our neural network computes." "This network shown here computes the OR function just to show you how I worked that out." "If you are to write out the hypotheses you find that it's computing g of" "10 +20 x1" "+20 x2." "And so if you fill in these values you find that's g of" "10 which is approximately 0, g of 10 which is approximately 1, and so on." "These are approximately 1, and approximately 1, and these numbers is essentially the logical OR function." 
"So, hopefully with this, you now understand how single neurons in a neural network can be used to compute logical functions like AND and OR and so on." "In the next video, we'll continue building on these examples and work through a more complex example." "We'll get to show you how a neural network, now with multiple layers of units can be used to compute more complex functions like the XOR function or the XNOR function." "In this video, I'd like you to work in through our example to show how a neural network can compute complex nonlinear hypotheses." "In the last video, we saw how a neural network can be used to compute the functions x1 and x2 and the function x1 or x2 when x1 and x2 are binary." "That is, when they take on values of 0, 1." "We can also have a network to compute negation, that is to compute the function "not x1"." "Let me just write down the ways associated with this network." "We have only one input feature, x1 in this case and the bias unit plus 1 and if I associate this with the weights +10 and" "20 then my hypotheses is computing this." "H of x equals sigmoid of 10 minus 20 times x1 so when x1 is equal to 0, my hypothesis will be computing" "g of 10 minus 20 times 0 which is just 10." "And so that's approximately 1, and when is x is equal to 1 this will be g of -10, which is therefore approximately equal to 0." "And if you look at what these values are, that's essentially the "not x1" function." "So to include negations, the general idea is to put a large negative weight in front of the variable you want to negate." "So if it's -20, multiplied by x1 and, you know, that's the general idea of how you end up negating x1." "And so, in an example that" "I hope you will figure out yourself, if you want to compute a function like this:" ""not x1 and not x2"" "you know, while part of that would probably be putting large negative weights in front of x1 and x2, but it should be feasible to get a neural network with just one output unit to compute this as well." "All right?" "So, this large fill function "not x1 and not x2" is going to be equal to 1 if, and only if, x1 equals x2 equals zero, right?" "So this is a logical function, this is not x1, that means x1 must be zero and not x2." "That means x2 must be equal to zero as well." "So this logical function is equal to 1 if, and only if, both x1 and x2 are equal to zero." "And hopefully, you should be able to figure out how to make a small neural network to compute this logical function as well." "Now, taking the three pieces that we have put together, the network for computing x1 and x2 and the network for computing not x1 and not x2 and one last network for computing x1 or x2, we should be" "able to put these three pieces together to compute this x1, XNOR x2 function." "And just to remind you, if this was x1, x2, this function that we want to compute would have negative examples here and here and we'd have positive examples there and there." "And so clearly this, you know, we'll need a nonlinear decision boundary in order to separate the positive and negative examples." "Let's draw the network." "I'm going to take my input plus 1, x1, x2, and create my first hidden unit here." "I'm going to call this a(2)1 because that's my first hidden unit." "And I'm going to copy the weights over from the Red" "Network, x1 and x2 networks." "So now minus 30, 20, 20." "Next, let me create a second hidden unit, which" "I'm going to call a(2)2 that is the second hidden unit of layer two." 
"And I'm going to copy over the" "Cyan Network in the middle, so I'm going to have the weights 10, minus 20, minus 20." "And so, let's pull some of the truth table values." "For the Red Network, we know that was computing the x1 and x2." "And so this is going to be approximately 0, 0," "0, 1, depending on the values of x1 and x2." "And for a (2)2, that's the Cyan Network." "Well that we know the function not x1 and not x2 then outputs 1," "0, 0, 0 for the 4 values of x1 and x2." "Finally, I'm going to create my output note, my output unit that is a(3)1." "This is one more output h of x and I'm going to copy over the OR" "Network for that and I'm going to need a plus one bias unit here." "So, draw that in and I'm going to copy over the weights from the Green Networks." "So, it's minus 10, 20, 20 and we know earlier that this computes the OR function." "So, let's go on the truth table entries." "For the first entry is 0 or 1, which is gonna be" "1 then next 0 or 0, which is 0," "0, or 0, which is 0, 1 or 0 and that all is to 1 and thus, h of x is equal to 1 when either both x1 and x2 are 0 or when x1 and x2 are both 1." "And concretely, h of x outputs 1 exactly at these two locations and it outputs" "0 otherwise and thus, with this neural network, which has an input layer, one hidden layer and one output layer, we end up with a nonlinear decision boundary that computes this XNOR function." "And the more general intuition is that in the input layer, we just had our raw inputs then we had a hidden layer, which computed some slightly more complex functions of the inputs that is shown here, these are slightly more" "complex functions, and then by adding yet another layer, we end up with an even more complex nonlinear function." "And this is the sort of intuition about why Neural" "Networks can compute pretty complicated functions that when you have multiple layers, you have, you know, relatively simple function of the inputs, and the second layer, but the third layer can build on that to compute even more" "complex functions and then the layer after that can compute even more complex functions." "To wrap up this video, I want to show you a fun example of an application of a neural network that captures this intuition of the deeper layers computing more complex features." "I want to show you a video that I got from a good friend of mine, Yon Khun." "Yon is a professor at" "New York University, at NYU, and he was one of the early pioneers of neural network research and sort of a legend in the field now and his ideas are used in all sorts of products and applications throughout the world now." "So, I want to show you a video from some of his early work in which he was using a neural network to recognize handwriting - to do handwritten digit recognition." "You might remember early in this class, at the start of this class, I said that one of early successes of neural networks was trying to use it to read zip codes, to help us, you know, send mail along." "So, to read postal codes." "So, this is one of the attempts." "So, this is one of the algorithms used to try to address that problem." "In the video I'll show you this area here is the input area that shows a handwritten character shown to the network." "This column here shows a visualization of the features computed by so that the first hidden layer of the network and so the first hidden layer, you know, this visualization shows different features, different edges and lines and so on detected." "This is a visualization of the next hidden layer." 
"It's kind of harder to see how to understand deeper hidden layers and that's the visualization of what the next hidden layer is computing." "You probably have a hard time seeing what's going on, you know, much beyond the first hidden layer, but then finally, all of these learned features get fed to the output layer and shown over here is" "the final answers, the final predictive value for what handwritten digit the neural network things that is being shown." "So, let's take a look at the video." "So, I hope" "you enjoyed the video and that this hopefully gave you some intuition about the sorts of pretty complicated functions neural networks can learn in which it takes this input this image, just takes this input the raw pixels and the first" "end of the layer computes some set of features, the next end of the layer computes even more complex features and even more complex features and these features can then be used by essentially the final layer of logistic regression classifiers" "to make accurate predictions about what are the numbers that the network sees." "In this video we'll talk about an approach to building a recommender system that's called collaborative filtering." "The algorithm that we're talking about has a very interesting property that it does what is called feature learning and by that I mean that this will be an algorithm that can start to learn for itself what features to use." "Here was the data set that we had and we had assumed that for each movie, someone had come and told us how romantic that movie was and how much action there was in that movie." "But as you can imagine it can be very difficult and time consuming and expensive to actually try to get someone to, you know, watch each movie and tell you how romantic each movie and how action packed is each" "movie, and often you'll want even more features than just these two." "So where do you get these features from?" "So let's change the problem a bit and suppose that we have a data set where we do not know the values of these features." "So we're given the data set of movies and of how the users rated them, but we have no idea how romantic each movie is and we have no idea how action packed each movie is so I've replaced all of these things with question marks." "But now let's make a slightly different assumption." "Let's say we've gone to each of our users, and each of our users has told has told us how much they like the romantic movies and how much they like action packed movies." "So Alice has associated a current of theta 1." "Bob theta 2." "Carol theta 3." "Dave theta 4." "And let's say we also use this and that Alice tells us that she really likes romantic movies and so there's a five there which is the multiplier associated with X1 and lets say that Alice tells us she" "really doesn't like action movies and so there's a 0 there." "And Bob tells us something similar so we have theta 2 over here." "Whereas Carol tells us that she really likes action movies which is why there's a 5 there, that's the multiplier associated with X2, and remember there's also" "X0 equals 1 and let's say that Carol tells us she doesn't like romantic movies and so on, similarly for Dave." "So let's assume that somehow we can go to users and each user J just tells us what is the value of theta J for them." "And so basically specifies to us of how much they like different types of movies." 
"If we can get these parameters theta from our users then it turns out that it becomes possible to try to infer what are the values of x1 and x2 for each movie." "Let's look at an example." "Let's look at movie 1." "So that movie 1 has associated with it a feature vector x1." "And you know this movie is called Love at last but let's ignore that." "Let's pretend we don't know what this movie is about, so let's ignore the title of this movie." "All we know is that Alice loved this move." "Bob loved this movie." "Carol and Dave hated this movie." "So what can we infer?" "Well, we know from the feature vectors that Alice and Bob love romantic movies because they told us that there's a 5 here." "Whereas Carol and Dave, we know that they hate romantic movies and that they love action movies." "So because those are the parameter vectors that you know, uses 3 and 4, Carol and Dave, gave us." "And so based on the fact that movie 1 is loved by Alice and Bob and hated by Carol and Dave, we might reasonably conclude that this is probably a romantic movie, it is probably not much of an action movie." "this example is a little bit mathematically simplified but what we're really asking is what feature vector should X1 be so that theta 1 transpose x1 is approximately equal to 5, that's Alice's rating, and theta 2 transpose x1 is" "also approximately equal to 5, and theta 3 transpose x1 is approximately equal to 0, so this would be Carol's rating, and" "theta 4 transpose X1 is approximately equal to 0." "And from this it looks like, you know, X1 equals one that's the intercept term, and then 1.0, 0.0, that makes sense given what we know of Alice," "Bob, Carol, and Dave's preferences for movies and the way they rated this movie." "And so more generally, we can go down this list and try to figure out what might be reasonable features for these other movies as well." "Let's formalize this problem of learning the features XI." "Let's say that our users have given us their preferences." "So let's say that our users have come and, you know, told us these values for theta 1 through theta of NU and we want to learn the feature vector XI for movie number I. What we can" "do is therefore pose the following optimization problem." "So we want to sum over all the indices J for which we have a rating for movie I because we're trying to learn the features for movie I that is this feature vector XI." "So and then what we want to do is minimize this squared error, so we want to choose features XI, so that, you know, the predictive value of how user J rates movie" "I will be similar, will be not too far in the squared error sense of the actual value YIJ that we actually observe in the rating of user j" "on movie I. So, just to summarize what this term does is it tries to choose features" "XI so that for all the users J that have rated that movie, the algorithm also predicts a value for how that user would have rated that movie that is not too far, in the squared error sense, from the" "actual value that the user had rated that movie." "So that's the squared error term." "As usual, we can also add this sort of regularization term to prevent the features from becoming too big." "So this is how we would learn the features for one specific movie but what we want to do is learn all the features for all the movies and so what" "I'm going to do is add this extra summation here so" "I'm going to sum over all Nm movies, N subscript m movies, and minimize this objective on top that sums of all movies." 
"And if you do that, you end up with the following optimization problem." "And if you minimize this, you have hopefully a reasonable set of features for all of your movies." "So putting everything together, what we, the algorithm we talked about in the previous video and the algorithm that we just talked about in this video." "In the previous video, what we showed was that you know, if you have a set of movie ratings, so if you have the data the rij's and then you have the yij's that will be the movie ratings." "Then given features for your different movies we can learn these parameters theta." "So if you knew the features, you can learn the parameters theta for your different users." "And what we showed earlier in this video is that if your users are willing to give you parameters, then you can estimate features for the different movies." "So this is kind of a chicken and egg problem." "Which comes first?" "You know, do we want if we can get the thetas, we can know the Xs." "If we have the Xs, we can learn the thetas." "And what you can do is, and then this actually works, what you can do is in fact randomly guess some value of the thetas." "Now based on your initial random guess for the thetas, you can then go ahead and use the procedure that we just talked about in order to learn features for your different movies." "Now given some initial set of features for your movies you can then use this first method that we talked about in the previous video to try to get an even better estimate for your parameters theta." "Now that you have a better setting of the parameters theta for your users, we can use that to maybe even get a better set of features and so on." "We can sort of keep iterating, going back and forth and optimizing theta, x theta, x theta, nd this actually works and if you do this, this will actually cause your album to converge to a reasonable set of" "features for you movies and a reasonable set of parameters for your different users." "So this is a basic collaborative filtering algorithm." "This isn't actually the final algorithm that we're going to use." "In the next video we are going to be able to improve on this algorithm and make it quite a bit more computationally efficient." "But, hopefully this gives you a sense of how you can formulate a problem where you can simultaneously learn the parameters and simultaneously learn the features from the different movies." "And for this problem, for the recommender system problem, this is possible only because each user rates multiple movies and hopefully each movie is rated by multiple users." "And so you can do this back and forth process to estimate theta and x." "So to summarize, in this video we've seen an initial collaborative filtering algorithm." "The term collaborative filtering refers to the observation that when you run this algorithm with a large set of users, what all of these users are effectively doing are sort of collaboratively--or collaborating to get better movie ratings for everyone because with every" "user rating some subset with the movies, every user is helping the algorithm a little bit to learn better features," "and then by helping-- by rating a few movies myself, I will be helping" "the system learn better features and then these features can be used by the system to make better movie predictions for everyone else." "And so there is a sense of collaboration where every user is helping the system learn better features for the common good." "This is this collaborative filtering." 
"And, in the next video what we going to do is take the ideas that have worked out, and try to develop a better an even better algorithm, a slightly better technique for collaborative filtering." "In this video, I want to tell you about how to use neural networks to do multiclass classification where we may have more than one category that we're trying to distinguish amongst." "In the last part of the last video, where we had the handwritten digit recognition problem, that was actually a multiclass classification problem because there were ten possible categories for recognizing the digits from" "0 through 9 and so, if you want us to fill you in on the details of how to do that." "The way we do multiclass classification in a neural network is essentially an extension of the one versus all method." "So, let's say that we have a computer vision example, where instead of just trying to recognize cars as in the original example that I started off with, but let's say that we're trying to recognize, you know, four" "categories of objects and given an image we want to decide if it is a pedestrian, a car, a motorcycle or a truck." "If that's the case, what we would do is we would build a neural network with four output units so that our neural network now outputs a vector of four numbers." "So, the output now is actually needing to be a vector of four numbers and what we're going to try to do is get the first output unit to classify: is the image a pedestrian, yes or no." "The second unit to classify: is the image a car, yes or no." "This unit to classify: is the image a motorcycle, yes or no, and this would classify:" "is the image a truck, yes or no." "And thus, when the image is of a pedestrian, we would ideally want the network to output 1, 0, 0, 0, when it is a car we want it to output" "0, 1, 0, 0, when this is a motorcycle, we get it to or rather, we want it to output 0, 0," "1, 0 and so on." "So this is just like the "one versus all" method that we talked about when we were describing logistic regression, and here we have essentially four logistic regression classifiers, each of which is trying to recognize one" "of the four classes that we want to distinguish amongst." "So, rearranging the slide of it, here's our neural network with four output units and those are what we want h of x to be when we have the different images, and the way we're going to represent the" "training set in these settings is as follows." "So, when we have a training set with different images of pedestrians, cars, motorcycles and trucks, what we're going to do in this example is that whereas previously we had written out the labels as y being an integer from" "1, 2, 3 or 4." "Instead of representing y this way, we're going to instead represent y as follows: namely Yi" "will be either 1, 0, 0, 0 or 0, 1, 0, 0 or 0, 0, 1, 0 or 0, 0, 0, 1 depending on what the corresponding image Xi is." "And so one training example will be one pair Xi colon Yi" "where Xi is an image with, you know one of the four objects and" "Yi will be one of these vectors." "And hopefully, we can find a way to get our" "Neural Networks to output some value." "So, the h of x is approximately y and both h of x and Yi, both of these are going to be in our example, four dimensional vectors when we have four classes." "So, that's how you get neural network to do multiclass classification." "This wraps up our discussion on how to represent Neural Networks that is on our hypotheses representation." 
"In the next set of videos, let's start to talk about how take a training set and how to automatically learn the parameters of the neural network." "Neural Networks are one of the most powerful learning algorithms that we have today." "In this, and in the next few videos, I'd like to start talking about a learning algorithm for fitting the parameters of the neural network given the training set." "As for the discussion of most of the learning algorithms, we're going to begin by talking about the cost function for fitting the parameters of the network." "I'm going to focus on the application of neural networks to classification problems." "So, suppose we have a network like that shown on the left." "And suppose we have a training set like this of this of" "Xi, Yi pairs of m training examples." "I'm going to use upper case" "L to denote the total number of layers in this network." "So, for the network shown on the left, we would have capital L equals 4." "And, I'm going to use s subscript L, to denote the number of units, that is a number of neurons, not counting the bias unit in layer L of the network." "So, for example, we would have a S1 which is the input layer equals S3 unit," "S2 in my example is five units." "And the output layer S4." "Which is also equals SL, because capital L is equal to four." "The output layer in my example in the left has four units." "We're going to consider two types of classification problems." "The first is binary classification, where the labels y are either zero or one." "In this case, we would have one output unit." "So, this neural network on top has four output units, but if we had binary classification, we would have only one output unit that computes h of x." "And the outputs of the neural network would be h of x is going to be a real number." "And in this case the number of output units, SL, where" "L is again the index of the final layer because that's the number of layers we have in the network." "So the number of units we have in the output layer is going to be equal to one." "In this case, to simplify notation later, I'm also going to set k equals 1." "So, you can think of k as also denoting the number of units in the output layer." "The second type of classification problem we'll consider will be multiclass classification problem where we may have k distinct classes." "So, our early example, I had this representation for y if we have four classes and in this case, we would have caps lock K output units and our hypotheses will output vectors that are K dimensional." "And the number of output units will be equal to K." "And usually we will have" "K greater than or equal to three in this case, because" "if we had two classes then, you know, we don't need to use the one versus all method." "We need to use the one versus all method only if we have K greater than or equal to three classes so we only have two classes we will need to use only one output unit." "Now, let's define the cost function for our cost function for our neural network." "The cost function we use for the neural network is going to be a generalization of the one that we use for logistic regression." "For logistic regression, we used to minimize the cost function j of theta that was minus 1 over m of this cost function and then plus this extra regularization term here, where this was a sum from j equals" "1 through n, because we did not regularize the bias term theta zero." "For a neural network our cost function is going to be a generalization of this." 
"Where instead of having basically just one logistic regression output unit, we may instead have K of them." "So here's our cost function." "Neural network now outputs vectors in RK where K might be equal to 1 if we have the binary classification problem." "I'm going to use this notation, h of x subscript i, to denote the ith output." "That is h of x is a K dimensional vector." "And so this subscript i just selects out the ith element of the vector that is output by my neural network." "My cost function, j of theta is now going to be the following is minus one over m of a sum of a similar term to what we have in logistic regression." "Except that we have this sum from K equals one through K. The summation is basically a sum over my K output unit." "So, if I have four upper units." "That is the final layer of my neural network has four output units then this sum from, this is a sum from" "K equals one through four of basically the logistic regression algorithms" "cost function but summing that cost function over each of my four output units in turn." "And so, you notice in particular that this applies to YK, HK, because we're basically taking the K upper unit and comparing that to the value of YK, which is you know, which is that one of those vectors" "to say what cause it should be." "And finally, the second term here is the regularization term similar to what we had for logistic regression." "This summation terms looks really complicated and always doing is a summing over these terms, theta j i l for all values of i j and I." "Except that we don't sum over the terms corresponding to these bias values like we had for logistic progression." "Concretely, we don't sum over the terms corresponding to where i is equal to zero." "So, that is because when we are computing the activation of the neuron, we have terms like these, you know theta, i0 plus theta, i1, x1 plus, and so on, where I guess" "we could have a 2 there if this is the first hidden layer, and so the values with the 0 there at that corresponds to something that multiplies into an x0 or an a0 and so, this is kind of like" "a bias unit and by analogy to what we were doing for logistic progression, we won't sum over those terms in our regularization term because we don't want to regularize them and string the values 0." "But this is just one possible convention and even if you were to sum over, you know, i equals 0 up to SL, it will work about the same and it doesn't make a big difference." "But maybe this convention of not regularizing the bias term is just slightly more common." "So, that's the cost function we're going to use to fill on your own network." "In the next video, we'll start to talk about an algorithm for trying to optimize the cost function." "In the previous video, we talked about a cost function for the neural network." "In this video, let's start to talk about an algorithm, for trying to minimize the cost function." "In particular, we'll talk about the back propagation algorithm." "Here's the cost function that we wrote down in the previous video." "What we'd like to do is try to find parameters theta to try to minimize j of theta." "In order to use either gradient descent or one of the advance optimization algorithms." "What we need to do therefore is to write code that takes this input the parameters theta and computes j of theta and these partial derivative terms." 
"Remember, that the parameters in the the neural network of these things, theta superscript l subscript ij, that's the real number and so, these are the partial derivative terms we need to compute." "In order to compute the cost function j of theta, we just use this formula up here and so, what" "I want to do for the most of this video is focus on talking about how we can compute these partial derivative terms." "Let's start by talking about the case of when we have only one training example, so imagine if you will that our entire training set comprises only one training example which is a pair xy." "I'm not going to write x1y1 just write this." "Write a one training example as xy and let's tap through the sequence of calculations we would do with this one training example." "The first thing we do is we apply four propagation in order to compute whether a hypotheses actually outputs given the input." "Concretely, the called the a1 is the activation values of this first layer that was the input there." "So, I'm going to set that to x and then we're going to compute z2 equals theta 1 a1 and a2 equals g, the sigmoid activation function applied to z2 and this would give us our activations for the first middle layer." "That is for layer two of the network and we also add those bias terms." "Next we apply 2 more steps of this four and propagation to compute a3 and a4" "which is also the upwards of a hypotheses h of x." "So this is our vectorized implementation of forward propagation and it allows us to compute the activation values for all of the neurons in our neural network." "Next, in order to compute the derivatives, we're going to use an algorithm called back propagation." "The intuition of the back propagation algorithm is that for each note we're going to compute the term delta superscript l subscript j that's going to somehow represent the error of note j in the layer l." "So, recall that a superscript l subscript j that does the activation of the j of unit in layer l and so, this delta term is in some sense going to capture our error in the activation of that neural duo." "So, how we might wish the activation of that note is slightly different." "Concretely, taking the example neural network that we have on the right which has four layers." "And so capital L is equal to 4." "For each output unit, we're going to compute this delta term." "So, delta for the j of unit in the fourth layer is equal to" "just the activation of that unit minus what was the actual value of 0 in our training example." "So, this term here can also be written h of x subscript j, right." "So this delta term is just the difference between when a hypotheses output and what was the value of y in our training set whereas y subscript j is the j of element of the vector value y in our labeled training set." "And by the way, if you think of delta a and y as vectors then you can also take those and come up with a vectorized implementation of it, which is just delta 4 gets set as a4 minus y." "Where here, each of these delta 4 a4 and y, each of these is a vector whose dimension is equal to the number of output units in our network." "So we've now computed the era term's delta" "4 for our network." "What we do next is compute the delta terms for the earlier layers in our network." "Here's a formula for computing delta 3 is delta 3 is equal to theta 3 transpose times delta 4." "And this dot times, this is the element y's multiplication operation" "that we know from MATLAB." 
"So delta 3 transpose delta 4, that's a vector; g prime z3 that's also a vector and so dot times is in element y's multiplication between these two vectors." "This term g prime of z3, that formally is actually the derivative of the activation function g evaluated at the input values given by z3." "If you know calculus, you can try to work it out yourself and see that you can simplify it to the same answer that I get." "But I'll just tell you pragmatically what that means." "What you do to compute this g prime, these derivative terms is just a3 dot times1 minus A3 where A3 is the vector of activations." "1 is the vector of ones and A3 is again the activation the vector of activation values for that layer." "Next you apply a similar formula to compute delta 2 where again that can be computed using a similar formula." "Only now it is a2 like so and I then prove it here but you can actually, it's possible to prove it if you know calculus that this expression is equal to mathematically, the derivative of the g function of the activation" "function, which I'm denoting by g prime." "And finally, that's it and there is no delta1 term, because the first layer corresponds to the input layer and that's just the feature we observed in our training sets, so that doesn't have any error associated with that." "It's not like, you know, we don't really want to try to change those values." "And so we have delta terms only for layers 2, 3 and for this example." "The name back propagation comes from the fact that we start by computing the delta term for the output layer and then we go back a layer and compute the delta terms for the third hidden layer and then we go back another step to compute" "delta 2 and so, we're sort of back propagating the errors from the output layer to layer 3 to their to hence the name back complication." "Finally, the derivation is surprisingly complicated, surprisingly involved but if you just do this few steps steps of computation it is possible to prove viral frankly some what complicated mathematical proof." "It's possible to prove that if you ignore authorization then the partial derivative terms you want" "are exactly given by the activations and these delta terms." "This is ignoring lambda or alternatively the regularization" "term lambda will equal to 0." "We'll fix this detail later about the regularization term, but so by performing back propagation and computing these delta terms, you can, you know, pretty quickly compute these partial derivative terms for all of your parameters." "So this is a lot of detail." "Let's take everything and put it all together to talk about how to implement back propagation" "to compute derivatives with respect to your parameters." "And for the case of when we have a large training set, not just a training set of one example, here's what we do." "Suppose we have a training set of m examples like that shown here." "The first thing we're going to do is we're going to set these delta l subscript i j." "So this triangular symbol?" "That's actually the capital Greek alphabet delta ." "The symbol we had on the previous slide was the lower case delta." "So the triangle is capital delta." "We're gonna set this equal to zero for all values of I i j." "Eventually, this capital delta l i j will be used" "to compute the partial derivative term, partial derivative respect to theta I i j of" "J of theta." 
"So as we'll see in a second, these deltas are going to be used as accumulators that will slowly add things in order to compute these partial derivatives." "Next, we're going to loop through our training set." "So, we'll say for i equals 1 through m and so for the i iteration, we're going to working with the training example xi, yi." "So the first thing we're going to do is set a1 which is the activations of the input layer, set that to be equal to xi is the inputs for our i training example, and then we're going to perform forward propagation to" "compute the activations for layer two, layer three and so on up to the final layer, layer capital L. Next, we're going to use the output label yi from this specific example we're looking at to compute the error" "term for delta L for the output there." "So delta L is what a hypotheses output minus what the target label was?" "And then we're going to use the back propagation algorithm to compute delta L minus 1, delta L minus 2, and so on down to delta 2 and once again there is now delta 1 because we don't associate an error term with the input layer." "And finally, we're going to use these capital delta terms to accumulate these partial derivative terms that we wrote down on the previous line." "And by the way, if you look at this expression, it's possible to vectorize this too." "Concretely, if you think of delta ij as a matrix, indexed by subscript ij." "Then, if delta L is a matrix we can rewrite this as delta L, gets updated as delta L plus" "lower case delta L plus one times aL transpose." "So that's a vectorized implementation of this that automatically does an update for all values of i and j." "Finally, after executing the body of the four-loop we then go outside the four-loop and we compute the following." "We compute capital D as follows and we have two separate cases for j equals zero and j not equals zero." "The case of j equals zero corresponds to the bias term so when j equals zero that's why we're missing is an extra regularization term." "Finally, while the formal proof is pretty complicated what you can show is that once you've computed these D terms, that is exactly the partial derivative of the cost function with respect to each of your perimeters and so you" "can use those in either gradient descent or in one of the advanced authorization" "algorithms." "So that's the back propagation algorithm and how you compute derivatives of your cost function for a neural network." "I know this looks like this was a lot of details and this was a lot of steps strung together." "But both in the programming assignments write out and later in this video, we'll give you a summary of this so we can have all the pieces of the algorithm together so that you know exactly what you need" "to implement if you want to implement back propagation to compute the derivatives of your neural network's cost function with respect to those parameters." "In the previous video, we talked about the back propagation algorithm." "To a lot of people seeing it for the first time, the first impression is often that wow, this is a very complicated algorithm and there are all these different steps." "And I'm not quite sure how they fit together and its like kind of a black box with all these complicated steps." "In case that 's how you are feeling about back propagation, that's actually okay." 
"Back propagation may be unfortunately is a less mathematically clean or less mathematically simple algorithm compared to linear regression or logistic regression, and I've actually used back propagation, you know, pretty successfully for many years and even today, I still don't sometimes" "feel like I have a very good sense of just what it's doing most of intuition about what background propagation is doing." "For those of you that are doing the programming exercises that will at least mechanically step you through the different steps of how to implement back prop so you will be able to get it to work for yourself." "And what I want to do in this video is look a little bit more at the mechanical steps of back propagation and try to give you a little more intuition about what the mechanical steps of back prop is doing to hopefully convince you" "that, you know, it is at least a reasonable algorithm." "In case even after this video, in case back propagation still seems very black box and kind of like, you know, too many complicated steps, a little bit magical to to you, that's actually okay." "And even though, you know, I have used back prop for many years, sometimes it's a difficult algorithm to understand." "But hopefully this video will help a little bit." "In order to better understand back propagation, let's take another closer look at what forward propagation is doing." "Here's the neural network with two input units that is not counting the bias unit, and two hidden units in this layer and two hidden units in the next layer, and then finally one output unit." "And again, these counts 2, 2, 2 are not counting these bias units on top." "In order to illustrate forward propagation, I'm going to draw this network a little bit differently." "And in particular, I'm going to draw this neural network with the nodes drawn as these very fat ellipses, so that I can write text in them." "When performing forward propagation, we might have some particular example, say some example x(i) comma y(i) and it will be this x(i) that we feed into the input layer, so" "that this may be, x(i)1 and x(i)2 are the values we set the input layer to and when we forward propagate it to the first hidden layer here, what we do is compute z(2)1 and" "z(2)2, so these are the weighted sum of inputs of the input units and then we apply the sigmoid of the logistic function and the" "sigmoid activation function applied to the z value, gives us these activation values." "So that gives us a(2)1 and a(2)2, and then we forward propagate again to get," "you know, here z(3)1, apply the sigmoid of the logistic function, the activation function to that, to get the" "31 and similarly Like so, until we get z(4)1, apply the activation function this gives us a(4)1 which is the finer output value of the network." "Let's erase this arrow to give myself some space, and if you look at what this computation really is doing, focusing on this hidden unit lets say we have that this weight, shown in magenta there, is my" "weight theta 2(1)0 the indexing is not important, and this way here which I guess" "I am highlighting in red, that is theta 2(1)1 and" "this weight here, which I'm drawing in green, in a cyan, is theta 2(1)2 so the way the computers value z(3)1 is z(3)1 is as" "equal to this Weight, times this value so that's" "theta 2(1)0 times 1, plus this red weight times this value, so that's theta 2(1)1 times" "a(2)1, and finally this cyan red times this value, which is therefore, plus theta" "2(1)2 times a(2)1." 
"And so that's forward propagation." "And it turns out that, as we see later on in this video, what back propagation is doing, is doing a process very similar to this, except that instead of the computations flowing from the left to the right of this network," "the computations is there flow from the right to the left of the network, and using a very similar computation as this, and I'll say in two slides exactly what I mean by that." "To better understand what back propagation is doing, let's look at the cost function, it's just the cost function that we had for when we have only one output unit." "If we have more than one output unit, we just have a summation, you know, over the output units index, if only one output unit then this is a cost function operation and we do forward propagation and back propagation on one example at a time." "So, let's just focus on the single example x(i)y(i), and focus on the case of having one output unit so y(i) here is just a real number, and" "let's ignore regularization, so lambda equals zero, and this final term, that regularization term goes away." "Now, if you look inside this summation, you find that the cost term associated with the I'f training example, that is, the cost associated with training example x(i)y(i), that's" "going to be given by this expression, that the cost, sort of, of training example i is written as follows." "And what this cost function does, is it plays a role similar to the square error." "So, rather than looking at this complicated expression, if you want you can think of cos of i being approximately, you know, the square of difference between or the neural network outputs versus what is the actual value." "Just as in logistic regression, we actually prefer to use this slightly more complicated cost function using the log, but for the purpose of intuition, feel free to think of the cost function as being sort of the squared" "error cost function, and so this cos of i measures how well is the network doing on correctly predicting example i." "How close is the output to the actually observed label y(i)." "Now let's look at what back propagation is doing." "One useful intuition is that back propagation is computing these delta superscript l subscript j terms, and we can think of these as the quote error of the activation value that we got for unit j in the layer, in the" "ith layer." "More formally, and this is maybe only for those of you that are familiar with calculus," "more formally, what the delta terms actually are is this:" "they're the partial derivative with respect to z(l)j, that is the weighted sum of inputs that we're computing the z terms, partial derivative respect of these things of the cost function." "So concretely the cost function is a function of the label y and of the value, this h of x output value neural network." "And if we could go inside the neural network and just change those z(l)j values a little bit, then that would affect these values that the neural net." "And so that will end up changing the cost function." "And again really this is only for those of you expert in calculus." "If you are familiar with comfortable with partial derivatives." "What these delta terms are, is they're, they turn out to be the partial derivative of the cos function with respect to these intermediate terms that we're computing." 
"And so their measure of how much would we like to change the neural network's weights in order to affect these intermediate values of the computation, so as to affect the final output the neural network h of x and" "therefore affect the overall cost." "In case this last part of this partial derivative intuition, in case that didn't make sense, don't worry about it, the rest of this we can do without really talking partial derivatives but let's look in more detail at what" "back propagation is doing." "For the output layer, if first sets this delta term, we say delta 4(1), as y(i) if we're doing forward propagation and back propagation on this training example i." "It says it's y(i) minus a(4)1, so it's really the error, it's the difference between the actual value of y minus what was the value predicted." "And so we're going to compute delta 4(1) like so." "Next we're going to do propagate these values backwards." "I explain this in a second and end up computing the delta terms of the previous layer." "We're going to end up with delta 3(1); delta 3(2);" "and then we're going to propagate this further backward and end up computing delta 2(1) and delta 2(2)." "Now the back propagation calculation is a lot like running the forward propagation algorithm, but doing it backwards." "So here's what I mean." "Let's look at how we end up with this value of Delta 2(2)." "So we have Delta 2(2) and similar to forward propagation, let me label a couple of the weights." "So this weight should be one cyan--let's say that weight is theta 2 of 1, 2 and this weight down here, let me highlight this in red." "That's going to be, let's say, theta 2 of 2, 2." "So if we look at how Delta 2(2) is computed." "How it's computed for this note." "It turns out that what we're going to do is we're going to take this value and multiply it by this weight and add it to this value multiplied by that weight." "So it's really a weighted sum of the new, these delta values." "weighted by the corresponding edge strength." "So concretely, let me fill this in." "This delta 2,2 is going to be equal to theta 2(1)2, which is that magenta weight, times delta 3(1) plus, and then the thing I have in red, that's" "theta 2(2)2 times Delta 3(2)." "So it is really, literally this red weight times this value, plus this magenta weight times it's value and that's how we wind up with that value of delta." "And just as another example, let's look at this value." "How did we get that value?" "Well, it's a similar process, if this weight," "which I'm going to highlight in green, if this weight is equal to, say, delta" "3(1)2, then we have that, delta 3(2) is going to be equal to that green weight, theta 3(1)2 times delta 4(1)." "And by the way, so far I've been writing the delta values only for the hidden units and not, but not, excluding the bias units." "Depending on how you define the back propagation algorithm or depending on how you implement it, you know, you may end up implementing something to compute delta values for these bias units as well." "The bias unit is always output the values plus one and they are just what they are and there's no way for us to change the value and so, depending on your implementation of back prop, the way I usually implement it," "I do end up computing these delta values, but we just discard them and we don't use them, because they don't end up being part of the calculation needed to compute the derivatives." "So, hopefully, that gives you a little bit of intuition about what back propagation is doing." 
"In case of all this, they still seem so magical and so black box, in a later video, in the putting it together video, I'll try to give a little more intuition about what that back propagation is doing." "But, unfortunately, this is, you know, a difficult algorithm to try to visualize and understand what it is really doing." "But fortunately, you know, often I guess, many people have been using it very successfully for many years and if you infer the algorithm, you have a very effective learning algorithm, even though the inner workings of exactly" "how it works can be harder to visualize." "In the previous video, we talked about how to use back propagation" "to compute the derivatives of your cost function." "In this video, I want to quickly tell you about one implementational detail of unrolling your parameters from matrices into vectors, which we need in order to use the advanced optimization routines." "Concretely, let's say you've implemented a cost function that takes this input, you know, parameters theta and returns the cost function and returns derivatives." "Then you can pass this to an advanced authorization algorithm by fminunc and fminunc isn't the only one by the way." "There are also other advanced authorization algorithms." "But what all of them do is take those input pointedly the cost function, and some initial value of theta." "And both, and these routines assume that theta and the initial value of theta, that these are parameter vectors, maybe" "Rn or Rn plus 1." "But these are vectors and it also assumes that, you know, your cost function will return as a second return value this gradient which is also Rn and Rn plus 1." "So also a vector." "This worked fine when we were using logistic progression but now that we're using a neural network our parameters are no longer vectors, but instead they are these matrices where for a full neural network we would have parameter matrices theta 1, theta 2, theta 3" "that we might represent in Octave as these matrices theta 1, theta 2, theta 3." "And similarly these gradient terms that were expected to return." "Well, in the previous video we showed how to compute these gradient matrices, which was capital D1, capital D2, capital D3, which we might represent an octave as matrices D1, D2, D3." "In this video I want to quickly tell you about the idea of how to take these matrices and unroll them into vectors." "So that they end up being in a format suitable for passing into as theta here off for getting out for a gradient there." "Concretely, let's say we have a neural network with one input layer with ten units, hidden layer with ten units and one output layer with just one unit, so s1 is the number of units in layer one" "and s2 is the number of units in layer two, and s3 is a number of units in layer three." "In this case, the dimension of your matrices theta and" "D are going to be given by these expressions." "For example, theta one is going to a 10 by 11 matrix and so on." "So in if you want to convert between these matrices." "vectors." "What you can do is take your theta 1, theta" "2, theta 3, and write this piece of code and this will take all the elements of your three theta matrices and take all the elements of theta one, all the elements of theta 2, all the" "elements of theta 3, and unroll them and put all the elements into a big long vector." "Which is thetaVec and similarly the second command would take all of your D matrices and unroll them into a big long vector and call them" "dvec." 
"And finally if you want to go back from the vector representations to the matrix representations." "What you do to get back to theta one say is take thetaVec and pull out the first 110 elements." "So theta 1 has 110 elements because it's a" "10 by 11 matrix so that pulls out the first 110 elements and then you can use the reshape command to reshape those back into theta 1." "And similarly, to get back theta 2 you pull out the next 110 elements and reshape it." "And for theta 3, you pull out the final eleven elements and run reshape to get back the theta 3." "Here's a quick Octave demo of that process." "So for this example let's set theta 1 equal to be ones of 10 by" "11, so it's a matrix of all ones." "And just to make this easier seen, let's set that to be 2 times ones, 10 by" "11 and let's set theta 3 equals 3 times 1's of 1 by 11." "So this is 3 separate matrices: theta 1, theta 2, theta 3." "We want to put all of these as a vector." "ThetaVec equals theta 1; theta 2" "theta 3." "Right, that's a colon in the middle and like so" "and now thetavec is going to be a very long vector." "That's 231 elements." "If I display it, I find that this very long vector with all the elements of the first matrix, all the elements of the second matrix, then all the elements of the third matrix." "And if I want to get back my original matrices, I can do reshape thetaVec." "Let's pull out the first 110 elements and reshape them to a 10 by 11 matrix." "This gives me back theta 1." "And if I then pull out the next 110 elements." "So that's indices 111 to 220." "I get back all of my 2's." "And if I go from 221 up to the last element, which is element 231, and reshape to" "1 by 11, I get back theta 3." "To make this process really concrete, here's how we use the unrolling idea to implement our learning algorithm." "Let's say that you have some initial value of the parameters theta 1, theta 2, theta 3." "What we're going to do is take these and unroll them into a long vector we're gonna call initial theta to pass in to fminunc as this initial setting of the parameters theta." "The other thing we need to do is implement the cost function." "Here's my implementation of the cost function." "The cost function is going to give us input, thetaVec, which is going to be all of my parameters vectors that in the form that's been unrolled into a vector." "So the first thing I'm going to do is I'm going to use thetaVec and I'm going to use the reshape functions." "So I'll pull out elements from thetaVec and use reshape to get back my original parameter matrices, theta 1, theta 2, theta 3." "So these are going to be matrices that I'm going to get." "So that gives me a more convenient form in which to use these matrices so that I can run forward propagation and back propagation to compute my derivatives, and to compute my cost function j of theta." "And finally, I can then take my derivatives and unroll them, to keeping the elements in the same ordering as I did when I unroll my thetas." "But I'm gonna unroll D1, D2," "D3, to get gradientVec which is now what my cost function can return." "It can return a vector of these derivatives." "So, hopefully, you now have a good sense of how to convert back and forth between the matrix representation of the parameters versus the vector representation of the parameters." 
"The advantage of the matrix representation is that when your parameters are stored as matrices it's more convenient when you're doing forward propagation and back propagation and it's easier when your parameters are stored as matrices to take advantage" "of the, sort of, vectorized implementations." "Whereas in contrast the advantage of the vector representation, when you have like thetaVec or DVec is that when you are using the advanced optimization algorithms." "Those algorithms tend to assume that you have all of your parameters unrolled into a big long vector." "And so with what we just went through, hopefully you can now quickly convert between the two as needed." "In the last few videos, we talked about how to do forward-propagation and back-propagation in a neural network in order to compute derivatives." "But back prop as an algorithm has a lot of details and, you know, can be a little bit tricky to implement." "And one unfortunate property is that there are many ways to have subtle bugs in back prop so that if you run it with gradient descent or some other optimization algorithm, it could actually look like it's working." "And, you know, your cost function J of theta may end up decreasing on every iteration of gradient descent, but this could pull through even though there might be some bug in your implementation of back prop." "So it looks like J of theta is decreasing, but you might just wind up with a neural network that has a higher level of error than you would with a bug-free implementation and you might just not know that there" "was this subtle bug that's giving you this performance." "So what can we do about this?" "There's an idea called gradient checking that eliminates almost all of these problems." "So today, every time I implement back propagation or a similar gradient descent algorithm on the neural network or any other reasonably complex model, I always implement gradient checking." "And if you do this it will help you make sure and sort of gain high confidence that your implementation of forward prop and back prop or whatever, is 100% correct." "And in what I've seen this pretty much all the problems associated with sort of a buggy implementation of the background." "And in the previous videos," "I sort of ask you to take on faith that the formulas I gave for computing the deltas, and the" "D's, and so on, I ask you to take on faith that those actually do compute the gradients of the cost function, but once you implement numerical gradient checking, which is the topic of this video," "you'll be able to verify for yourself that the code you're writing is indeed computing" "the derivative of the cost function J. So here's the idea." "Consider the following example." "Suppose I have the function" "J of theta, and I have some value, theta, and for this example, I'm going to assume that theta is just a real number." "And let's say I want to estimate the derivative of this function at this point." "And so the derivative is, you know, equal to the slope of that sort of tangent line." "Here's how I'm going to numerically approximate the derivative, or rather here's a procedure for numerically approximating the derivative:" "I'm going to compute theta plus epsilon, so value a little bit to the right." "And we are going to compute theta minus epsilon." "And I'm going to look at those two points and connect them by a straight line." 
"And I'm going to connect these two points by a straight line and I'm going to use the slope of that little red line as my approximation to the derivative, which is the true derivative is the slope of the blue line over there." "So, you know, it seems like it would be a pretty good approximation." "Mathematically, the slope of this red line is this vertical height, divided by this horizontal width, so this point on top is J of" "theta plus epsilon." "This point here is J of theta minus epsilon." "So this vertical difference is j of theta plus epsilon, minus J of theta, minus epsilon, and this horizontal distance is just 2 epsilon." "So, my approximation is going to be that the derivative," "with respect to theta of J of theta--add this value of theta--that that's approximately J of theta plus epsilon, minus" "J of theta, minus epsilon, over 2 epsilon." "Usually, I use a pretty small value for epsilon and set epsilon to be maybe on the order of 10 to the minus 4." "There's usually a large range of different values for epsilon that work just fine." "And in fact, if you let epsilon become really small then, mathematically, this term here actually, mathematically, you know, becomes the derivative, becomes exactly the slope of the function at this point." "It's just that we don't want to use epsilon that's too, too small because then you might run into numerical problems." "So, you know, I usually use epsilon around 10 to the minus 4, say." "And by the way some of you may have seen it alternative formula for estimating the derivative which is this formula." "This one on the right is called the one-sided difference." "Whereas, the formula on the left that's called a two-sided difference." "The two-sided difference gives us a slightly more accurate estimate, so I usually use that rather than just this one-sided difference estimate." "So, concretely, what you implement in Octave is you implement the following." "You implement call to compute, gradApprox which is going to be approximation to zero relative as just, you know, this formula:" "J of theta plus epsilon, minus J of theta, minus epsilon, divided by two times epsilon." "And this will give you a numerical estimate of the gradient at that point." "And in this example it seems like it's a pretty good estimate." "Now, on the previous slide, we consider the case of when theta was a real number." "Now, let's look at the more general case of where theta is a vector parameter." "So let's say theta is an" "Rn, and it might be unreal version of the parameters of our neural network." "So theta is a vector that has n elements, theta 1 up to theta n." "We can then use a similar idea to approximate all of the partial derivative terms." "Concretely, the partial derivative of a cost function with respect to the first parameter theta" "1, that can be obtained by taking J and increasing theta 1." "So you have J of theta 1 plus epsilon, and so on minus J of this theta" "1 minus epsilon and divide it by 2 epsilon." "The partial derivative respect to the second parameter theta 2, is again this thing, except you're taking J of, here you're increasing theta 2 by epsilon." "And here you're decreasing theta 2 by epsilon." "And so on down to the derivative with respect to theta n." "Would be if you increase and decrease theta n by epsilon over there." "So, these equations give you a way to numerically approximate" "the partial derivative of "J"" "with respect to any one of your parameters they derive." "Concretely, what you implement is therefore, the following." 
"We implement the following in Octave to numerically compute the derivatives." "We say for i equals 1 through n where n is the dimension of our parameter vector theta." "And I usually do this with the unrolled version of the parameters." "So you know theta is just a long list of all of my parameters in my neural networks." "I'm going to set theta plus equals theta, then increase theta plus the ith element by epsilon." "And so this is basically theta plus is equal to theta except for theta plus i, which is now incremented by epsilon." "So if theta plus is equal to, right, theta" "1, theta 2, and so on and then theta i has epsilon added to it, and then it go down to theta n." "So this is what theta plus is." "And similarly these two lines set theta minus to something similar except that this, instead of theta i plus epsilon, this now becomes theta i minus epsilon." "And then finally, you implement this gradApprox i, and this will give you your approximation to the partial derivative with respect to theta i of J of theta." "And the way we use this in our neural network implementation is we would implement this, implement this" "FOR loop to compute, you know, the top partial derivative of the cost function with respect to every parameter in our network." "And we can then take the gradient that we got from back prop." "So DVec was the derivatives we got from back prop." "Right, so back prop, back-propagation was a relatively efficient way to compute the derivatives or the partial derivatives of a cost function with respect to all of our parameters." "And what I usually do is then take my numerically computed derivative, that is this gradApprox that we just had from up here and make sure that that is equal or approximately equal up to, you know, small values" "of numerical round off that is pretty close to the DVec that I got from back prop." "And if these two ways of computing the derivative give me the same answer or at least give me very similar answers, you know, up to a few decimal places." "Then I'm much more confident that my implementation of back prop is correct." "And when I plug these DVec vectors into gradient descent or some advanced optimization algorithm," "I can then be much more confident that I'm computing the derivatives correctly and therefore, that hopefully my codes will run correctly and do a good job optimizing J of theta." "Finally, I want to put everything together and tell you how to implement this numerical gradient checking." "Here's what I usually do." "First thing I do, is implement back-propagation to compute defects." "So, this is a procedure we talked about in an earlier video to compute DVec which may be our unrolled version of these matrices." "Then what I do, is implement a numerical gradient checking to compute gradApprox." "So this is what I described earlier in this video, in the previous slide." "Then you should make sure that DVec and gradApprox gives similar values, you know, let's say up to a few decimal places." "And finally, and this the important step, the more you start to use your code for learning, for seriously training your network, it is important to turn off gradient checking." "And to no longer compute this gradApprox thing using the numerical derivative formulas that we talked about earlier in this" "video." "And the reason for that is the numeric code gradient checking code, the stuff we talked about in this video, that's a very computationally expensive, that's a very slow way to try to approximate the derivative." 
"Whereas in contrast, the back-propagation algorithm that we talked about earlier, that is the thing that we talked about earlier for computing, you know, D1, D2," "D3, or for DVec." "Back prop is a much more computationally efficient way of computing the derivatives." "So once you've verified that your implementation of back-propagation is correct, you should turn off gradient checking, and just stop using that." "So just to reiterate, you should be sure to disable your gradient checking code before running your algorithm for many iterations of gradient descent, or for many iterations of the advanced optimization algorithms in order to train your classifier." "Concretely, if you were to run numerical gradient checking on every single integration of gradient descent, or if you were in the inner loop of your cost function, then your code will be very slow." "Because the numerical gradient checking code is much slower than the back-propagation algorithm, than a back-propagation method where you remember we were computing delta" "4, delta 3, delta 2, and so on." "That was the back-propagation algorithm." "That is a much faster way to compute derivatives than gradient checking." "So when you're ready, once you verify the implementation of back-propagation is correct, make sure you turn off, or you disable your gradient checking code while you train your algorithm, or else your code could run very slowly." "So that's how you take gradients numerically." "And that's how you can verify that your implementation of back-propagation is correct." "Whenever I implement back-propagation or similar gradient descent algorithm for a complicated model, I always use gradient checking." "This really helps me make sure that my code is correct." "In the previous videos, we put together almost all the pieces you need in order to implement and train in your network." "There's just one last idea I need to share with you, which is the idea of random initialization." "When you're running an algorithm like gradient descent or also the advanced optimization algorithms, we need to pick some initial value for the parameters theta." "So for the advanced optimization algorithm, you know, it assumes that you will pass it some initial value for the parameters theta." "Now let's consider gradient descent." "For that, you know, we also need to initialize theta to something." "And then we can slowly take steps go downhill, using graded descent, to go downhill to minimize the function J of theta." "So what do we set the initial value of theta to?" "Is it possible to set the initial value of theta to the vector of all zeroes." "Whereas this worked okay when we were using logistic regression." "Initializing all of your parameters to zero actually does not work when you're trading a neural network." "Consider training the following neural network." "And let's say we initialized all of the parameters in the network to zero." "And if you do that then what that means is that at the initialization this blue weight, that I'm covering blue" "is going to equal to that weight." "So, they're both zero." "And this weight that I'm covering in in red, is equal to that weight." "Which I'm covering it in red." "And also this weight, well which I'm covering it in green is going to be equal to the value of that weight." "And what that means is that both of your hidden units: a1 and a2 are going to be computing the same function of your inputs." "And thus, you end up with for everyone of your training your examples." 
"You end up with a(2)1 equals a(2)2." "and moreover because, I'm not going to show this too much detail, but because these out going weights are the same you can also show that the delta values are also going to be the same." "So concretely, you end up with delta 1 1, delta 2 1, equals delta 2 2." "And if you work through the map further, what you can show is that the partial derivatives" "with respect to your parameters will satisfy the following." "That the partial derivative of the cost function with respect to writing out the derivatives respect to these two blue weights neural network." "You'll find that these two partial derivatives are going to be equal to each other." "And so, what this means, is that even after say, one gradient descent update." "You're going to update, say this first blue weight with, you know, learning rate times this." "And you're going to update the second blue weight to a sum learning rate times this." "But what this means is that even after one gradient descent update, those two blue weights, those two blue color parameters will end up the same as each other." "So they'll be some non-zero value now, but this value will be equal to that value." "And similarly, even after one gradient descent update." "This value will equal to that value." "There will be some non-zero values." "Just that the two red values will be equal to each other." "And similarly the two green weights, they'll both change values but they'll both end up the same value as each other." "So after each update, the parameters corresponding to the inputs going to each of the two hidden units identical." "That's just saying that the two green weights must be sustained, the two red weights must be sustained, the two blue weights are still the same and what that means is that even after one iteration of say, gradient" "descent, you find that your two hidden units are still computing exactly the same function that the input." "So you still have this a(1)2 equals a(2)2." "And so you're back to this case." "And as keep running gradient descent." "The blue weights, the two blue weights will stay the same as each other." "The two red weights will stay the same as each other." "The two green weights will stay the same as each other." "And what this means is that your neural network really can't compute very interesting functions." "Imagine that you had not only two hidden units but imagine that you had many many hidden units." "Then what this is saying is that all of your hidden units are computing the exact same feature, all of your hidden units are computing all of the exact same function of the input." "And this is a highly redundant representation." "Because that means that your final logistic regression unit, you know, really only gets to see one feature." "Because all of these are the same and this prevents your neural network from learning something interesting." "In order to get around this problem, the way we initialize the parameters of a neural network therefore, is with random initialization." "Concretely, the problem we saw on the previous slide is sometimes called the problem of symmetric weights, that is if the weights all being the same." "And so this random initialization is how we perform symmetry breaking." "So what we do is we initialize each value of theta to a random number between minus epsilon and epsilon." "So this is a notation to mean numbers between minus epsilon and plus epsilon." 
"So my weights on my parameters are all going to be randomly initialized between minus epsilon and plus epsilon." "The way I write code to do this in octave, this I've said you know theta 1 to be equal to this." "So this rand 10 by 11." "That's how you compute a random 10 by 11" "dimensional matrix, and all of the values are between 0 and 1." "So these are going to be real numbers that take on any continuous values between 0 and 1." "And so, if you take a number between 0 and" "1, multiply it by 2 times an epsilon, and minus an epsilon, then you end up with a number that's between minus epsilon and plus epsilon." "And incidentally, this epsilon here has nothing to do with the epsilon that we were using when we were doing gradient checking." "So when we were doing numerical gradient checking, there we were adding some values of epsilon to theta." "This is, you know, an unrelated value of epsilon." "Which is why I am denoting in it epsilon, just to distinguish it from the value of epsilon we were using in gradient checking." "Absolutely, if you want to initialize theta 2 to a random 1 by" "11 matrix, you can do so using this piece of code here." "So, to summarize, to train a neural network, what you should do is randomly initialize the weights to, you know, small values close to 0, between minus epsilon and plus epsilon, say, and then implement" "back-propagation; do gradient checking;" "and use either gradient descent or one of the advanced optimization algorithms to try to minimize J of theta as a function of the parameters theta starting from just randomly chosen initial value for the parameters." "And by doing symmetry breaking, which is this process." "Hopefully, gradient descent or the advanced optimization algorithms will be able to find a good value of theta." "So, it's taken us a lot of videos to get through the neural network learning algorithm." "In this video, what I'd like to do is try to put all the pieces together, to give a overall summary or a bigger picture view, of how all the pieces fit together and of the overall process of how" "to implement a neural network learning algorithm." "When training a neural network, the first thing you need to do is pick some network architecture and by architecture I just mean connectivity pattern between the neurons." "So, you know, we might choose between say, a neural network with three input units and five hidden units and four output units versus one of 3, 5 hidden, 5 hidden, 4 output and here are 3, 5," "5, 5 units in each of three hidden layers and four open units, and so these choices of how many hidden units in each layer and how many hidden layers, those are architecture choices." "So, how do you make these choices?" "Well first, the number of input units well that's pretty well defined." "And once you decides on the fix set of features x the number of input units will just be, you know, the dimension of your features x(i) would be determined by that." "And if you are doing multiclass classifications the number of output of this will be determined by the number of classes in your classification problem." "And just a reminder if you have a multiclass classification where y takes on say values between" "1 and 10, so that you have ten possible classes." "Then remember to right, your output y as these were the vectors." "So instead of clause one, you recode it as a vector like that, or for the second class you recode it as a vector like that." 
"So if one of these apples takes on the fifth class, you know, y equals 5, then what you're showing to your neural network is not actually a value of y equals 5, instead here at the upper layer which would" "have ten output units, you will instead feed to the vector which you know" "with one in the fifth position and a bunch of zeros down here." "So the choice of number of input units and number of output units is maybe somewhat reasonably straightforward." "And as for the number of hidden units and the number of hidden layers, a reasonable default is to use a single hidden layer and so this type of neural network shown on the left with just one hidden layer is probably the most common." "Or if you use more than one hidden layer, again the reasonable default will be to have the same number of hidden units in every single layer." "So here we have two hidden layers and each of these hidden layers have the same number five of hidden units and here we have, you know, three hidden layers and each of them has the same number, that is five hidden units." "Rather than doing this sort of network architecture on the left would be a perfect ably reasonable default." "And as for the number of hidden units - usually, the more hidden units the better;" "it's just that if you have a lot of hidden units, it can become more computationally expensive, but very often, having more hidden units is a good thing." "And usually the number of hidden units in each layer will be maybe comparable to the dimension of x, comparable to the number of features, or it could be any where from same number of hidden units of input features to" "maybe so that three or four times of that." "So having the number of hidden units is comparable." "You know, several times, or some what bigger than the number of input features is often a useful thing to do So, hopefully this gives you one reasonable set of default choices for neural architecture and and if you follow these guidelines, you" "will probably get something that works well, but in a later set of videos where" "I will talk specifically about advice for how to apply algorithms, I will actually say a lot more about how to choose a neural network architecture." "Or actually have quite a lot I want to say later to make good choices for the number of hidden units, the number of hidden layers, and so on." "Next, here's what we need to implement in order to trade in neural network, there are actually six steps that I have;" "I have four on this slide and two more steps on the next slide." "First step is to set up the neural network and to randomly initialize the values of the weights." "And we usually initialize the weights to small values near zero." "Then we implement forward propagation so that we can input any excellent neural network and compute h of x which is this output vector of the y values." "We then also implement code to compute this cost function j of theta." "And next we implement back-prop, or the back-propagation" "algorithm, to compute these partial derivatives terms, partial derivatives of j of theta with respect to the parameters." "Concretely, to implement back prop." "Usually we will do that with a fore loop over the training examples." 
"Some of you may have heard of advanced, and frankly very advanced factorization methods where you don't have a four-loop over the m-training examples, that the first time you're implementing back prop there should almost certainly the four" "loop in your code, where you're iterating over the examples, you know, x1, y1, then so you do forward prop and back prop on the first example, and then in the second iteration of the" "four-loop, you do forward propagation and back propagation on the second example, and so on." "Until you get through the final example." "So there should be a four-loop in your implementation of back prop, at least the first time implementing it." "And then there are frankly somewhat complicated ways to do this without a four-loop, but" "I definitely do not recommend trying to do that much more complicated version the first time you try to implement back prop." "So concretely, we have a four-loop over my m-training examples" "and inside the four-loop we're going to perform fore prop and back prop using just this one example." "And what that means is that we're going to take x(i), and feed that to my input layer, perform forward-prop, perform back-prop and that will if all of these activations and all of these delta terms for all" "of the layers of all my units in the neural network then still inside this four-loop, let me draw some curly braces just to show the scope with the four-loop, this is in" "octave code of course, but it's more a sequence Java code, and a four-loop encompasses all this." "We're going to compute those delta terms, which are is the formula that we gave earlier." "Plus, you know, delta l plus one times a, I transpose of the code." "And then finally, outside the having computed these delta terms, these accumulation terms, we would then have some other code and then that will allow us to compute these partial derivative terms." "Right and these partial derivative terms have to take into account the regularization term lambda as well." "And so, those formulas were given in the earlier video." "So, how do you done that you now hopefully have code to compute these partial derivative terms." "Next is step five, what I do is then use gradient checking to compare these partial derivative terms that were computed." "So, I've compared the versions computed using back propagation versus the partial derivatives computed using the numerical" "estimates as using numerical estimates of the derivatives." "So, I do gradient checking to make sure that both of these give you very similar values." "Having done gradient checking just now reassures us that our implementation of back propagation is correct, and is then very important that we disable gradient checking, because the gradient checking code is computationally very slow." "And finally, we then use an optimization algorithm such as gradient descent, or one of the advanced optimization methods such as LB of GS, contract gradient has embodied into fminunc or other optimization methods." "We use these together with back propagation, so back propagation is the thing that computes these partial derivatives for us." "And so, we know how to compute the cost function, we know how to compute the partial derivatives using back propagation, so we can use one of these optimization methods to try to minimize j of theta as a function of the parameters theta." 
"And by the way, for neural networks, this cost function j of theta is non-convex, or is not convex and so it can theoretically be susceptible to local minima, and in fact algorithms like gradient descent and" "the advance optimization methods can, in theory, get stuck in local" "optima, but it turns out that in practice this is not usually a huge problem and even though we can't guarantee that these algorithms will find a global optimum, usually algorithms like gradient descent will do a very good job minimizing this" "cost function j of theta and get a very good local minimum, even if it doesn't get to the global optimum." "Finally, gradient descents for a neural network might still seem a little bit magical." "So, let me just show one more figure to try to get that intuition about what gradient descent for a neural network is doing." "This was actually similar to the figure that I was using earlier to explain gradient descent." "So, we have some cost function, and we have a number of parameters in our neural network." "Right here I've just written down two of the parameter values." "In reality, of course, in the neural network, we can have lots of parameters with these." "Theta one, theta two--all of these are matrices, right?" "So we can have very high dimensional parameters but because of the limitations the source of parts we can draw." "I'm pretending that we have only two parameters in this neural network." "Although obviously we have a lot more in practice." "Now, this cost function j of theta measures how well the neural network fits the training data." "So, if you take a point like this one, down here," "that's a point where j of theta is pretty low, and so this corresponds to a setting of the parameters." "There's a setting of the parameters theta, where, you know, for most of the training examples, the output" "of my hypothesis, that may be pretty close to y(i) and if this is true than that's what causes my cost function to be pretty low." "Whereas in contrast, if you were to take a value like that, a point like that corresponds to, where for many training examples, the output of my neural network is far from the actual value y(i)" "that was observed in the training set." "So points like this on the line correspond to where the hypothesis, where the neural network is outputting values on the training set that are far from y(i)." "So, it's not fitting the training set well, whereas points like this with low values of the cost function corresponds to where j of theta is low, and therefore corresponds to where the neural network happens to be fitting my training set" "well, because I mean this is what's needed to be true in order for j of theta to be small." "So what gradient descent does is we'll start from some random initial point like that one over there, and it will repeatedly go downhill." "And so what back propagation is doing is computing the direction of the gradient, and what gradient descent is doing is it's taking little steps downhill until hopefully it gets to, in this case, a pretty good local optimum." "So, when you implement back propagation and use gradient descent or one of the advanced optimization methods, this picture sort of explains what the algorithm is doing." "It's trying to find a value of the parameters where the output values in the neural network closely matches the values of the y(i)'s observed in your training set." "So, hopefully this gives you a better sense of how the many different pieces of neural network learning fit together." 
"In case even after this video, in case you still feel like there are, like, a lot of different pieces and it's not entirely clear what some of them do or how all of these pieces come together, that's actually okay." "Neural network learning and back propagation is a complicated algorithm." "And even though I've seen the math behind back propagation for many years and I've used back propagation, I think very successfully, for many years, even today I still feel like I don't always have a great" "grasp of exactly what back propagation is doing sometimes." "And what the optimization process looks like of minimizing j if theta." "Much this is a much harder algorithm to feel like I have a much less good handle on exactly what this is doing compared to say, linear regression or logistic regression." "Which were mathematically and conceptually much simpler and much cleaner algorithms." "But so in case if you feel the same way, you know, that's actually perfectly okay, but if you do implement back propagation, hopefully what you find is that this is one of the most powerful" "learning algorithms and if you implement this algorithm, implement back propagation, implement one of these optimization methods, you find that back propagation will be able to fit very complex, powerful, non-linear functions to your data," "and this is one of the most effective learning algorithms we have today." "In this video, I'd like to show you a fun and historically important example of Neural Network Learning." "Of using a Neural Network for autonomous driving that is getting a car to learn to drive itself." "The video that I showed a minute, was something that I've gotten from Dean Pomilieu, who Colleague who works out in Carnegie Mellon University out on the east coast of the United States," "and in part of the video you see visualizations like this, and what I should tell you what the visualization looks like before starting to video." "Down here on the lower left is the view seen by the car of what's in front of it and so here you know, you will kind of see you know, a road that's maybe going a bit to" "the left and going a little bit to the right, and up here on top, this first horizontal bar shows the direction selected by the human driver and is the location of this bright white band that shows the" "steering direction selected by the human driver, where, you know, here, far to the left corresponds to steering hard left;" "here corresponds to steering hard to the right; and so this location, which is a little bit to the left, a little bit left of center, means that the human driver, at this point, was" "steering slightly to the left." "A nd this second part here corresponds to the steering direction selected by the learning algorithm; and again, the location of this sort of white band, means the neural network was here, selecting a steering direction just slightly to" "the left and in fact, before the neural network starts learning initially, you see that the network outputs a grey band, like a grey uniform, grey band throughout this region, so the uniform grey fuzz corresponds to the" "neural network having been randomly initialized, and initially having no idea how to drive the car, or initially having no idea what direction to steer in." "And it's only after it's learned for a while that it will then start to output like a solid white band in just a small part of the region corresponding to choosing a particular steering direction." "And that corresponds to when a neural network." 
"Becomes more confident in selecting, you know, a and in one location rather than outputting a sort of light gray fuzz, but instead outputting a white band that's more constantly selecting one steering direction." "Alban is a system of artificial neural networks, that learns to steer by watching a person drive." "Alban is designed to control the tube a modified army" "Humvee who could put sensors, computers and actuators for autonomous navigation experiments." "The initial spec in configuring Alban is training in" "the training the person drives to be a car while Alban watches." "Once every two seconds, Alban digitizes a video image of the road ahead, and records the person's steering direction." "This training image is reduced in resolution to 30 by" "32 pixels and provided as input to Alban's three-layer" "network." "Using the back propagation learning algorithm;" "Alban is training to output the same steering direction as the human driver for that image" "Initially, the network's steering response is random." "After about two minutes of training, the network learns to accurately imitate the steering reactions of the" "human driver." "This same training procedure is repeated for other road types." "After the networks have been trained the operator pushes the run switch and often begins driving. 12 times per second, Alban digitizes an image and feeds it to its neural networks." "Each network, running in parallel, produces a steering direction and a measure of it's" "confidence in its response." "The steering direction from the most confident network." "In this case, the network trained for the one-lane road is used to control the vehicle." "Suddenly, an intersection appears ahead of the vehicle." "As the vehicle approaches the intersection, the confidence of the one-lane network decreases." "As it crosses the intersection, and the two-lane road ahead comes into view, the confidence of the two-lane network rises." "When it's confidence rises, the two-lane network is selected to steer, safely guiding the vehicle into it's lane, on the two-lane road." "So that was autonomous driving using a neural network." "Of course, there are more recently more modern attempts to do autonomous driving in a few properties, in the U.S., in Europe, and so on." "They're giving more robust driving controllers than this, but I think it's still pretty remarkable and pretty amazing how a simple neural network trained with back-propagation can, you know, actually learn to drive a car somewhat well." "By now you have seen a lot of different learning algorithms." "And if you've been following along these videos you should consider yourself an expert on many state-of-the-art machine learning techniques." "But even among people that know a certain learning algorithm." "There's often a huge difference between someone that really knows how to powerfully and effectively apply that algorithm, versus someone that's less familiar with some of the material that I'm about to teach and who doesn't really" "understand how to apply these algorithms and can end up wasting a lot of their time trying things out that don't really make sense." "What I would like to do is make sure that if you are developing machine learning systems, that you know how to choose one of the most promising avenues to spend your time pursuing." "And on this and the next few videos I'm going to give a number of practical suggestions, advice, guidelines on how to do that." 
"And concretely what we'd focus on is the problem of, suppose you are developing a machine learning system or trying to improve the performance of a machine learning system, how do you go about deciding what are the proxy avenues to try" "next?" "To explain this, let's continue using our example of learning to predict housing prices." "And let's say you've implement and regularize linear regression." "Thus minimizing that cost function j." "Now suppose that after you take your learn parameters, if you test your hypothesis on the new set of houses, suppose you find that this is making huge errors in this prediction of the housing prices." "The question is what should you then try mixing in order to improve the learning algorithm?" "There are many things that one can think of that could improve the performance of the learning algorithm." "One thing they could try, is to get more training examples." "And concretely, you can imagine, maybe, you know, setting up phone surveys, going door to door, to try to get more data on how much different houses sell for." "And the sad thing is I've seen a lot of people spend a lot of time collecting more training examples, thinking oh, if we have twice as much or ten times as much training data, that is certainly going to help, right?" "But sometimes getting more training data doesn't actually help and in the next few videos we will see why, and we will see how you can avoid spending a lot of time collecting more training data in settings where it is just not going to help." "Other things you might try are to well maybe try a smaller set of features." "So if you have some set of features such as x1, x2, x3 and so on, maybe a large number of features." "Maybe you want to spend time carefully selecting some small subset of them to prevent overfitting." "Or maybe you need to get additional features." "Maybe the current set of features aren't informative enough and you want to collect more data in the sense of getting more features." "And once again this is the sort of project that can scale up the huge projects can you imagine getting phone surveys to find out more houses, or extra land surveys to find out more about the pieces of land and so on, so a huge project." "And once again it would be nice to know in advance if this is going to help before we spend a lot of time doing something like this." "We can also try adding polynomial features things like x2 square x2 square and product features x1, x2." "We can still spend quite a lot of time thinking about that and we can also try other things like decreasing lambda, the regularization parameter or increasing lambda." "Given a menu of options like these, some of which can easily scale up to six month or longer projects." "Unfortunately, the most common method that people use to pick one of these is to go by gut feeling." "In which what many people will do is sort of randomly pick one of these options and maybe say, "Oh, lets go and get more training data."" "And easily spend six months collecting more training data or maybe someone else would rather be saying, "Well, let's go collect a lot more features on these houses in our data set."" "And I have a lot of times, sadly seen people spend, you know, literally 6 months doing one of these avenues that they have sort of at random only to discover six months later that that really wasn't a promising avenue to pursue." 
"Fortunately, there is a pretty simple technique that can let you very quickly rule out half of the things on this list as being potentially promising things to pursue." "And there is a very simple technique, that if you run, can easily rule out many of these options," "and potentially save you a lot of time pursuing something that's just is not going to work." "In the next two videos after this, I'm going to first talk about how to evaluate learning algorithms." "And in the next few videos after that, I'm going to talk about these techniques," "which are called the machine learning diagnostics." "And what a diagnostic is, is a test you can run, to get insight into what is or isn't working with an algorithm, and which will often give you insight as to what are promising things to try" "to improve a learning algorithm's performance." "We'll talk about specific diagnostics later in this video sequence." "But I should mention in advance that diagnostics can take time to implement and can sometimes, you know, take quite a lot of time to implement and understand but doing so can be a very good use of your time when you are" "developing learning algorithms because they can often save you from spending many months pursuing an avenue that you could have found out much earlier just was not going to be fruitful." "So in the next few videos, I'm going to first talk about how evaluate your learning algorithms and after that I'm going to talk about some of these diagnostics which will hopefully let you much more effectively select more of the" "useful things to try mixing if your goal to improve the machine learning system." "In the last couple videos, we talked about the ideas of how, first, if you're given features for movies, you can use that to learn parameters data for users." "And second, if you're given parameters for the users, you can use that to learn features for the movies." "In this video we're going to take those ideas and put them together to come up with a collaborative filtering algorithm." "So one of the things we worked out earlier is that if you have features for the movies then you can solve this minimization problem to find the parameters theta for your users." "And then we also worked that out, if you are given the parameters theta, you can also use that to estimate the features x, and you can do that by solving this minimization problem." "So one thing you could do is actually go back and forth." "Maybe randomly initialize the parameters and then solve for theta, solve for x, solve for theta, solve for x." "But, it turns out that there is a more efficient algorithm that doesn't need to go back and forth between the x's and the thetas, but that can solve for theta and x simultaneously." "And here it is." "What we are going to do, is basically take both of these optimization objectives, and put them into the same objective." "So I'm going to define the new optimization objective j, which is a cost function, that is a function of my features x and a function of my parameters theta." "And, it's basically the two optimization objectives" "I had on top, but I put together." "So, in order to explain this, first, I want to point out that this term over here, this squared error term, is the same as this squared error term and the summations look a little bit" "different, but let's see what the summations are really doing." "The first summation is sum over all users J and then sum over all movies rated by that user." 
"So, this is really summing over all pairs IJ, that correspond to a movie that was rated by a user." "Sum over J says, for every user, the sum of all the movies rated by that user." "This summation down here, just does things in the opposite order." "This says for every movie" "I, sum over all the users J that have rated that movie and so, you know these summations, both of these are just summations over all pairs ij for which r of i J is equal to 1." "It's just something over all the user movie pairs for which you have a rating." "and so those two terms up there is just exactly this first term, and" "I've just written the summation here explicitly, where I'm just saying the sum of all pairs IJ, such that" "RIJ is equal to 1." "So what we're going to do is define a combined optimization objective that we want to minimize in order to solve simultaneously for x and theta." "And then the other terms in the optimization objective are this, which is a regularization in terms of theta." "So that came down here and the final piece is this term which is my optimization objective for the x's and that became this." "And this optimization objective j actually has an interesting property that if you were to hold the x's constant and just minimize with respect to the thetas then you'd be solving exactly this problem, whereas if you were to do" "the opposite, if you were to hold the thetas constant, and minimize j only with respect to the x's, then it becomes equivalent to this." "Because either this term or this term is constant if you're minimizing only the respective x's or only respective thetas." "So here's an optimization objective that puts together my cost functions in terms of x and in terms of theta." "And in order to come up with just one optimization problem, what we're going to do, is treat this cost function, as a function of my features x and of my user pro user parameters data and" "just minimize this whole thing, as a function of both the" "Xs and a function of the thetas." "And really the only difference between this and the older algorithm is that, instead of going back and forth, previously we talked about minimizing with respect to theta then minimizing with respect to x, whereas minimizing with respect to theta," "minimizing with respect to x and so on." "In this new version instead of sequentially going between the" "2 sets of parameters x and theta, what we are going to do is just minimize with respect to both sets of parameters simultaneously." "Finally one last detail is that when we're learning the features this way." "Previously we have been using this convention that we have a feature x0 equals one that corresponds to an interceptor." "When we are using this sort of formalism where we're are actually learning the features, we are actually going to do away with this convention." "And so the features we are going to learn x, will be in Rn." "Whereas previously we had features x and Rn + 1 including the intercept term." "By getting rid of x0 we now just have x in Rn." "And so similarly, because the parameters theta is in the same dimension, we now also have theta in RN because if there's no x0, then there's no need parameter theta 0 as well." "And the reason we do away with this convention is because we're now learning all the features, right?" "So there is no need to hard code the feature that is always equal to one." "Because if the algorithm really wants a feature that is always equal to 1, it can choose to learn one for itself." 
"So if the algorithm chooses, it can set the feature X1 equals 1." "So there's no need to hard code the feature of" "001, the algorithm now has the flexibility to just learn it by itself." "So, putting everything together, here is our collaborative filtering algorithm." "first we are going to initialize x and theta to small random values." "And this is a little bit like neural network training, where there we were also initializing all the parameters of a neural network to small random values." "Next we're then going to minimize the cost function using great intercepts or one of the advance optimization algorithms." "So, if you take derivatives you find that the great intercept like these and so this term here is the partial derivative of the cost function," "I'm not going to write that out, with respect to the feature value Xik and similarly this term here is also a partial derivative value of the cost function with respect to the parameter theta that we're minimizing." "And just as a reminder, in this formula that we no longer have this X0 equals" "1 and so we have that x is in Rn and theta is a Rn." "In this new formalism, we're regularizing every one of our perimeters theta, you know, every one of our parameters Xn." "There's no longer the special case theta zero, which was regularized differently, or which was not regularized compared to the parameters theta 1 down to theta." "So there is now no longer a theta 0, which is why in these updates, I did not break out a special case for k equals 0." "So we then use gradient descent to minimize the cost function j with respect to the features x and with respect to the parameters theta." "And finally, given a user, if a user has some parameters, theta, and if there's a movie with some sort of learned features x, we would then predict that that movie would be given a" "star rating by that user of theta transpose j." "Or just to fill those in, then we're saying that if user" "J has not yet rated movie I, then what we do is predict that user J is going to rate movie I according to theta J transpose Xi." "So that's the collaborative filtering algorithm and if you implement this algorithm you actually get a pretty decent algorithm that will simultaneously learn good features for hopefully all the movies as well as learn parameters for all the users and hopefully give" "pretty good predictions for how different users will rate different movies that they have not yet rated" "In this video, I would like to talk about how to evaluate a hypothesis that has been learned by your algorithm." "In later videos, we will build on this to talk about how to prevent in the problems of overfitting and underfitting as well." "When we fit the parameters of our learning algorithm we think about choosing the parameters to minimize the training error." "One might think that getting a really low value of training error might be a good thing, but we have already seen that just because a hypothesis has low training error, that doesn't mean it is necessarily a good hypothesis." "And we've already seen the example of how a hypothesis can overfit." "And therefore fail to generalize the new examples not in the training set." "So how do you tell if the hypothesis might be overfitting." "In this simple example we could plot the hypothesis h of x and just see what was going on." 
"But in general for problems with more features than just one feature, for problems with a large number of features like these it becomes hard or may be impossible to plot what the hypothesis looks like and so we need some other way to evaluate our hypothesis." "The standard way to evaluate a learned hypothesis is as follows." "Suppose we have a data set like this." "Here I have just shown 10 training examples, but of course usually we may have dozens or hundreds or maybe thousands of training examples." "In order to make sure we can evaluate our hypothesis, what we are going to do is split" "the data we have into two portions." "The first portion is going to be our usual training set and the second portion is going to be our test set," "and a pretty typical split of all the data we have into a training set and test set might be around say a 70%," "30% split." "Worth more today to grade the training set and relatively less to the test set." "And so now, if we have some data set, we run a sine of say" "70% of the data to be our training set where here "m" is as usual our number of training examples and the remainder of our data might then be assigned to become our test set." "And here, I'm going to use the notation m subscript test to denote the number of test examples." "And so in general, this subscript test is going to denote examples that come from a test set so that x1 subscript test, y1 subscript test is my first test example which I guess in this example might be this example over here." "So a fairly typical procedure for training and testing a learning algorithm, maybe linear regression, would be to first learn the parameter vector from the training data." "And here the training data is only that first 70% of the dataset." "Finally, one last detail whereas here I've drawn this as though the first 70% goes to the training set and the last 30% to the test set." "If there is any sort of ordinary to the data." "That should be better to send a random 70% of your data to the training set and a random 30% of your data to the test set." "So if your data were already randomly sorted, you could just take the first 70% and last 30% that if your data were not randomly ordered, it would be better to randomly shuffle or to randomly reorder the examples in your training set." "Before you know sending the first 70% in the training set and the last" "30% of the test set." "Here then is a fairly typical procedure for how you would train and test the learning algorithm and the learning regression." "First, you learn the parameters theta from the training set so you minimize the usual training error objective j of theta, where j of theta here was defined using that" "70% of all the data you have." "There is only the training data." "And then you would compute the test error." "And I am going to denote the test error as j subscript test." "And so what you do is take your parameter theta that you have learned from the training set, and plug it in here and compute your test set error." "Which I am going to write as follows." "So this is basically the average squared error as measured on your test set." "It's pretty much what you'd expect." "So if we run every test example through your hypothesis" "with parameter theta and just measure the error" "that your hypothesis has on your m subscript test, test examples." "And of course, this is the definition of the test set error if we are using linear regression and using the squared error metric." 
"How about if we were doing a classification problem and say using logistic regression instead." "In that case, the procedure for training and testing say logistic regression is pretty similar" "first we will do the parameters from the training data, that first 70% of the data." "And it will compute the test error as follows." "It's the same objective function as we always use but we just leave a ration, except that now is define using our m subscript test, test examples." "While this definition of the test set error test is perfectly reasonable." "Sometimes there is an alternative test sets metric that might be easier to interpret, and that's the misclassification error." "It's also called the zero one misclassification error, with zero one denoting that you either get an example right or you get an example wrong." "Here's what I mean." "Let me define the error of a prediction." "That is h of x." "And given the label y as equal to one if my hypothesis" "outputs the value greater than equal to five and Y is equal to zero or if my hypothesis outputs a value of less than 0.5 and y is equal to one, right, so both of these cases basic respond" "to if your hypothesis mislabeled the example assuming your threshold at an 0.5." "So either thought it was more likely to be 1, but it was actually 0, or your hypothesis stored was more likely to be" "0, but the label was actually 1." "And otherwise, we define this error function to be zero." "If your hypothesis basically classified the example y correctly." "We could then define the test error, using the misclassification error metric to be one of the m tests of sum from i equals one to m subscript test of the error of h of x(i) test comma y(i)." "And so that's just my way of writing out that this is exactly the fraction of the examples in my test set that my hypothesis has mislabeled." "And so that's the definition of the test set error using the misclassification error of the" "0, 1 misclassification metric." "So that's the standard technique for evaluating how good a learned hypothesis is." "In the next video, we will adapt these ideas to helping us do things like choose what features like the degree polynomial to use with the learning algorithm or choose the regularization parameter for learning algorithm." "Suppose you like to decide what degree of polynomial to fit to a data set, sort of what features to include to give you a learning algorithm." "Or suppose you'd like to choose the regularization parameter lambda for the learning algorithm." "How do you do that?" "These are called model selection problems." "And in our discussion of how to do this we'll talk about not just how to split your data into a train and test sets but how to split your data into what we'll discover is called the train validation and test sets." "We'll see in this video just what these things are and how to use them to do model selection." "We've already seen a lot of times the problem of overfitting, in which just because the learning algorithm fits a training set well, that doesn't mean there's a good hypothesis." "More generally, this is why the training set error is not a good predictor for how well the hypothesis will do on new examples." "Concretely, if you fit some set of parameters - theta" "0, theta 1, theta 2 and so on - to your training set then, the fact that your hypothesis does well in the training set, well, this doesn't mean much in terms of predicting how well your" "hypothesis will generalize to new examples not seen in the training set." 
"And the more general principal is that, once your parameters were fit to some set of data--maybe the training set, maybe something else--then the error of your hypothesis as measured on that same data set, such as the" "training error, that's unlikely to be a good estimate of your actual generalization error, that is, of how well the hypothesis will generalize to new examples." "Now let's consider the model selection problem." "Let's say you try to choose what degree polynomial to fit to data." "So, you should you choose a linear function, a quadratic function, a cubic function, all the way up to a 10th power polynomial?" "So it's as if there's one extra parameter in this algorithm, which I'm going to denote d, which is what degree of polynomial do you want to pick?" "So it is as if does this, in addition to the theta parameters it's as if there's one more parameter d that your trying to determine using your data cells." "the first option is d equals 1, which is for the linear function we can choose d equals 2, d equals 3, all the way up to d equals 10, so we would like to fit this extra sort of parameter," "which I am denoting by d, and concretely, let's say that you want to choose a model, that is choose a degree of polynomial choose one off these ten models, and fit that model" "and also get some estimate of how well your fitted hypothesis will generalize to new examples." "Here's one thing you could do: you could take your first model and minimize the training error and this would give you some parameter vector theta, and you can then take your second model, the quadratic function and" "for that your training set and this will give you some other parameters vector theta." "In order to distinguish between these different parameter vectors, I'm going to use a superscript 1, superscript 2 there where theta superscript" "1 just means the parameters" "I get by fitting this model to my training data, and theta superscript 2 just means the parameters I get by fitting this quadratic function to my training ata and so on." "And by fitting a cubic model I get parameters theta 3 up to, you know, say theta 10." "And one thing we could do is then take these parameters and look at the test set error." "So I can compute on my test set, j test of 1, j test of" "theta 2 and so on, j test of theta 3 and so on." "So I'm going to take each of my hypothesis with the corresponding and just measure the performance on the test set." "Now one thing I could do then is, in order to select one of these models, I could then see which model has the lowest test sets error, and lets just say for this example, that I" "ended up choosing the fifth order polynomial." "So this seems reasonable so far." "By now, lets say, I want to take my fit hypothesis, this fifth order model and let's say I want to ask how well this model generalized." "One thing I could do is look at how well my fifth order polynomial hypothesis, had done on my test set." "But the problem is this will not to be a fair estimate of how well my hypothesis generalizes." "And the reason is, what we've done is, we've fit this extra the parameter d, that is this degree of polynomial, and we'll fit that parameter d using the test set." "Namely, we chose the value of d that gave us the best possible performance on the test set, and so, the performance of my parameter vector theta five on the test set, that's likely to be to be an overly optimistic" "estimate of generalization error." "Right?" 
"So that because I have fit this parameter d to my test set, it is no longer fair to evaluate my hypothesis on this test set." "That's because I've fit my parameters to the test set." "I've chosen the degree d of polynomial using the test set." "And so our hypothesis is likely to do better on this test set than it would on new examples that it hasn't seen before and that's which is what we hear about." "So, just to reiterate on the previous slide we saw that if we fit some set of parameters, say theta" "0, theta 1, and so on, to some training set, then the performance of the fitted model on the training set is not predictive of how well the hypothesis we generalized the new examples; is because these" "parameters would fit to the training set." "So they are likely to do well on the training set, even if the parameters don't do well on other examples." "And in the procedure I've just described on this slide, we've just done the same thing and specifically what we did is we fit this parameter d to the test set." "And by having fit the parameter to the test set, this means that the performance of the hypothesis on that test set may not be a fair estimate of how well the hypothesis is likely to do on examples we haven't seen before." "To address this problem in a model selection setting, if we want to evaluate a hypothesis this is usually what we do instead." "Given the data set, instead of just splitting it into a train and test set, what we are going to do is instead split it into three pieces." "And the first piece is going to be called the training set as usual." "So you call this first part, the training set, and then were going to coddle the second piece of data, which is called the cross-validation set, and" "I'm going to abbreviate cross-validation" "CV, and the second piece of this data, I'm going to call the cross-validation set" "cross-validation, and I am going to abbreviate cross-validation as CV." "Sometimes it's also called the validation set, instead of cross-validation set." "And then the last part I am going to call my usual test set." "And the pretty typical ratio" "I wish to split these things; would be to send 60% of your data to your training set, maybe 20% to your cross-validation set, and 20% to your test set." "And these numbers can vary a little bit but this sort of ratio will be pretty typical." "And so our training set will now be only, maybe" "60% of the data, and our cross-validation set or our validation set will have some number of examples." "I'm going to denote that M subscript cv, so that's the number of cross-validation examples." "And as following our earlier notational convention, I'm going to use XiCV, YiCV." "Following our earlier notational convention" "I'm going to use" "XiCV, YiCV to denote the i cross-validation example." "And finally we also have a test set over here;" "with M subscript test, being the number of test examples." "So, now that we have defined the training validation or cross validation and test sets, we can also define the training error, cross validation error, and test error." "So here's my training error, and I'm just writing this as" "J j subscript train of theta." "This is pretty much the same thing." "It's usually the same thing as the j of theta that we'll be writing so far, this is just a training set error you guys measure on your training set." "And then j subscript cv is my cause validation error is pretty much what you'd expect." 
"Just select the training error, except measure it on the cross-validation data set, and here's my test set error, same as before." "So when theta with the model selection problem like this is, instead of using the test set to select the model, we're instead going to use validation set or the cross-validation set to select the model." "Concretely, we're going to first take our first hypothesis, take this first model and say, minimize the cos function, and this would give me some parameter vector theta for the linear model" "and as before I'm going to put the superscript 1 just to denote that this is a parameter for the linear model." "We do the same thing for the quadratic model, get some parameter vector theta 2, get some parameter vectors there 3, and so on down to, say, the tenth by the polynomial, and what we I'm going to do is, instead of" "testing these hypothesis on the test set, instead I'm going to test them on the cross-validation set." "I'm going to measure j subscript cv, to see how well each of these hypothesis do on my cross validation set and then" "I'm going to pick the hypothesis with the lowest cross-validation error." "So for this example, let's say for the sake of argument that it was my fourth order polynomial that had the lowest cross-validation error." "So in that case, I'm going to pick this fourth order polynomial model and finally what this means is that that parameter d, remember d was the degree of polynomial, right d equals 2, d equals 3," "up to d equals 10." "What we've done is we fit that parameter d, and we'll set d equals 4, and we did so using the cross-validation set." "And so this degree of polynomial, so the parameter is no longer fit to the test set." "And so we've now saved a way the test set and we can use the test set to measure or to estimate the generalization error of the model that was selected" "by this algorithm." "So, that was model selection and how you can take your data and split it into a train validation and test set, and use your cross validation data to select model and evaluate it on the test set." "One final note:" "I should say that in the machine learning as of this practice today, there are many people that will do that early thing that I talked about, and said that isn't such a good idea of selecting your model using the" "test set and they're using the same test set to report the error, as though selecting your degree of polynomial on the test set, and then reporting the error on the test set as though that were good estimate of generalization error." "That sort of practice is unfortunately many people do do it; and if you have a massive massive test set is maybe not a terrible thing to do, but" "most practitioners of machine learning tend to advise" "against that and is considered better practice to have separate training validations of test sets." "I'll just warn you that just sometimes people do you know, use the same data for the purpose of the validation set and for the purpose of the test sets." "You only have a training set and the test set and that's because that's good practice." "So, you will see some people do it but if possible I will recommend against doing." "If you run the learning algorithm and it doesn't do as well as you are hoping, almost all the time it will be because you have either a high bias problem or a high variance problem." "In other words they're either an underfitting problem or an overfitting problem." 
"And in this case it's very important to figure out which of these two problems is bias or variance or a bit of both that you actually have." "Because knowing which of these two things is happening would give a very strong indicator for whether the useful and promising ways to try to improve your algorithm." "In this video, I would like to delve more deeply into this bias and various issue and understand them better as well as figure out how to look at and evaluate knows whether or not we might have a bias problem or a variance problem." "Since this would be critical to figuring out how to improve the performance of learning algorithm that you implement." "So you've already seen this figure a few times, where if you fit two simple hypothesis, like a straight line that that underfits the data." "If you fit a two complex hypothesis, then that might fit the training set perfectly but overfit the data and this may be hypothesis of some intermediate level of complexity, of some, maybe degree two polynomials are not too low and not too high degree." "That's just right." "And gives you the best generalization error out of these options." "Now that we're armed with the notion of training and validation in test sets, we can understand the concepts of bias and variance a little bit better." "Concretely, let our training error and cross validation error be defined as in the previous videos, just say, the squared error, the average squared error as measured on the 20 sets or as measured on the cross validation set." "Now let's plot the following figure." "On the horizontal axis I am going to plot the degree of polynomial, so as I go the right" "I'm going to be fitting higher and higher order polynomials." "So, we'll do that for this figure, where maybe d equals 1, were going to be fitting very simple functions where as we are the right of this this may be d equals 4 or relatively may" "be even larger numbers." "I'm going to be fitting very complex high order polynomials that might fit the training set with much more complex functions" "whereas we're here on the right of the horizontal axis, I have much larger values of these of a much higher degree polynomial, and so here that is going to correspond to fitting much more complex functions to your" "training set." "Let's look at the training error and cause-validation error and plot them on this figure." "Let's start with the training error." "As we increase the degree of the polynomial, we're going to" "fit our training set better and better and so, if d equals 1 that ever rose to the high training error." "If we have a very high degree of polynomial, our training error is going to be really low." "Maybe even zero, because it will fit the training set really well." "And so as we increase of the greater polynomial we find typically that the training error decreases, so I'm going to write j subscript train of theta there, because our training error tends to decrease with the degree" "of the polynomial that we fit to the data." "Next, let's look at the cross validation error." "Often that matter, if we look at the test set error we'll get a pretty similar result as if we were to plot the" "cross validation error." "So, we know that if d equals 1, we're fitting a very simple function, and so we may be underfitting the training set, and so we're going to go very high cross-validation error." 
"If we fit, you know, an intermediate degree polynomial; we have a d equals 2 in our example in the previous slide, we are going to have a much lower cross-validation error, because we are just fitting, finding" "a much better fit to the data." "And conversely if d were too high, so if d took on say a value of four, then we're again overfitting and so we end up with a high value for cross-validation error." "So if you were to vary this smoothly and plot a curve you might end up with a curve like that, where that's Jcv of theta, and again if you plot j test of theta you get something very similar." "And so this sort of plot also helps us to better understand the notions of bias and variance." "Concretely, if you have a learning algorithm that's not performing as well as you wanted it to, how can you figure out if your learning algorithm is suffering." "Concretly, suppose you have applied a learning algorithm and it is not performing as well as your are hoping, so your cross-validation set error or your test set error is high." "How can we figure out if the learning algorithm is suffering from high bias or if it is suffering from high variance." "So the setting of a cross-validation error being high corresponds to either this regime or this regime." "So this regime on the left corresponds to a high bias problem, that is, if you are fitting an overly low order polynomial such as a plus one, when we really needed a higher order polynomial to fit the data." "Whereas in contrast, this regime corresponds to a high variance problem." "That is, if d--the degree of polynomial--was too large for the data set that we have." "And this figure gives us a clue for how to distinguish between these two cases." "Concretely, for the high bias case, that is, the case of under fitting, what we find is that both" "the cross validation error and the training error are going to be high." "So, if your algorithm is suffering from a bias problem," "the training set error would be high and you may find that the cross validation error will also be high." "It might be close, maybe just slightly higher then a training error." "And so, if you see this combination, that's a sign that your algorithm may be suffering from high bias." "In contrast; if your algorithm is suffering from high variance; then, if you look here, we'll notice that, J train, that is the training error, is going to be low." "That is, you're fitting the training set very well." "Whereas, your cross validation error, assuming that this say the squared error which we're trying to minimize." "Whereas in contrast; your error on a cross validation set or your cross function like cross validation set, will be much bigger than your training set error." "This double greater than sign, here, it means much bigger than, all right." "So, it's much greater than to multiply great to great." "So this is a double greater than sign, that is the map symbol for much greater than denoted by two greater than signs." "And so if you see this combination, then what you find." "And so if you see this combination of values, then that is a clue that your learning algorithm may be suffering from high variance and might be overfitting." "And the key that distinguishes these two cases is if you have a high bias problem your training set error will also be high as your hypothesis just not fitting the training set well." "And if you have a high variance problem, your training set error will usually be low, that is much lower than the cross validation error." 
"So, hopefully that gives you a somewhat better understanding of the two problems of bias and variance." "I still have a lot more to say about bias and variance in the next few videos." "But what we will see later; is that by diagnosing, whether a learning algorithm may be suffering from high bias or a high variance." "I'll show you even more details on how to do that in later videos." "We'll see that by figuring out whether a learning algorithm may be suffering from high bias or a combination of both that that would give us much better guidance for what might be promising things to try in order to improve the performance of the learning algorithm." "You've seen how regularization can help prevent overfitting, but how does it affect the bias and variance of a learning algorithm?" "In this video, I like to go deeper into the issue of bias and variance, and talk about how it interacts with, and is effected by, the regularization of your learning algorithm." "Suppose we fit a linear regression model with a very high order polynomial, but to prevent overfitting, we are going to use regularization as shown here." "Suppose we're fitting a high order polynomial like that shown here, but to prevent overfitting, we're going to use regularization, like that shown here, so we have this regularization term to try to keep the values of the parameters small." "And as usual, the regularization sums from j equals 1 to m rather than j equals 0 to m." "Let's consider three cases." "The first is the case of a very large value of the regularization parameter lambda, such as if lambda were equal to 10,000s of huge value." "In this case, all of these parameters, theta 1, theta 2, theta 3 and so on will be heavily penalized and so, what ends up with most of these parameter values being close to 0 and the hypothesis will be" "roughly h or x just equal or approximately equal to theta 0, and so we end up a hypothesis that more or less looks like that." "This is more or less a flat, constant straight line." "And so this hypothesis has high bias and a value underfits this data set." "So the horizontal straight line is just not a very good model for this data set." "At the other extreme beam is if we have a very small value of lambda, such as if lambda were equal to 0." "In that case, given that we're fitting a high order polynomial, this is a usual overfitting setting." "In that case, given that we're fitting a high order polynomial, basically without regularization or with very minimal regularization, we end up with our usual high variance, overfitting setting, because basically if lambda is equal to zero, we are just" "fitting with our regularization so that overfits the hypothesis" "and is only if we have some intermediate value of lambda that is neither too large nor too small that we end up with parameters theta that we end up that give us a reasonable fit to this data." "So how can we automatically choose a good value for the regularization parameter lambda?" "Just to reiterate, here is our model and here is our learning algorithm subjective." "For the setting where we're using regularization, let me define j train of theta to be something different to be the optimization objective but without the regularization term." 
"Previously, in earlier video when we are not using regularization, I define j train of theta to be the same as j of theta as the cost function but when we are using regularization with this extra lambda term" "we're going to define j train my training set error, to be just my sum of squared errors on the training set, or my average squared error on the training set without taking into account that regularization chart." "And similarly, I'm then also going to define the cross-validation set error when the test set error, as before to be the average sum of squared errors on the cross-validation and the test sets." "So just to summarize, my definitions of J train and" "J C V and J" "Test are just the average squared error, or one half of the average squared error on my training validation and test sets without the extra" "regularization chart." "So, this is how we can automatically choose the regularization parameter lambda." "What I usually do is may be have some range of values of lambda I want to try it." "So I might be considering not using regularization," "or here are a few values I might try." "I might be considering along because of O1, O2 from O4 and so on." "And you know, I usually step these up in multiples of two until some maybe larger value this in multiples of two you" "I actually end up with 10.24;" "it's ten exactly, but you know, this is close enough and the 35 decimal places won't affect your result that much." "So, this gives me, maybe twelve different models, that I'm trying to select amongst, corresponding to" "12 different values of the regularization parameter lambda and of course, you can also go to values less than 0.01 or values larger than 10, but I've just truncated it here for convenience." "Given each of these 12 models, what we can do is then the following:" "we take this first model with lambda equals 0, and minimize my cos function j of theta and this would give me some parameter vector theta and similar to the earlier video, let me just denote this as" "theta superscript 1." "And then I can take my second model, with lambda set to 0.01 and minimize my cos function, now using lambda equals 0.01 of course, to get some different parameter vector theta, we need to know that theta 2," "and for that I end up with theta 3 so that this is correct for my third model, and so on, until for for my final model with lambda set to 10, or 10.24, or I end up with this theta 12." "Next I can take all of these hypotheses, all of these parameters, and use my cross-validation set to evaluate them." "So I can look at my first model, my second model, fits with these different values of the regularization parameter and evaluate them on my cross-validation set - basically measure the average squared error of each of these parameter" "vectors theta on my cross-validation set." "And I would then pick whichever one of these 12 models gives me the lowest error on the cross-validation set." "And let's say, for the sake of this example, that I end up picking theta 5, the fifth order polynomial, because that has the Noah's cross-validation error." "Having done that, finally, what" "I would do if I want to report a test set error is to take the parameter theta" "5 that I've selected and look at how well it does on my test set." 
"And once again here is as if we fit this parameter theta to my cross-validation set, which is why I am saving aside a separate test set that I am going to use to get a better estimate of how" "well my a parameter vector theta will generalize to previously unseen examples." "So that's model selection applied to selecting the regularization parameter lambda." "The last thing" "I'd like to do in this video, is get a better understanding of how cross-validation and training error vary as we as we vary the regularization parameter lambda." "And so just a reminder, that was our original cosine function j of theta, but for this purpose we're going to define" "training error without using the regularization parameter, and cross-validation error without using the regularization parameter and what I'd like to do is plot this J train and plot this Jcv, meaning just how well does my hypothesis do for on" "the training set and how well does my hypothesis do on the cross-validation set as I vary my regularization parameter lambda so as we saw earlier, if lambda is small, then we're not using much regularization and we run a larger risk of overfitting." "Where as if lambda is large, that is if we were on the right part of this horizontal axis, then with a large value of lambda we run the high risk of having a bias problem." "So if you plot J train and Jcv, what you find is that for small values of lambda you can fit the training set relatively well because you're not regularizing." "So, for small values of lambda, the regularization term basically goes away and you're just minimizing pretty much your squared error." "So when lambda is small, you end up with a small value for J train, whereas if lambda is large, then you have a high bias problem and you might not fit your training set so well." "So you end up with a value up there." "So, J train of theta will tend to increase when lambda increases because a large value of lambda corresponds a high bias where you might not even fit your training set well, whereas a small value of lambda corresponds to," "if you can you know freely fit to very high degree polynomials, your data, let's say." "As for the cross-validation error, we end up with a figure like this." "Where, over here on the right, if we have a large value of lambda, we may end up underfitting." "And so, this is the bias regime whereas and cross validation error will be high and let me just leave all that." "So, that's Jcv of theta because with high bias we won't be fitting." "We won't be doing well on the cross-validation set." "Whereas here on the left, this is the high-variance regime." "Where if we have two smaller value of then we may be overfitting the data and so by over fitting the data then it a cross validation error will also be high." "And so, this is what the cross-validation error and what the training error may look like on a training set as we vary the parameter lambda, as we vary the regularization parameter lambda." "And so, once again, it will often be some intermediate value of lambda that you know, subsequent just right or that works best in terms of having a small cross-validation error or a small test set error." "And whereas the curves I've drawn here are somewhat cartoonish and somewhat idealized." "So on a real data set the pros you get may end up looking a little bit more messy and just a little bit more noisy than this." 
"For some data sets you will really see these poor source of trends and by looking at the plot of the whole or cross validation error, you can either manually, automatically try to select a point that minimizes the cross-validation error and" "select the value of lambda corresponding to low cross-validation error." "When I'm trying to pick the regularization parameter lambda for a learning algorithm, often I find that plotting a figure like this one showed here, helps me understand better what's going on and helps me verify that" "I am indeed picking a good value for the regularization parameter lambda." "So hopefully that gives you more insight into regularization and it's effects on the bias and variance of the learning algorithm." "By know you've seen bias and variance from a lot of different perspectives." "And what I'd like to do in the next video is take a lot of the insights that we've gone through and build on them to put together a diagnostic that's called learning curves, which is a" "tool that I often use to try to diagnose if a learning algorithm may be suffering from a bias problem or a variance problem or a little bit of both." "In this video, I'd like to tell you about learning curves." "Learning curves is often a very useful thing to plot." "If either you wanted to sanity check that your algorithm is working correctly, or if you want to improve the performance of the algorithm." "And learning curves is a tool that I actually use very often to try to diagnose if a physical learning algorithm may be suffering from bias, sort of variance problem or a bit of both." "Here's what a learning curve is." "To plot a learning curve, what" "I usually do is plot j train which is, say," "average squared error on my training set or Jcv which is the average squared error on my cross validation set." "And I'm going to plot that as a function of m, that is as a function of the number of training examples I have." "And so m is usually a constant like maybe I just have, you know, a 100 training examples but what I'm going to do is artificially with use my training set exercise." "So, I deliberately limit myself to using only, say, 10 or 20 or" "30 or 40 training examples and plot what the training error is and what the cross validation is for this smallest training set exercises." "So let's see what these plots may look like." "Suppose I have only one training example like that shown in this this first example here and let's say I'm fitting a quadratic function." "Well, I have only one training example." "I'm going to be able to fit it perfectly right?" "You know, just fit the quadratic function." "I'm going to have 0 error on the one training example." "If I have two training examples." "Well the quadratic function can also fit that very well." "So, even if I am using regularization," "I can probably fit this quite well." "And if I am using no neural regularization," "I'm going to fit this perfectly and if I have three training examples again." "Yeah, I can fit a quadratic function perfectly so if m equals 1 or m equals 2 or m equals 3, my training error on my training set is going to be 0 assuming I'm not using regularization or it may" "slightly large in 0 if" "I'm using regularization and by the way if I have a large training set and I'm artificially restricting the size of my training set in order to J train." 
"Here if I set" "M equals 3, say, and I train on only three examples, then, for this figure I am going to measure my training error only on the three examples that actually fit my data too" "and so even I have to say a 100 training examples but if I want to plot what my training error is the m equals 3." "What I'm going to do is to measure the training error on the three examples that I've actually fit to my hypothesis 2." "And not all the other examples that I have deliberately omitted from the training process." "So just to summarize what we've seen is that if the training set size is small then the training error is going to be small as well." "Because you know, we have a small training set is going to be very easy to fit your training set very well may be even perfectly now say we have m equals 4 for example." "Well then a quadratic function can be a longer fit this data set perfectly and if I have m equals 5 then you know, maybe quadratic function will fit to stay there so so, then as my training set gets larger." "It becomes harder and harder to ensure that I can find the quadratic function that process through all my examples perfectly." "So in fact as the training set size grows what you find is that my average training error actually increases and so if you plot this figure what you find is that the training set error that is the average error on your hypothesis grows" "as m grows and just to repeat when the intuition is that when m is small when you have very few training examples." "It's pretty easy to fit every single one of your training examples perfectly and so your error is going to be small whereas when m is larger then gets harder all the training examples perfectly and so your training set error becomes" "more larger now, how about the cross validation error." "Well, the cross validation is my error on this cross validation set that I haven't seen and so, you know, when I have a very small training set, I'm not going to generalize well, just" "not going to do well on that." "So, right, this hypothesis here doesn't look like a good one, and it's only when I get a larger training set that, you know, I'm starting to get hypotheses that maybe fit the data somewhat better." "So your cross validation error and your test set error will tend to decrease as your training set size increases because the more data you have, the better you do at generalizing to new examples." "So, just the more data you have, the better the hypothesis you fit." "So if you plot j train, and Jcv this is the sort of thing that you get." "Now let's look at what the learning curves may look like if we have either high bias or high variance problems." "Suppose your hypothesis has high bias and to explain this" "I'm going to use a, set an example, of fitting a straight line to data that, you know, can't really be fit well by a straight line." "So we end up with a hypotheses that maybe looks like that." "Now let's think what would happen if we were to increase the training set size." "So if instead of five examples like what I've drawn there, imagine that we have a lot more training examples." "Well what happens, if you fit a straight line to this." "What you find is that, you end up with you know, pretty much the same straight line." "I mean a straight line that just cannot fit this data and getting a ton more data, well the straight line isn't going to change that much." "This is the best possible straight-line fit to this data, but the straight line just can't fit this data set that well." 
"So, if you plot across validation error, this is what it will look like." "Option on the left, if you have already a miniscule training set size like you know, maybe just one training example and is not going to do well." "But by the time you have reached a certain number of training examples, you have almost fit the best possible straight line, and even if you end up with a much larger training set size, a much larger value of m," "you know, you're basically getting the same straight line, and so, the cross-validation error" " let me label that - or test set error or plateau out, or flatten out pretty soon, once you reached beyond a certain the number of training examples, unless you pretty much fit the best possible straight line." "And how about training error?" "Well, the training error will again be small." "And what you find in the high bias case is that the training error will end up close to the cross validation error, because you have so few parameters and so much data, at least when m is large." "The performance on the training set and the cross validation set will be very similar." "And so, this is what your learning curves will look like, if you have an algorithm that has high bias." "And finally, the problem with high bias is reflected in the fact that both the cross validation error and the training error are high, and so you end up with a relatively high value of both Jcv and the j train." "This also implies something very interesting, which is that, if a learning algorithm has high bias, as we get more and more training examples, that is, as we move to the right of this figure, we'll" "notice that the cross validation error isn't going down much, it's basically fattened up, and so if learning algorithms are really suffering from high bias." "Getting more training data by itself will actually not help that much,and as our figure example in the figure on the right, here we had only five training." "examples, and we fill certain straight line." "And when we had a ton more training data, we still end up with roughly the same straight line." "And so if the learning algorithm has high bias give me a lot more training data." "That doesn't actually help you get a much lower cross validation error or test set error." "So knowing if your learning algorithm is suffering from high bias seems like a useful thing to know because this can prevent you from wasting a lot of time collecting more training data where it might just not end up being helpful." "Next let us look at the setting of a learning algorithm that may have high variance." "Let us just look at the training error in a around if you have very smart training set like five training examples shown on the figure on the right and if we're fitting say a very high order polynomial," "and I've written a hundredth degree polynomial which really no one uses, but just an illustration." "And if we're using a fairly small value of lambda, maybe not zero, but a fairly small value of lambda, then we'll end up, you know, fitting this data very well that with a function that overfits this." "So, if the training set size is small, our training error, that is, j train of theta will be small." 
"And as this training set size increases a bit, you know, we may still be overfitting this data a little bit but it also becomes slightly harder to fit this data set perfectly, and so, as the training set size" "increases, we'll find that j train increases, because it is just a little harder to fit the training set perfectly when we have more examples, but the training set error will still be pretty low." "Now, how about the cross validation error?" "Well, in high variance setting, a hypothesis is overfitting and so the cross validation error will remain high, even as we get you know, a moderate number of training examples and, so maybe, the cross validation" "error may look like that." "And the indicative diagnostic that we have a high variance problem," "is the fact that there's this large gap between the training error and the cross validation error." "And looking at this figure." "If we think about adding more training data, that is, taking this figure and extrapolating to the right, we can kind of tell that, you know the two curves, the blue curve and the magenta curve, are converging to each other." "And so, if we were to extrapolate this figure to the right, then it seems it likely that the training error will keep on going up and the" "cross-validation error would keep on going down." "And the thing we really care about is the cross-validation error or the test set error, right?" "So in this sort of figure, we can tell that if we keep on adding training examples and extrapolate to the right, well our cross validation error will keep on coming down." "And, so, in the high variance setting, getting more training data is, indeed, likely to help." "And so again, this seems like a useful thing to know if your learning algorithm is suffering from a high variance problem, because that tells you, for example that it may be be worth your while to see if you can go and get some more training data." "Now, on the previous slide and this slide, I've drawn fairly clean fairly idealized curves." "If you plot these curves for an actual learning algorithm, sometimes you will actually see, you know, pretty much curves, like what I've drawn here." "Although, sometimes you see curves that are a little bit noisier and a little bit messier than this." "But plotting learning curves like these can often tell you, can often help you figure out if your learning algorithm is suffering from bias, or variance or even a little bit of both." "So when I'm trying to improve the performance of a learning algorithm, one thing that I'll almost always do is plot these learning curves, and usually this will give you a better sense of whether there is a bias or variance problem." "And in the next video we'll see how this can help suggest specific actions is to take, or to not take, in order to try to improve the performance of your learning algorithm." "We've talked about how to evaluate learning algorithms, talked about model selection, talked a lot about bias and variance." "So how does this help us figure out what are potentially fruitful, potentially not fruitful things to try to do to improve the performance of a learning algorithm." "Let's go back to our original motivating example and go for the result." "So here is our earlier example of maybe having fit regularized linear regression and finding that it doesn't work as well as we're hoping." "We said that we had this menu of options." "So is there some way to figure out which of these might be fruitful options?" 
"The first thing all of this was getting more training examples." "What this is good for, is this helps to fix high variance." "And concretely, if you instead have a high bias problem and don't have any variance problem, then we saw in the previous video that getting more training examples," "while maybe just isn't going to help much at all." "So the first option is useful only if you, say, plot the learning curves and figure out that you have at least a bit of a variance, meaning that the cross-validation error is, you know, quite a bit bigger than your training set error." "How about trying a smaller set of features?" "Well, trying a smaller set of features, that's again something that fixes high variance." "And in other words, if you figure out, by looking at learning curves or something else that you used, that have a high bias problem; then for goodness sakes, don't waste your time trying to carefully select out" "a smaller set of features to use." "Because if you have a high bias problem, using fewer features is not going to help." "Whereas in contrast, if you look at the learning curves or something else you figure out that you have a high variance problem, then, indeed trying to select out a smaller set of features, that might indeed be a very good use of your time." "How about trying to get additional features, adding features, usually, not always, but usually we think of this as a solution" "for fixing high bias problems." "So if you are adding extra features it's usually because" "your current hypothesis is too simple, and so we want to try to get additional features to make our hypothesis better able to fit the training set." "And similarly, adding polynomial features;" "this is another way of adding features and so there is another way to try to fix the high bias problem." "And, if concretely if your learning curves show you that you still have a high variance problem, then, you know, again this is maybe a less good use of your time." "And finally, decreasing and increasing lambda." "This are quick and easy to try," "I guess these are less likely to be a waste of, you know, many months of your life." "But decreasing lambda, you already know fixes high bias." "In case this isn't clear to you, you know, I do encourage you to pause the video and think through this that convince yourself that decreasing lambda helps fix high bias, whereas increasing lambda fixes high variance." "And if you aren't sure why this is the case, do pause the video and make sure you can convince yourself that this is the case." "Or take a look at the curves that we were plotting at the end of the previous video and try to make sure you understand why these are the case." "Finally, let us take everything we have learned and relate it back to neural networks and so, here is some practical advice for how I usually choose the architecture or the connectivity pattern of the neural networks I use." "So, if you are fitting a neural network, one option would be to fit, say, a pretty small neural network with you know, relatively few hidden units, maybe just one hidden unit." "If you're fitting a neural network, one option would be to fit a relatively small neural network with, say," "relatively few, maybe only one hidden layer and maybe only a relatively few number of hidden units." "So, a network like this might have relatively few parameters and be more prone to underfitting." "The main advantage of these small neural networks is that the computation will be cheaper." 
"An alternative would be to fit a, maybe relatively large neural network with either more hidden units--there's a lot of hidden in one there--or with more hidden layers." "And so these neural networks tend to have more parameters and therefore be more prone to overfitting." "One disadvantage, often not a major one but something to think about, is that if you have a large number of neurons in your network, then it can be more computationally expensive." "Although within reason, this is often hopefully not a huge problem." "The main potential problem of these much larger neural networks is that it could be more prone to overfitting and it turns out if you're applying neural network very often using a large neural network often it's actually the larger, the better" "but if it's overfitting, you can then use regularization to address overfitting, usually using a larger neural network by using regularization to address is overfitting that's often more effective than using a smaller neural network." "And the main possible disadvantage is that it can be more computationally expensive." "And finally, one of the other decisions is, say, the number of hidden layers you want to have, right?" "So, do you want one hidden layer or do you want three hidden layers, as we've shown here, or do you want two hidden layers?" "And usually, as I think I said in the previous video, using a single hidden layer is a reasonable default, but if you want to choose the number of hidden layers, one other thing you can try is" "find yourself a training cross-validation, and test set split and try training neural networks with one hidden layer or two hidden layers or three hidden layers and see which of those neural networks performs best on the cross-validation sets." "You take your three neural networks with one, two and three hidden layers, and compute the cross validation error at Jcv and all of them and use that to select which of these is you think the best neural network." "So, that's it for bias and variance and ways like learning curves, who tried to diagnose these problems." "As far as what you think is implied, for one might be truthful or not truthful things to try to improve the performance of a learning algorithm." "If you understood the contents of the last few videos and if you apply them you actually be much more effective already and getting learning algorithms to work on problems and even a large fraction, maybe the majority of practitioners" "of machine learning here in" "Silicon Valley today doing these things as their full-time jobs." "So I hope that these pieces of advice on by experience in diagnostics" "will help you to much effectively and powerfully apply learning and get them to work very well." "In the next few videos I'd like to talk about machine learning system design." "These videos will touch on the main issues that you may face when designing a complex machine learning system," "and will actually try to give advice on how to strategize putting together a complex machine learning system." "In case this next set of videos seems a little disjointed that's because these videos will touch on a range of the different issues that you may come across when designing complex learning systems." "And even though the next set of videos may seem somewhat less mathematical, I think that this material may turn out to be very useful, and potentially huge time savers when you're building big machine learning systems." 
"Concretely, I'd like to begin with the issue of prioritizing how to spend your time on what to work on, and I'll begin with an example on spam classification." "Let's say you want to build a spam classifier." "Here are a couple of examples of obvious spam and non-spam emails." "if the one on the left tried to sell things." "And notice how spammers will deliberately misspell words, like Vincent with a 1 there, and mortgages." "And on the right as maybe an obvious example of non-stamp email, actually email from my younger brother." "Let's say we have a labeled training set of some number of spam emails and some non-spam emails denoted with labels y equals 1 or 0, how do we build a classifier using supervised learning to distinguish between spam and non-spam?" "In order to apply supervised learning, the first decision we must make is how do we want to represent x, that is the features of the email." "Given the features x and the labels y in our training set, we can then train a classifier, for example using logistic regression." "Here's one way to choose a set of features for our emails." "We could come up with, say, a list of maybe a hundred words that we think are indicative of whether e-mail is spam or non-spam, for example, if a piece of e-mail contains the word 'deal'" "maybe it's more likely to be spam if it contains the word 'buy' maybe more likely to be spam, a word like 'discount' is more likely to be spam, whereas if a piece of email contains my name," "Andrew, maybe that means the person actually knows who" "I am and that might mean it's less likely to be spam." "And maybe for some reason I think the word "now" may be indicative of non-spam because" "I get a lot of urgent emails, and so on, and maybe we choose a hundred words or so." "Given a piece of email, we can then take this piece of email and encode it into a feature vector as follows." "I'm going to take my list of a hundred words and sort them in alphabetical order say." "It doesn't have to be sorted." "But, you know, here's a, here's my list of words, just count and so on, until eventually" "I'll get down to now, and so on and given a piece of e-mail like that shown on the right, I'm going to check and see whether or not each of these words appears in the e-mail and then" "I'm going to define a feature vector x where in this piece of an email on the right, my name doesn't appear so I'm gonna put a zero there." "The word "by" does appear, so I'm gonna put a one there and I'm just gonna put one's or zeroes." "I'm gonna put a one even though the word "by" occurs twice." "I'm not gonna recount how many times the word occurs." "The word "due" appears, I put a one there." "The word "discount" doesn't appear, at least not in this this little short email, and so on." "The word "now" does appear and so on." "So I put ones and zeroes in this feature vector depending on whether or not a particular word appears." "And in this example my feature vector would have to mention one hundred," "if I have a hundred, if if I chose a hundred words to use for this representation and each of my features Xj will basically be 1 if" "you have a particular word that, we'll call this word j, appears in the email and Xj" "would be zero otherwise." "Okay." "So that gives me a feature representation of a piece of email." 
"By the way, even though I've described this process as manually picking a hundred words, in practice what's most commonly done is to look through a training set, and in the training set depict the most frequently occurring n words" "where n is usually between ten thousand and fifty thousand, and use those as your features." "So rather than manually picking a hundred words, here you look through the training examples and pick the most frequently occurring words like ten thousand to fifty thousand words, and those form the features that you are going to use to represent your email for spam classification." "Now, if you're building a spam classifier one question that you may face is, what's the best use of your time in order to make your spam classifier have higher accuracy, you have lower error." "One natural inclination is going to collect lots of data." "Right?" "And in fact there's this tendency to think that, well the more data we have the better the algorithm will do." "And in fact, in the email spam domain, there are actually pretty serious projects called Honey" "Pot Projects, which create fake email addresses and try to get these fake email addresses into the hands of spammers and use that to try to collect tons of spam email, and therefore you know, get a lot of spam data to train learning algorithms." "But we've already seen in the previous sets of videos that getting lots of data will often help, but not all the time." "But for most machine learning problems, there are a lot of other things you could usually imagine doing to improve performance." "For spam, one thing you might think of is to develop more sophisticated features on the email, maybe based on the email routing information." "And this would be information contained in the email header." "So, when spammers send email, very often they will try to obscure the origins of the email, and maybe use fake email headers." "Or send email through very unusual sets of computer service." "Through very unusual routes, in order to get the spam to you." "And some of this information will be reflected in the email header." "And so one can imagine, looking at the email headers and trying to develop more sophisticated features to capture this sort of email routing information to identify if something is spam." "Something else you might consider doing is to look at the email message body, that is the email text, and try to develop more sophisticated features." "For example, should the word 'discount' and the word 'discounts' be treated as the same words or should we have treat the words 'deal' and 'dealer' as the same word?" "Maybe even though one is lower case and one in capitalized in this example." "Or do we want more complex features about punctuation because maybe spam is using exclamation marks a lot more." "I don't know." "And along the same lines, maybe we also want to develop more sophisticated algorithms to detect and maybe to correct to deliberate misspellings, like mortgage, medicine, watches." "Because spammers actually do this, because if you have watches" "with a 4 in there then well, with the simple technique that we talked about just now, the spam classifier might not equate this as the same thing as the word "watches," and so it may have a harder time realizing" "that something is spam with these deliberate misspellings." "And this is why spammers do it." "While working on a machine learning problem, very often you can brainstorm lists of different things to try, like these." 
"By the way, I've actually worked on the spam problem myself for a while." "And I actually spent quite some time on it." "And even though I kind of understand the spam problem," "I actually know a bit about it," "I would actually have a very hard time telling you of these four options which is the best use of your time so what happens, frankly what happens far too often is that a research group or product group will randomly fixate on one of these options." "And sometimes that turns out not to be the most fruitful way to spend your time depending, you know, on which of these options someone ends up randomly fixating on." "By the way, in fact, if you even get to the stage where you brainstorm a list of different options to try, you're probably already ahead of the curve." "Sadly, what most people do is instead of trying to list out the options of things you might try, what far too many people do is wake up one morning and, for some reason, just, you know, have a weird" "gut feeling that, "Oh let's have a huge honeypot project to go and collect tons more data"" "and for whatever strange reason just sort of wake up one morning and randomly fixate on one thing and just work on that for six months." "But I think we can do better." "And in particular what I'd like to do in the next video is tell you about the concept of error analysis" "and talk about the way where you can try to have a more systematic way to choose amongst the options of the many different things you might work, and therefore be more likely to select what is actually a good way" "to spend your time, you know for the next few weeks, or next few days or the next few months." "In the last video, I talked about how when faced with a machine learning problem, there are often lots of different ideas on how to improve the algorithm." "In this video let's talk about the concepts of error analysis which will help me give you a way to more systematically make some of these decisions." "If you're starting work on a machine learning product or building a machine learning application, it is often considered very good practice to start, not by building a very complicated system with lots of complex features and so" "on, but to instead start by building a very simple algorithm, the you can implement quickly." "And when I start on a learning problem, what I usually do is spend at most one day, literally at most 24 hours to try to get something really quick and dirty." "Frankly not at all sophisticated system." "But get something really quick and dirty running and implement it and then test it on my cross validation data." "Once you've done that, you can then plot learning curves." "This is what we talked about in the previous set of videos." "But plot learning curves of the training and test errors to try to figure out if your learning algorithm may be suffering from high bias or high variance or something else and use that to try to decide if having more data" "and more features and so on are likely to help." "And the reason that this is a good approach is often when you're just starting out on a learning problem, there's really no way to tell in advance whether you need more complex features or whether you need" "more data or something else." "And it's just very hard to tell in advance, that is in the absence of evidence, in the absence of seeing a learning curve, it's just incredibly difficult to figure out where you should be spending your time." 
"And it's often by implementing even a very, very quick and dirty implementation and by plotting learning curves that that helps you make these decisions." "So if you like, you can think of this as a way of avoiding what's sometimes called premature optimization in computer programming." "And this is idea that just says that we should let evidence guide our decisions on where to spend our time rather than use gut feeling, which is often wrong." "In addition to plotting learning curves, one other thing that's often very useful to do is what's called error analysis." "And what I mean by that is that when building, say a spam classifier, I will often look at my cross validation set and manually look at the emails that my algorithm is making errors on." "So, look at the spam emails and non-spam emails that the algorithm is misclassifying, and see if you can spot any systematic patterns in what type of examples it is misclassifying." "And often by doing that, this is the process that would inspire" "you to design new features." "Or they'll tell you whether the current things or current shortcomings of the system and give you the inspiration you need to come up with improvements to it." "Concretely, here's a specific example." "Let's say you've built a spam classifier and you have 500 examples in your cross-validation set." "And let's say in this example, that the algorithm has a very high error rate, and it misclassifies a hundred of these cross-validation examples." "So what I do is manually examine these 100 errors, and manually categorize them, based on things like what type of email it is and what cues or what features you think might have helped the algorithm classify them incorrectly." "So, specifically, by what type of email it is, you know, if I look through these hundred errors I may find that maybe the most common types of spam emails in misclassifies are maybe emails on pharmacy, so basically" "these are emails trying to sell drugs, maybe emails that are trying to sell replicas - those are those fake watches fake you know, random things." "Maybe have some emails trying to steal passwords." "These are also called phishing emails." "But that's another big category of emails and maybe other categories." "So, in terms of classify what type of email it is, I would actually go through and count up, you know, of my 100 emails, maybe I find that twelve of the mislabeled emails are pharma emails." "And maybe four of them are emails trying to sell replicas, they sell fake watches or something." "And maybe I find that 53 of them are these, what's called phishing emails, basically emails trying to persuade you to give them your password, and 31 emails are other types of emails." "And it's by counting up the number of emails in these different categories that you might discover, for example, that the algorithm is doing really particularly" "poorly on emails trying to steal passwords, and that may suggest that it might be worth your effort to look more carefully at that type of email, and see if you can come up with better features to categorize them correctly." "And also, what I might do is look at what cues, or what features, additional features might have helped the algorithm classify the emails." "So let's say that some of our hypotheses about things or features that might help us classify emails better are trying to detect deliberate misspellings versus unusual email routing versus unusual, you know," "spamming punctuation, such as people use a lot of exclamation marks." 
"And once again, I would manually go through and let's say" "I find five cases of this, and 16 of this, and 32 of this and a bunch of other types of emails as well." "And if this is what you get on your cross validation set then it really tells you that, you know, maybe deliberate spelling is a sufficiently rare phenomenon that maybe is not really worth all your time trying to write" "algorithms to detect that." "But if you find a lot of spammers are using, you know, unusual punctuation then maybe that's a strong sign that it might actually be worth your while to spend the time to develop more sophisticated features based on the punctuation." "So, this sort of error analysis which is really the process of manually examining the mistakes that the algorithm makes, can often help guide you to the most fruitful avenues to pursue." "And this also explains why I often recommend implementing a quick and dirty implementation of an algorithm." "What we really want to do is figure out what are the most difficult examples for an algorithm to classify." "And very often for different algorithms, for different learning algorithms, they'll often find, you know, similar categories of examples difficult." "And by having a quick and dirty implementation, that's often a quick way to let you identify some errors and quickly identify what are the hard examples so that you can focus your efforts on those." "Lastly, when developing learning algorithms, one other useful tip is to make sure that you have a way, that you have a numerical evaluation of your learning algorithm." "Now what I mean by that is that if you're developing a learning algorithm, it is often incredibly helpful if you have a way of evaluating your learning algorithm that just gives you back a single real number." "Maybe accuracy, maybe error." "But the single real number that tells you how well your learning algorithm is doing." "I'll talk more about this specific concepts in later videos, but here's a specific example." "Let's say we are trying to decide whether or not we should treat words like discount, discounts, discounter, discounting, as the same word." "So maybe one way to do that is to just look at the first few characters in a word." "Like, you know, if you just look at the first few characters of a word, then you figure out that maybe all of these words are roughly - have similar meanings." "In natural language processing, the way that this is done is actually using a type of software called stemming software." "If you ever want to do this yourself, search on a web search engine for the" "Porter Stemmer and that would be, you know, one reasonable piece of software for doing this sort of stemming, which will let you treat all of these discount, discounts, and so on as the same word." "But using a stemming software that basically looks at the first few alphabets of the word more or less, it can help but it can hurt." "And it can hurt because, for example, this software may mistake the words universe and university as being the same thing because, you know, these two words start off with very similar characters, with the same alphabets." "So if you're trying to decide whether or not to use stemming software for a stem classifier, it is not always easy to tell." "And in particular, error analysis may not actually be helpful" "for deciding if this sort of stemming idea is a good idea." 
"Instead, the best way to figure out if using stemming software is good to help your classifier is if you have a way to very quickly just try it and see if it works." "And in order to do this, having a way to numerically evaluate your algorithm, is going to be very helpful." "Concretely, maybe the most natural thing to do is to look at the cross validation error of the algorithm's performance with and without stemming." "So, if you run your algorithm without stemming and you end up with, let's say, five percent classification error, and you re-run it and you end up with, let's say, three percent classification error, then this" "decrease in error very quickly allows you to decide that, you know, it looks like using stemming is a good idea." "For this particular problem, there's a very natural single real number evaluation metric, namely, the cross validation error." "We'll see later, examples where coming up with this, sort of, single row number evaluation metric may need a little bit more work." "But as we'll see in the later video, doing so would also then let you make these decisions much more quickly of, say, whether or not to use stemming." "And just this one more quick example." "Let's say that you're also trying to decide whether or not to distinguish between upper versus lower case." "So, you know, is the red mom with uppercase M versus lower case m," "I mean, should that be treated as the same word or as different words?" "Should these be treated as the same feature or as different features?" "And so once again, because we have a way to evaluate our algorithm, if you try this out here, if" "I stop distinguishing upper and lower case, maybe I end up with 3.2% error and I find that therefore this does worse than, you know, if I use only stemming, and so this lets me very quickly decide to go" "ahead and to distinguish or to not distinguish between upper and lower case." "So when you' re developing a learning algorithm, very often you'll be trying out lots of new ideas and lots of new versions of your learning algorithm." "If every time you try out a new idea if you end up manually examining a bunch of examples, you begin to see better or worse, you know, that's going to make it really hard to make decisions" "on do you use stemming or not." "Do you distinguish upper or lowercase or not?" "But by having a single rule number evaluation metric, you can then just look and see oh, did the error go up or go down?" "And you can use that much more rapidly, try out new ideas and almost right away tell if your new idea has improved or worsened the performance of the learning algorithm and this will let you often make much faster progress." "So the recommended, strongly recommended way to do error analysis is on the cross validation set rather than the test set." "But, you know, there are people that will do this on the test set even though that's definitely a less mathematically appropriate set of your list, recommended what you think to do than to do error analysis on your" "cross validation sector." "So, to wrap up this video, when starting on the new machine learning problem, what" "I almost always recommend is to implement a quick and dirty implementation of your learning algorithm." "And I've almost never seen anyone spend too little time on this quick and dirty implementation." "I pretty much only ever see people spend much too much time building their first, you know, supposedly quick and dirty implementations." 
"So really, don't worry about it being too quick, or don't worry about it being too dirty." "But really implement something as quickly as you can, and once you have the initial implementation this is then a powerful tool for deciding where to spend your time next, because first we can look at the errors it makes," "and do this sort of error analysis to see what mistakes it makes and use that to inspire further development." "And second, assuming your quick and dirty implementation incorporated a single real number evaluation metric, this can then be a vehicle for you to try out different ideas and quickly see if the different ideas you're trying out" "are improving the performance of your algorithm and therefore let you maybe much more quickly make decisions about what things to fold, and what things to incorporate into your learning algorithm." "In the previous video, I talked about error analysis and the importance of having error metrics, that is of having a single real number evaluation metric for your learning algorithm to tell how well it's doing." "In the context of evaluation and of error metrics, there is one important case, where it's particularly tricky to come up with an appropriate error metric, or evaluation metric, for your learning algorithm." "That case is the case of what's called skewed classes." "Let me tell you what that means." "Consider the problem of cancer classification, where we have features of medical patients and we want to decide whether or not they have cancer." "So this is like the malignant versus benign tumor classification example that we had earlier." "So let's say y equals 1 if the patient has cancer and y equals 0 if they do not." "We have trained the progression classifier and let's say we test our classifier on a test set and find that we get 1 percent error." "So, we're making 99% correct diagnosis." "Seems like a really impressive result, right." "We're correct 99% percent of the time." "But now, let's say we find out that only 0.5 percent of patients in our training test sets actually have cancer." "So only half a percent of the patients that come through our screening process have cancer." "In this case, the 1% error no longer looks so impressive." "And in particular, here's a piece of code, here's actually a piece of non learning code that takes this input of features x and it ignores it." "It just sets y equals 0 and always predicts, you know, nobody has cancer and this algorithm would actually get" "0.5 percent error." "So this is even better than the 1% error that we were getting just now and this is a non learning algorithm that you know, it is just predicting y equals 0 all the time." "So this setting of when the ratio of positive to negative examples is very close to one of two extremes, where, in this case, the number of positive examples is much, much smaller than the number of negative examples because y" "equals one so rarely, this is what we call the case of skewed classes." "We just have a lot more of examples from one class than from the other class." "And by just predicting y equals 0 all the time, or maybe our predicting y equals 1 all the time, an algorithm can do pretty well." "So the problem with using classification error or classification accuracy as our evaluation metric is the following." "Let's say you have one joining algorithm that's getting 99.2% accuracy." "So, that's a 0.8% error." "Let's say you make a change to your algorithm and you now are getting" "99.5% accuracy." "That is 0.5% error." 
"So, is this an improvement to the algorithm or not?" "One of the nice things about having a single real number evaluation metric is this helps us to quickly decide if we just need a good change to the algorithm or not." "By going from 99.2% accuracy to 99.5% accuracy." "You know, did we just do something useful or did we just replace our code with something that just predicts y equals zero more often?" "So, if you have very skewed classes it becomes much harder to use just classification accuracy, because you can get very high classification accuracies or very low errors, and it's not always clear if doing so is really improving" "the quality of your classifier because predicting y equals 0 all the time doesn't seem like a particularly good classifier." "But just predicting y equals 0 more often can bring your error down to, you know, maybe as low as 0.5%." "When we're faced with such a skewed classes therefore we would want to come up with a different error metric or a different evaluation metric." "One such evaluation metric are what's called precision recall." "Let me explain what that is." "Let's say we are evaluating a classifier on the test set." "For the examples in the test set the actual" "class of that example in the test set is going to be either one or zero, right, if there is a binary classification problem." "And what our learning algorithm will do is it will, you know, predict some value for the class and our learning algorithm will predict the value for each example in my test set and the predicted value will also be either one or zero." "So let me draw a two by two table as follows, depending on a full of these entries depending on what was the actual class and what was the predicted class." "If we have an example where the actual class is one and the predicted class is one then that's called" "an example that's a true positive, meaning our algorithm predicted that it's positive and in reality the example is positive." "If our learning algorithm predicted that something is negative, class zero, and the actual class is also class zero then that's what's called a true negative." "We predicted zero and it actually is zero." "To find the other two boxes, if our learning algorithm predicts that the class is one but the" "actual class is zero, then that's called a false positive." "So that means our algorithm for the patient is cancelled out in reality if the patient does not." "And finally, the last box is a zero, one." "That's called a false negative because our algorithm predicted zero, but the actual class was one." "And so, we have this little sort of two by two table based on what was the actual class and what was the predicted class." "So here's a different way of evaluating the performance of our algorithm." "We're going to compute two numbers." "The first is called precision - and what that says is," "of all the patients where we've predicted that they have cancer," "what fraction of them actually have cancer?" "So let me write this down, the precision of a classifier is the number of true positives divided by" "the number that we predicted as positive, right?" "So of all the patients that we went to those patients and we told them, "We think you have cancer."" "Of all those patients, what fraction of them actually have cancer?" "So that's called precision." "And another way to write this would be true positives and then in the denominator is the number of predicted positives, and so that would be the sum of the, you know, entries in this first row of the table." 
"So it would be true positives divided by true positives." "I'm going to abbreviate positive as POS and then plus false positives, again abbreviating positive using POS." "So that's called precision, and as you can tell high precision would be good." "That means that all the patients that we went to and we said, "You know, we're very sorry." "We think you have cancer," high precision means that of that group of patients most of them we had actually made accurate predictions on them and they do have cancer." "The second number we're going to compute is called recall, and what recall say is, if all the patients in, let's say, in the test set or the cross-validation set, but if all the patients in the data" "set that actually have cancer, what fraction of them that we correctly detect as having cancer." "So if all the patients have cancer, how many of them did we actually go to them and you know, correctly told them that we think they need treatment." "So, writing this down, recall is defined as the number of positives, the number of true positives, meaning the number of people that have cancer and that we correctly predicted have cancer" "and we take that and divide that by, divide that by the number of actual positives," "so this is the right number of actual positives of all the people that do have cancer." "What fraction do we directly flag and you know, send the treatment." "So, to rewrite this in a different form, the denominator would be the number of actual positives as you know, is the sum of the entries in this first column over here." "And so writing things out differently, this is therefore, the number of true positives, divided by" "the number of true positives plus the number of" "false negatives." "And so once again, having a high recall would be a good thing." "So by computing precision and recall this will usually give us a better sense of how well our classifier is doing." "And in particular if we have a learning algorithm that predicts y equals zero all the time, if it predicts no one has cancer, then this classifier will have a recall equal to zero, because there won't be any" "true positives and so that's a quick way for us to recognize that, you know, a classifier that predicts y equals 0 all the time, just isn't a very good classifier." "And more generally, even for settings where we have very skewed classes, it's not possible for an algorithm to sort of "cheat"" "and somehow get a very high precision and a very high recall by doing some simple thing like predicting y equals 0 all the time or predicting y equals 1 all the time." "And so we're much more sure that a classifier of a high precision or high recall actually is a good classifier, and this gives us a more useful evaluation metric that is a more direct way to actually understand whether, you know, our algorithm may be doing well." "So one final note in the definition of precision and recall, that we would define precision and recall, usually we use the convention that y is equal to 1, in the presence of the more rare class." "So if we are trying to detect." "rare conditions such as cancer, hopefully that's a rare condition, precision and recall are defined setting y equals" "1, rather than y equals 0, to be sort of that the presence of that rare class that we're trying to detect." 
"And by using precision and recall, we find, what happens is that even if we have very skewed classes, it's not possible for an algorithm to you know, "cheat" and predict y equals 1 all the time," "or predict y equals 0 all the time, and get high precision and recall." "And in particular, if a classifier is getting high precision and high recall, then we are actually confident that the algorithm has to be doing well, even if we have very skewed classes." "So for the problem of skewed classes precision recall gives us more direct insight into how the learning algorithm is doing and this is often a much better way to evaluate our learning algorithms, than looking at classification error or classification accuracy, when the classes are very skewed." "In the last video, we talked about precision and recall as an evaluation metric for classification problems with skew classes." "For many applications, we'll want to somehow control the trade off between position and recall." "Let me tell you how to do that and also show you some, even more effective ways to use precision and recall as an evaluation metric for learning algorithms." "As a reminder, here are the definitions of precision and recall from the previous video." "Let's continue our cancer classification example, where y equals one if the patient has cancer and y equals zero otherwise." "And let's say we've trained in logistic regression classifier, which outputs probabilities between zero and one." "So, as usual, we're going to predict one, y equals one if h of x is greater than or equal to" "0.5 and predict zero if the hypothesis outputs a value less than 0.5 and this classifier may give us some value for precision and some value for recall." "But now, suppose we want to predict that a patient has cancer only if we're very confident that they really do." "Because you know if you go to a patient and you tell them that they have cancer, it's going to give them a huge shock because this is seriously bad news and they may end up going through a pretty" "painful treatment process and so on." "And so maybe we want to tell someone that we think they have cancer only if they're very confident." "One way to do this would be to modify the algorithm, so that instead of setting the threshold at 0.5, we might instead say that we'll predict that y is equal to 1, only if H of" "x is greater than or equal to 0.7." "So this, I think will tell someone if they have cancer only if we think there's a greater than, greater than or equal to 70% that they have cancer." "And if you do this then you're predicting some of this cancer only when you're more confident, and so you end up with a classifier" "that has higher precision, because all the patients that you're going to and say, you know, we think you have cancer, all of those patients are now pretty, once they hear, pretty confident they actually have cancer." "And so, a higher fraction of the patients that you predict to have cancer, will actually turn out to have cancer, because in making those predictions we are pretty confident." "But in contrast, this classifier will have lower recall, because now we are going to make predictions, we are going to predict y equals one, on a smaller number of patients." "Now we could even take this further." 
"Instead of setting the threshold at 0.7, we can set this at 0.9 and we'll predict y1 only if we are more than 90% certain that the patient has cancer, and so, you know, a large fraction that" "those patients will turn out to have cancer and so, this is the high precision classifier" "will have lower recall because we want to correctly detect that those patients have cancer." "Now consider a different example." "Suppose we want to avoid missing too many actual cases of cancer." "So we want to avoid the false negatives." "In particular, if a patient actually has cancer, but we fail to tell them that they have cancer, then that can be really bad." "Because if we tell a patient that they don't have cancer then they are not going to go for treatment and if it turns out that they have cancer or we fail to tell them they have cancer, well they may not get treated at all." "And so that would be a really bad outcome because he died because we told them they don't have cancer they failed to get treated, but it turns" "out that they actually have cancer." "When in doubt, we want to predict that y equals one." "So when in doubt, we want to predict that they have cancer so that at least they look further into it" "and this can get treated, in case they do turn out to have cancer." "In this case, rather than setting higher probability threshold, we might instead take this value and this then sets it to a lower value, so maybe" "0.3 like so." "By doing so, we're saying that, you know what, if we think there's more than a 30% chance that they have caner, we better be more conservative and tell them that they may have cancer," "so they can seek treatment if necessary." "And in this case, what we would have is going to be a higher recall classifier," "because we're going to be correctly flagging a higher fraction of all of the patients that actually do have cancer, but we're going to end up with lower precision, because the higher fraction of the patients that we said have" "cancer, the higher fraction of them will turn out not to have cancer after all." "And by the way, just as an aside, when I talk about this to other students up until before, it's pretty amazing." "Some of my say is how I can tell the story both ways." "Why we might want to have higher precision or higher recall and the story actually seems to work both ways." "But I hope the details of the algorithm is true and the more general principle is, depending on where you want, whether you want high precision, lower recall or higher recall, lower precision, you can end up predicting y equals" "one when each of x is greater than some threshold." "And so, in general, for most crossfires, there is going to be a trade off between precision and recall." "And as you vary the value of this threshold, this value, this special that I have joined here, you can actually plot us some curve that trades off precision and recall, where a value up here, this would correspond" "to a very high value of the threshold, maybe threshold equals over 0.99, so that say, predict y equals 1 only where no more than 99 percent confident, at least 99 percent probability this once, so" "that will be a precision relatively low recall, whereas the point down here will correspond to a value of the threshold that's much lower, maybe 0.01." "When in doubt at all, put down y1." "And if you do that, you end up with a much lower precision higher recall crossfire." 
"And as you vary the threshold, if you want, you can actually trace all the curve from your crossfire to see the range of different values you can get for precision recall." "And by the way, the position recall curve can look like many different shapes." "Sometimes it'll look this, sometimes it'll look like that." "Now, there are many different possible shapes in the position of recall curve, depending on the details of the crossfire." "So this raises another interesting question which is, is there a way to choose this threshold automatically?" "Or, more generally, if we have a few different algorithms or a few different ideas for algorithms, how do we compare different precision recall numbers?" "completely." "Suppose we have three different learning algorithms, or actually maybe these are three different learning algorithms, may be these are the same algorithm, but just with different values for the threshold." "How do we decide which of these algorithms is best?" "One of the things we talked about earlier is the importance of a single real number evaluation metric." "And that is the idea of having a number that just tells you how well is your Crossfire doing." "But by switching to the precision recall metric, we've actually lost that." "We now have two real numbers." "And so we often end up facing situations, like if we are trying to compare algorithm 1 to algorithm 2, we end up asking ourselves, Is a position of point five and a recall of point four, well" "is that better or worse than a position of point seven or a recall point one?" "If every time you try on a new algorithm you end up having to sit around and think well, maybe point five point four, is better than point seven point one, maybe not, I do not know." "If you end up having to sit around and think and make these decisions." "that really slows down your decision making process, for what changes are useful to incorporate into your algorithm." "Where as in contrast, if we had a single real number evaluation metric, like a number that just tells us is either algorithm 1 or is algorithm 2 better." "That helps us much more quickly decide which algorithm to go with and helps us as well to much more quickly evaluate different changes that we may be contemplating for an algorithm." "So, how can we get a single real number evaluation metric." "One natural thing that you might try is to look at the average between precision and recall." "So using p and r to denote position and recall, what you could do is just compute the average and look at what crossfire has the highest average value." "But this turns out not to be such a good solution because, similar to the example we had earlier," "it turns out that if we have a crossfire that predicts" "y1 all the time, then if you do that, you can get a very high recall." "That's you end up with a very low value of Vision." "Conversely,if you have a crossfire that predicts y0 almost all" "the time, that is, if it predicts y one very sparingly." "This corresponds to setting a very high threshold using the notation of previous line." "And then you can actually end up with a very high precision with a very low recall." "So the two extremes of either are a very high threshold or a very low threshold, neither of them would give it paticularary good crossfire." "And we recognize that is by seeing if we end up with a very low precision or a very low recall." "And if you just take the average of people's ro2." "One does the example the average is actually highest for algorithm 3." 
"Even though you can get that sort of performance by predicting y1 all the time." "And that is just not a very good classifier, right?" "You predict y equals one all the time is just not a useful classifier if all it does is prints out y equals one." "And so algorithm one or algorithm two would be more useful than algorithm three, but in this example algorithm three has a higher average value of precision recall than algorithm one and two." "So we usually think of this average of precision recall as not a particularly good way to evaluate our learning algorithm." "In contrast, there is a different way of combining precision recall." "It is called the f score and it uses that formula." "So, in this example, here are the f scores." "And so we would tell from these f scores and we'll say algorithm 1 has the highest f score." "Algorithm 2 has the second highest and algorithm 3 has the lowest and so you know, if we go by the f score, we would pick probably algorithm of 1 over the others." "The f score, which is also called the f1 score, is usually written f1 score that I have here, but often people will just say f score." "It determines use is a little bit like taking the average of precision of recall, but it gives the lower value of precision and recall" " whichever it is - it gives it a higher weight." "And so, you see in the numerator here that the f score takes a product or position of equal." "And so, if either position is 0 or recall is equal to" "0, the f score will be equal to o." "So in that sense, it kind of combines position and recall." "but for the f score to be large, both position and recall have to be pretty large." "I should say that there are many different possible formulas for combining position and recall." "This f score formula is really, maybe just one out of a much larger number of possibilities, but historically or traditionally this is what people in machine learning use." "And the term f score, you know, it doesn't really mean anything, so don't worry about why it's called f score or f1 score." "But this usually gives you the effect that you want because if either position is" "0 or recall is 0, this gives you a very low f score." "And so, to have a high f score you can't need a preserve quality 1 and completely if p equals zero or i equals zero then this gives you the f score equals zero." "Where as a perfect f score, so if position equals one and [xx] equals one that would give you an f score that's equal to one times one over two times two." "So the f score will be equal to 1 if you have perfect precision and perfect recall." "And intermediate values between 0 and 1, this usually gives a reasonable rank ordering of different classifiers." "So this video we talked about the notion of trading off between position and recall and how we can vary the threshold that we use to decide whether to predict y equals one or y equals zero." "This threshold that says do we need to be at least seventy percent confident or ninety percent confident or whatever before we predict y equals one and by varying the threshold you can control a trade off between precision and recall." "Ross talked about the f score which takes precision and recall and gives you a single real number evaluation metric." "And of course, if your goal is to automatically set that threshold, to decide which one of y equals 1 or y equals 0, one pretty reasonable way to do that would also be to try a range of different values of thresholds." 
"So, try a range of values of thresholds and evaluate these different threshold on say your cross validation set, and then to pick whatever value of threshold gives you the highest f score on your cross validation setting." "That would be a pretty reasonable way to automatically chose the threshold for your crossfire as well." "In the last few videos, we talked about a collaborative filtering algorithm." "In this video I'm going to say a little bit about the vectorization implementation of this algorithm." "And also talk a little bit about other things you can do with this algorithm." "For example, one of the things you can do is, given one product can you find other products that are related to this so that for example, a user has recently been looking at one product." "Are there other related products that you could recommend to this user?" "So let's see what we could do about that." "What I'd like to do is work out an alternative way of writing out the predictions of the collaborative filtering algorithm." "To start, here is our data set with our five movies and what I'm going to do is take all the ratings by all the users and group them into a matrix." "So, here we have five movies and four users, and so this matrix y is going to be a 5 by 4 matrix." "It's just you know, taking all of the elements, all of this data." "Including question marks, and grouping them into this matrix." "And of course the elements of this matrix of the (i, j) element of this matrix is really what we were previously writing as y superscript i, j." "It's the rating given to movie i by user j." "Given this matrix y of all the ratings that we have, there's an alternative way of writing out all the predictive ratings of the algorithm." "And, in particular if you look at what a certain user predicts on a certain movie, what user j predicts on movie i is given by this formula." "And so, if you have a matrix of the predicted ratings, what you would have is the following" "matrix where the i, j entry." "So this corresponds to the rating that we predict using j will give to movie i" "is exactly equal to that theta j transpose" "XI, and so, you know, this is a matrix where this first element the one-one element is a predictive rating of user one or movie one and this element, this is the one-two element is the predicted rating" "of user two on movie one, and so on, and this is the predicted rating of user one on the last movie and if you want, you know, this rating is what we would have predicted for this value" "and this rating is what we would have predicted for that value, and so on." "Now, given this matrix of predictive ratings there is then a simpler or vectorized way of writing these out." "In particular if I define the matrix x, and this is going to be just like the matrix we had earlier for linear regression to be" "sort of x1 transpose x2 transpose down to" "x of nm transpose." "So I'm take all the features for my movies and stack them in rows." "So if you think of each movie as one example and stack all of the features of the different movies and rows." "And if we also to find a matrix capital theta," "and what I'm going to do is take each of the per user parameter vectors, and stack them in rows, like so." "So that's theta 1, which is the parameter vector for the first user." "And, you know, theta 2, and so, you must stack them in rows like this to define a matrix capital theta and so I have" "nu parameter vectors all stacked in rows like this." 
"Now given this definition for the matrix x and this definition for the matrix theta in order to have a vectorized way of computing the matrix of all the predictions you can just compute x times" "the matrix theta transpose, and that gives you a vectorized way of computing this matrix over here." "To give the collaborative filtering algorithm that you've been using another name." "The algorithm that we're using is also called low rank" "matrix factorization." "And so if you hear people talk about low rank matrix factorization that's essentially exactly the algorithm that we have been talking about." "And this term comes from the property that this matrix x times theta transpose has a mathematical property in linear algebra called that this is a low rank matrix and so that's what gives rise to this name low rank matrix factorization for these" "algorithms, because of this low rank property of this matrix x theta transpose." "In case you don't know what low rank means or in case you don't know what a low rank matrix is, don't worry about it." "You really don't need to know that in order to use this algorithm." "But if you're an expert in linear algebra, that's what gives this algorithm, this other name of low rank matrix factorization." "Finally, having run the collaborative filtering algorithm here's something else that you can do which is use the learned features in order to find related movies." "Specifically for each product i really for each movie i, we've learned a feature vector xi." "So, you know, when you learn a certain features without really know that can the advance what the different features are going to be, but if you run the algorithm and perfectly the features will tend to capture what are" "the important aspects of these different movies or different products or what have you." "What are the important aspects that cause some users to like certain movies and cause some users to like different sets of movies." "So maybe you end up learning a feature, you know, where x1 equals romance, x2 equals action similar to an earlier video and maybe you learned a different feature x3 which is a degree to which this is a comedy." "Then some feature x4 which is, you know, some other thing." "And you have N features all together and after" "you have learned features it's actually often pretty difficult to go in to the learned features and come up with a human understandable interpretation of what these features really are." "But in practice, you know, the features even though these features can be hard to visualize." "It can be hard to figure out just what these features are." "Usually, it will learn features that are very meaningful for capturing whatever are the most important or the most salient properties of a movie that causes you to like or dislike it." "And so now let's say we want to address the following problem." "Say you have some specific movie i and you want to find other movies j that are related to that movie." "And so well, why would you want to do this?" "Right, maybe you have a user that's browsing movies, and they're currently watching movie j, than what's a reasonable movie to recommend to them to watch after they're done with movie j?" "Or if someone's recently purchased movie j, well, what's a different movie that would be reasonable to recommend to them for them to consider purchasing." "So, now that you have learned these feature vectors, this gives us a very convenient way to measure how similar two movies are." "In particular, movie i has a feature vector xi." 
"and so if you can find a different movie, j, so that the distance between xi and xj is small," "then this is a pretty strong indication that, you know, movies j and i are somehow similar." "At least in the sense that some of them likes movie i, maybe more likely to like movie j as well." "So, just to recap, if your user is looking at some movie i and if you want to find the 5 most similar movies to that movie in order to recommend" "5 new movies to them, what you do is find the five movies j, with the smallest distance between the features between these different movies." "And this could give you a few different movies to recommend to your user." "So with that, hopefully, you now know how to use a vectorized implementation to compute all the predicted ratings of all the users and all the movies, and also how to do things like use learned features to find what might be movies" "and what might be products that aren't related to each other." "In the previous video, we talked about evaluation metrics." "In this video, I'd like to switch tracks a bit and touch on another important aspect of machine learning system design, which will often come up, which is the issue of how much data to train on." "Now, in some earlier videos, I had cautioned against blindly going out and just spending lots of time collecting lots of data, because it's only sometimes that that would actually help." "But it turns out that under certain conditions, and I will say in this video what those conditions are, getting a lot of data and training on a certain type of learning algorithm, can be a very effective way to get" "a learning algorithm to do very good performance." "And this arises often enough that if those conditions hold true for your problem and if you're able to get a lot of data, this could be a very good way to get a very high performance learning algorithm." "So in this video, let's talk more about that." "Let me start with a story." "Many, many years ago, two researchers that I know, Michelle Banko and" "Eric Broule ran the following fascinating study." "They were interested in studying the effect of using different learning algorithms versus trying them out on different training set sciences, they were considering the problem of classifying between confusable words, so for example, in the sentence:" "for breakfast I ate, should it be to, two or too?" "Well, for this example, for breakfast I ate two, 2 eggs." "So, this is one example of a set of confusable words and that's a different set." "So they took machine learning problems like these, sort of supervised learning problems to try to categorize what is the appropriate word to go into a certain position in an English sentence." "They took a few different learning algorithms which were, you know, sort of considered state of the art back in the day, when they ran the study in" "2001, so they took a variance, roughly a variance on logistic regression called the Perceptron." "They also took some of their algorithms that were fairly out back then but somewhat less used now so when the algorithm also very similar to which is a regression but different in some ways, much used somewhat less, used" "not too much right now took what's called a memory based learning algorithm again used somewhat less now." "But I'll talk a little bit about that later." "And they used a naive based algorithm, which is something they'll actually talk about in this course." "The exact algorithms of these details aren't important." 
"Think of this as, you know, just picking four different classification algorithms and really the exact algorithms aren't important." "But what they did was they varied the training set size and tried out these learning algorithms on the range of training set sizes and that's the result they got." "And the trends are very clear right first most of these outer rooms give remarkably similar performance." "And second, as the training set size increases, on the horizontal axis is the training set size in millions" "go from you know a hundred thousand up to a thousand million that is a billion training examples." "The performance of the algorithms all pretty much monotonically increase and the fact that if you pick any algorithm may be pick a "inferior algorithm" but if you give that "inferior algorithm" more data, then from" "these examples, it looks like it will most likely beat even a "superior algorithm"." "So since this original study which is very influential, there's been a range of many different studies showing similar results that show that many different learning algorithms you know tend to, can sometimes, depending on details, can give pretty similar ranges" "of performance, but what can really drive performance is you can give the algorithm a ton of training data." "And this is, results like these has led to a saying in machine learning that often in machine learning it's not who has the best algorithm that wins, it's who has the" "most data So when is this true and when is this not true?" "Because we have a learning algorithm for which this is true then getting a lot of data is often maybe the best way to ensure that we have an algorithm with very high performance rather than you know, debating worrying about exactly which of these items to use." "Let's try to lay out a set of assumptions under which having a massive training set we think will be able to help." "Let's assume that in our machine learning problem, the features x have sufficient information with which we can use to predict y accurately." "For example, if we take the confusable words all of them that we had on the previous slide." "Let's say that it features x capture what are the surrounding words around the blank that we're trying to fill in." "So the features capture then we want to have, sometimes for breakfast I have black eggs." "Then yeah that is pretty much information to tell me that the word I want in the middle is TWO and that is not word TO and its not the word TOO." "So" "the features capture, you know, one of these surrounding words then that gives me enough information to pretty unambiguously decide what is the label y or in other words what is the word that I should be using to fill" "in that blank out of this set of three confusable words." "So that's an example what the future ex has sufficient information for specific y." "For a counter example." "Consider a problem of predicting the price of a house from only the size of the house and from no other" "features." "So if you imagine I tell you that a house is, you know, 500 square feet but I don't give you any other features." "I don't tell you that the house is in an expensive part of the city." "Or if I don't tell you that the house, the number of rooms in the house, or how nicely furnished the house is, or whether the house is new or old." 
"If I don't tell you anything other than that this is a" "500 square foot house, well there's so many other factors that would affect the price of a house other than just the size of a house that if all you know is the size, it's actually very difficult to predict the price accurately." "So that would be a counter example to this assumption that the features have sufficient information to predict the price to the desired level of accuracy." "The way I think about testing this assumption, one way I often think about it is, how often I ask myself." "Given the input features x, given the features, given the same information available as well as learning algorithm." "If we were to go to human expert in this domain." "Can a human experts actually or can human expert confidently predict the value of y." "For this first example if we go to, you know an expert human English speaker." "You go to someone that speaks English well, right, then a human expert in English just read most people like you and me will probably we would probably be able to predict what word should go in here, to a good English" "speaker can predict this well, and so this gives me confidence that x allows us to predict y accurately, but in contrast if we go to an expert in human prices." "Like maybe an expert realtor, right, someone who sells houses for a living." "If I just tell them the size of a house and I tell them what the price is well even an expert in pricing or selling houses wouldn't be able to tell me and so this is fine that for the housing price example knowing" "only the size doesn't give me enough information to predict the price of the house." "So, let's say, this assumption holds." "Let's see then, when having a lot of data could help." "Suppose the features have enough information to predict the value of y." "And let's suppose we use a learning algorithm with a large number of parameters so maybe logistic regression or linear regression with a large number of features." "Or one thing that I sometimes do, one thing that I often do actually is using neural network with many hidden units." "That would be another learning algorithm with a lot of parameters." "So these are all powerful learning algorithms with a lot of parameters that can fit very complex functions." "So, I'm going to call these, I'm going to think of these as low-bias algorithms because you know we can fit very complex functions" "and because we have a very powerful learning algorithm, they can fit very complex functions." "Chances are, if we run these algorithms on the data sets, it will be able to fit the training set well, and so hopefully the training error will be slow." "Now let's say, we use a massive, massive training set, in that case, if we have a huge training set, then hopefully even though we have a lot of parameters but if the training set is sort of even much" "larger than the number of parameters then hopefully these albums will be unlikely to overfit." "Right because we have such a massive training set and by unlikely to overfit what that means is that the training error will hopefully be close to the test error." "Finally putting these two together that the train set error is small and the test set error is close to the training error what this two together imply is that hopefully the test set error" "will also be small." "Another way to think about this is that in order to have a high performance learning algorithm we want it not to have high bias and not to have high variance." 
"So the bias problem we're going to address by making sure we have a learning algorithm with many parameters and so that gives us a low bias alorithm" "and by using a very large training set, this ensures that we don't have a variance problem here." "So hopefully our algorithm will have no variance and so is by pulling these two together, that we end up with a low bias and a low variance learning algorithm and this allows us to do well on the test set." "And fundamentally it's a key ingredients of assuming that the features have enough information and we have a rich class of functions that's why it guarantees low bias," "and then it having a massive training set that that's what guarantees more variance." "So this gives us a set of conditions rather hopefully some understanding of what's the sort of problem where if you have a lot of data and you train a learning algorithm with lot of parameters, that might be a good way to give" "a high performance learning algorithm and really, I think the key test that" "I often ask myself are first, can a human experts look at the features x and confidently predict the value of y." "Because that's sort of a certification that y can be predicted accurately from the features x and second, can we actually get a large training set, and train the learning algorithm with a lot of parameters in the training" "set and if you can't do both then that's more often give you a very kind performance learning algorithm." "By now, you see the range of different learning algorithms." "Within supervised learning, the performance of many supervised learning algorithms will be pretty similar and when that is less more often be whether you use learning algorithm A or learning algorithm" "B but when that is small there will often be things like the amount of data you are creating these algorithms on." "That's always your skill in applying this algorithms." "Seems like your choice of the features that you designed to give the learning algorithms and how you choose the regularization parameter and things like that." "But there's one more algorithm that is very powerful and its very widely used both within industry and in Academia." "And that's called the support vector machine, and compared to both the logistic regression and neural networks, the" "support vector machine or the SVM sometimes gives a cleaner and sometimes more powerful way of learning complex nonlinear functions." "And so I'd like to take the next videos to talk about that." "Later in this course, I will do a quick survey of the range of different supervised learning algorithms just to very briefly describe them but the support vector machine, given its popularity and how popular it is, this will be" "the last of the supervised learning algorithms that I'll spend a significant amount of time on in this course." "As with our development of ever learning algorithms, we are going to start by talking about the optimization objective, so let's get started on this algorithm." "In order to describe the support vector machine, I'm actually going to start with logistic regression and show how we can modify it a bit and get what is essentially the support vector machine." "So, in logistic regression we have our familiar form of the hypotheses there and the sigmoid activation function shown on the right." "And in order to explain some of the math, I'm going to use z to denote failure of transpose x here." "Now let's think about what we will like the logistic regression to do." 
"If we have an example with y equals 1, and by this I mean an example in either a training set or the test set, you know, order cross valuation set where y is equal to 1 then" "we are sort of hoping that h of x will be close to 1." "So, right, we are hoping to correctly classify that example" "and what, having h of x close to 1, what that means is that theta transpose x must be much larger than 0, so there's greater than, greater than sign, that means much, much greater" "than 0 and that's because it is z, that is theta transpose x is when z is much bigger than" "0, is far to the right of this figure that, you know, the output of logistic regression becomes close to 1." "Conversely, if we have an example where y is equal to 0 then what were hoping for is that the hypothesis will output the value to close to 0 and that corresponds to theta transpose x or z pretty much" "less than 0 because that corresponds to hypothesis of outputting a value close to 0." "If you look at the cost function of logistic regression, what you find is that each example x, y, contributes a term like this to the overall cost function." "All right." "So, for the overall cost function, we usually, we will also have a sum over all the training examples and 1 over m term." "But this expression here." "That's the term that a single training example contributes to the overall objective function for logistic regression." "Now, if I take the definition for the full of my hypothesis and plug it in, over here," "the one I get is that each training example contributes this term, right?" "Ignoring the 1 over m but it contributes that term to be my overall cost function for logistic regression." "Now let's consider the 2 cases of when y is equal to 1 and when y is equal to 0." "In the first case, let's suppose that y is equal to 1." "In that case, only this first row in the objective matters because this" "1 minus y term will be equal to 0 if y is equal to 1." "So, when y is equal to 1 when in an example, x, y when y is equal to" "1, what we get is this term minus log 1 over 1 plus e to the negative z." "Where, similar to the last slide," "I'm using z to denote data transpose x." "And of course, in the cost we actually had this minus y but we just said that you know, if y is equal to 1." "So that's equal to 1." "I just simplified it a way in the expression that" "I have written down here." "And if we plot this function, as a function of z, what you find is that you get this curve shown on the lower left of this line and thus we also see that when z is equal to large that is to when" "theta transpose x is large that corresponds to a value of z that gives us a very small value, a very small contribution to the cost function and this kind of explains why when logistic regression sees a positive example" "with y equals 1 it tries to set theta transpose x to be very large because that corresponds to this term in a cost function being small." "Now, to build the Support Vector Machine, here is what we are going to do." "We are going to take this cost function, this minus log 1 over 1 plus e to the negative z and modify it a little bit." "Let me take this point" "1 over here and let me draw the course function that I'm going to use, the new cost function is gonna be flat from here on out" "and then I'm going to draw something that grows as a straight line similar to logistic regression but this is going to be the straight line in this posh inning." "So the curve that" "I just drew in magenta." 
"The curve that I just drew purple and magenta." "So, it's a pretty close approximation to the cost function used by logistic regression except that it is now made out of two line segments." "This is flat potion on the right and then this is a straight line portion on the left." "And don't worry too much about the slope of the straight line portion." "It doesn't matter that much but that's the new cost function we're going to use where y is equal to 1 and you can imagine you should do something pretty similar to logistic regression but it turns out that this will give the" "support vector machine computational advantage that will give us later on an easier optimization problem, that will be easier for stock trades and so on." "We just talked about the case of y equals to 1." "The other case is if y is equal to 0." "In that case, if you look at the cost then only this second term will apply because the first term goes a way where if y is equal to 0 then nearly it was 0 here." "So your left only with the second term of the expression above" "and so the cost of an example or the contribution of the cost function is going to be given by this term over here and if you plug that as a function z." "So, I have here z on the horizontal axis, you end up with this group and for the support vector machine, once again we're going to replace this blue line with something similar and see if we can" "replace it with a new cost, there is flat out here." "There's 0 out here and then it grows as a straight line like so." "So, let me give these two functions names." "This function on the left, I'm going to call" "cost subscript 1 of z." "And this function on the right, I'm going to call cost subscript 0" "of z." "And the subscript just refers to the cost corresponding to y is equal to 1 versus y is equal to 0." "Armed with these definitions, we are now ready to build the support vector machine." "Here is the cost function j of theta that we have for logistic regression." "In case this the equation looks a bit unfamiliar is because previously we had a minor sign outside, but here what I did was I instead moved the minor signs inside this expression." "So it just, you know, makes it look a little more different." "For the support vector machine what we are going to do is essentially take this, and replace this with cost 1 of z, that is cost 1 of theta transpose x." "I'm going to take this and replace it with cost" "0 of z." "This is cost 0 of theta transpose x where the cost 1 function is what we had on the previous line that looks like this and the cost 0 function, again what we have on the previous line that looks like this." "So, what we have for the support vector machine is an minimizationminimalization problem of one of" "1 over m, sum over my training examples of y(i) times cost" "1 of theta transpose x(i) plus 1 minus y(i) times cost zero of theta transpose x(i)." "And then plus my usual regularization" "parameter like so." "Now by convention for the Support" "Vector Machine, we actually write things slightly differently." "We parametrize this just very slightly differently." "First, we're going to get rid of the 1 over m terms and this just, this just happens to be a slightly different convention that people use for support vector machines compared to for logistic regression." "But here's what" "I mean, you know, what I'm going to do is I am just gonna get rid of this 1 over m terms and this should give me the same optimal value for theta, right." "Because 1 over m is just a constant." 
"So, you know, whether I solve this minimization problem with 1 over m in front or not," "I should end up with the same optimal value of theta." "Here is what I mean, to give you a concrete example, suppose I had a minimization problem that you know minimize over a real number u of u minus 5 squared," "plus 1, right." "Well, the minimum of this happens, happens to know the minimum of this is u equals 5." "Now if I want to take this objective function and multiply it by 10, so here my minimization problem is minimum of u of 10, u minus" "5 squared plus 10." "Well the value of u that minimizes this is still u equals 5, right." "So, multiplying something that you are minimizing over by some constant, 10 in this case, it does not change the value of u that gives us, that minimizes this function." "So the same way what I've done by crossing out this m is, all I am doing is multiplying my objective function by some constant m and it doesn't change the value of theta that achieves the minimum." "The second bit of notational change, we're just designating the most standard convention, when using as the SVM, instead of logistic regression as a following." "So, for logistic regression, we had two terms to our objective function." "The first is this term which is the cost that comes from the training set and the second is this term, which is the regularization term and what we had, we had to control the trade off between these by saying, you know, that we" "wanted to minimize A plus and then my regularization parameter lambda," "and then times some other term B, right?" "Where as I am using A to denote this first term, and I am using B to denote that second term, may be without the lambda, and instead of prioritizing this as A plus lambda B, we could, and so what we" "did was by setting different values for this regularization parameter lambda." "We could trade off the relative way between how much we want to fit the training set well, as minimizing A, versus how much we care about keeping the values of the parameters small." "So that would be for the parameters B. For the Support" "Vector Machine, just by convention we're going to use a different parameter." "So instead of using lambda here to control the relative waiting between you know, the first and second terms, we are still going to use a different parameter which by convention is called C and we instead are going to minimize C times" "A plus B. So for logistic regression if we send a very large value of lambda, that means to give" "B a very high weight." "Here is that if we set C to be a very small value." "That corresponds to giving B much larger weight than C than A." "So this is just a different way of controlling the trade off or just a different way of parametrizing how much we care about optimizing the first term versus how much we care about optimizing the second term." "And if you want you can think of this as the parameter" "C playing a role similar to 1 over lambda and it's not that it's two equations or these two expressions will be equal, it's equals 1 over lambda and it's not that these two equations or these two expressions will be equal." "It's equals t 1 over lambda." "That's not the case where it bothers that if C is equal to 1 over lambda then these two optimization objectives should give you the same value, same optimal value of theta." "So just filling that in." "I'm gonna cross out lambda here and write in the constant C there." 
"So,that's gives us our overall optimization objective function for the Support Vector" "Machine and where you minimize that function then what you have is the parameters learned by SVM." "Finally on light of logistic regression, the Support" "Vector Machine doesn't output the probability." "Instead what we have is, we have this cost function which we minimize to get the parameters theta and what the Support Vector Machine does," "is it just makes the prediction of y being equal 1 or 0 directly." "So the hypothesis, where I predict, 1, if theta transpose x is greater than or equal to" "0 and I'll predict 0 otherwise." "And so, having learned the parameters theta, this is the form of the hypothesis for the support vector machine." "So, that was a mathematical definition of what a support vector machine does." "In the next few videos, let's try to get back to intuition about what this optimization objective leads to and whether the source of the hypothesis a SVM will learn and also talk about how to modify this just a little bit to" "learn complex, nonlinear functions." "Sometime people talk about support vector machines,as large margin classifiers,in this video I'd like to tell you what that means, and this will also give us a useful picture of what an hypothesis may look like." "Here's my cost function for the support vector machine where here on the left" "I've plotted my cost 1 of z function that I used for positive examples and on the right I've plotted my" "zero of 'Z' function graph here, with 'Z' on the horizontal axis." "Now, let's think about what it takes to these cos functions small." "If you have a positive example, so if y is equal to" "1, then cos 1 of" "Z is zero only when" "Z is greater than or equal to 1." "So in other words, if you have a positive example, we really want theta [xx] to be greater than or equal to 1 and conversely if y is equal to zero, of this cos zero of z function," "then it's only in this region that z is less than equal to 1 we have the course is zero as z is equals to zero, and this is an interesting property of the support vector machine right, which is" "that, if you have positive example so if y is equal to one, then all we really need is that data transport is greater than equal zero." "And that would mean we classify correctly because if theta [xx] x is greater than zero our hypothesis will predict zero." "And similarly, if you have a negative example, they less than zero and that will make sure we get the example right." "But the support machine wants to do more than that." "It says, you don't just barely get the example right." "So then don't just have it just a little bit bigger thans zero." "What i really want to is for this to be quite a lot because its zero, say maybe quick printed and you can and I want this to be much less than zero." "Maybe I want it less than or equal to -1." "And so this builds in an extra safety factor or safety margin factor the support vector machine logstic regressions are something similar too of course but lets see what happens or lets see what comes consequences of this are, in the" "context of the support vector machine." "Concretely, what we'd like to do next is consider a case where we set this constant C to be a very large value, so let's imagine we shall see through a very large value, may be a hundred thousand, some huge number." "Lets see what the support vector machine will do." 
"If C is very, very large, then when minimizing this optimization objective, we're going to be highly motivated to choose a value, so that this first term is equal to zero." "So let's try to understand the optimization problem in the context of, what would it take to make this first term in the objective equal to zero, because you know, maybe we'll set C to some huge constant, and this" "will hope, this should give us additional intuition about what sort of hypotheses a support vector machine learns." "So we saw already that whenever you have a training example with a label of y1 if you want to make that first term zero, what you need is is to find a value of theta so that theta is greater" "than or equal to 1." "Whenever we have an example, we label zero in order to make sure that the cost cause zero of Z, in order to make sure that the cost cause zero we need that theta transfer was xy is less than or" "equal to, minus one?" "So, if we think of our optimization problem as now, really choosing parameters and showing that this first term is equal to zero, what we're left with is the following optimization problem, we're going to minimize that first" "term zero, so if C times zero, because we're going to choose parameters so that's equal to zero, plus one half and then you know that second term and this first term is 'C' times zero," "so let's just cross that out because I know that's going to be zero." "And this will be subject to the constraint that theta transpose xi is greater than or equal to" "one, if yi" "Is equal to one and the xi is less than or equal to minus one whenever you have a negative example and it turns out that when you solve this authorization problem it minimize this as a function of the parameters date" "you get a very interesting decision boundary." "Concretely if you look at a data set like this with positive and negative examples, this data" "is linearly separable and by that I mean that there exists s straight line, there is as many straight rate on it as they can separate the positive the good example is perfectly." "For example, here is one discussion boundary that separates the positive and negative example, but somehow that doesn't look like a very natural one, right or by drawing even worse you know here's another decision boundary that" "separates the positive and negative examples and I'll give examples we'll just be having but neither of those seem like categorical choices." "The Support Vector Machines will instead choose this decision boundary, whichI'm drawing in black." "And that seems like a much better." "boundary then either of the one's that I drew in magenta or in green." "The black line seems like a more robust separator, it does a better job of separating the positive and negative example and mathematically, what that does is, this boundary has a larger distance." "That distance is called the margin, when I draw up this two extra blue lines, we see that the black decision boundary has some larger minimum distance from any of my." "The good example is we are extra momentary the green lines thick humble thick close to the training example that seems to do a less great job separating the positive and negative classes than my black line." "And so this distance is called the margin of the support fax machine and this gives the SVM a certain robustness, because it tries to separate the data with as a large a margin as possible." 
"So the support vector machine is sometimes also called a large margin classifier and this is actually a consequence of the optimization problem we wrote down on the previous slide." "I know that you might be wondering how is it that the optimization problem I wrote down in the previous while, how does that lead to this large margin classifier." "I know I haven't explained that yet." "And in the next video" "I'm going to sketch a little bit of the intuition about why that optimization problem gives us this large margin classifier, but this is a useful feature to keep in mind if you are trying to understand what are the" "source of hypothesis that the Nesbian will choose." "That is, trying to separate the positive and negative examples with as big a margin as possible." "Once you say one last thing about large margin classifier in this intuition, so we left out this large margin classification setting in the case of when C organization concepts was very large, I think they say that's a hundred thousand something." "So give a dataset like this, maybe we'll choose that decision boundary that separate the possible examples of large margins." "Now, SVM is actually sliding more sophisticated than this large margin view might suggest and in particular all you're doing is use a large margin classifier then your learning algorithms can be sensitive to out liners, so lets just" "add an extra your positive example like that shown on the screen." "If he had one example then it seems as that the separate data with a large margin," "maybe I'll end up learning the decision boundary like that right by this little gentle line and it's really not clear that based on the single outlier based on a single example and it's really not clear that it's" "actually a good idea to change my decision boundary from the black one over to the magenta one." "So, if C, if the regularization parameter C were very large, then this is actually what SVM will do, it will change the discussion boundary from the black to the magenta one but if c is reasonably small if" "you were to use the C, not too large then you still end up with this black decision boundary you the data were not." "And of course if the data were not linearly separable so if you had some positive examples in here, or if you had some negative examples in here then the SVM will also do the right thing." "And so this picture of a large margin classifier that's really, that's really the picture that is only for the case of when the regulations and C is very large, and just to remind you this corresponds C" "plays a role similar to one over Lambda when Lambda is the regularization parameter we have previously have so it's only that one of the Lambda is very large or if" "Lambda with is very small that you end up with things like this Magenta decision boundary, but" "in practice when the applying support vector machines" ",when C is not very, very large like it can, we want if it can do a better job ignoring the few other lines like here." "And it'll fine and do reasonable things even if your data is not linearly separable." "But when we talk about buyers and theories in the context of support vector machines which will do a little bit later, hopefully all of you sterols involve the regularization perimeter will become clearer at that time." 
"So I hope that gives some intuition about how this functions as a large margin crossfire that tries to separate the data with a large margin, technically this picture of this view is true only when the parameter C is very large" ", which is a useful way to think about support vector machines." "There was one missing step in this video which is, why is it that optimization problem we wrote down on these lines, how does that actually lead to the large margin classifier, I didn't do that in this video," "in the next video I will sketch a little bit more of the man behind that to explain that separate reasoning of how the optimization problem we wrote out results in a large margin classifier." "In this video, I'd like to tell you a bit about the math behind large margin classification." "This video is optional, so please feel free to skip it." "It may also give you better intuition about how the optimization problem of the support vex machine, how that leads to large margin classifiers." "In order to get started, let me first remind you of a couple of properties of what vector inner products look like." "Let's say I have two vectors" "U and V, that look like this." "So both two dimensional vectors." "Then let's see what U transpose V looks like." "And U transpose V is also called the inner products between the vectors U and V." "Use a two dimensional vector, so" "I can on plot it on this figure." "So let's say that's the vector U. And what I mean by that is if on the horizontal axis that value takes whatever value" "U1 is and on the vertical axis the height of that is whatever U2 is the second component of the vector U. Now, one quantity that will be nice to have is the norm" "of the vector U. So, these are, you know, double bars on the left and right that denotes the norm or length of" "U. So this just means; really the euclidean length of the vector U. And this is Pythagoras theorem is just equal to U1 squared plus U2 squared square root, right?" "And this is the length of the vector U. That's a real number." "Just say you know, what is the length of this, what is the length of this vector down here." "What is the length of this arrow that I just drew, is the normal view?" "Now let's go back and look at the vector V because we want to compute the inner product." "So V will be some other vector with, you know, some value V1, V2." "And so, the vector" "V will look like that, towards V like so." "Now let's go back and look at how to compute the inner product between U and V. Here's how you can do it." "Let me take the vector V and project it down onto the vector U. So I'm going to take a orthogonal projection or a 90 degree projection, and project it down onto U like so." "And what I'm going to do measure length of this red line that I just drew here." "So, I'm going to call the length of that red line P. So, P is the length or is the magnitude of the projection" "of the vector V onto the vector U. Let me just write that down." "So, P is the length of the projection of the vector V onto the vector U. And it is possible to show that unit product U transpose V, that this is going to be equal to P times the" "norm or the length of the vector U. So, this is one way to compute the inner product." "And if you actually do the geometry figure out what" "P is and figure out what the norm of U is." "This should give you the same way, the same answer as the other way of computing unit product." "Right." "Which is if you take U transpose V then U transposes this U1 U2, its a one by two matrix, 1 times V. 
And so this should actually give you" "U1, V1 plus U2, V2." "And so the theorem of linear algebra that these two formulas give you the same answer." "And by the way, U transpose" "V is also equal to" "V transpose U. So if you were to do the same process in reverse, instead of projecting" "V onto U, you could project" "U onto V. Then, you know, do the same process, but with the rows of U and V reversed." "And you would actually, you should actually get the same number whatever that number is." "And just to clarify what's going on in this equation the norm of U is a real number and P is also a real number." "And so U transpose V is the regular multiplication as two real numbers of the length of P times the normal view." "Just one last detail, which is if you look at the norm of" "P, P is actually signed so to the right." "And it can either be positive or negative." "So let me say what I mean by that, if U is a vector that looks like this and V is a vector that looks like this." "So if the angle between U and V is greater than ninety degrees." "Then if I project V onto" "U, what I get is a projection it looks like this and so that length" "P. And in this case, I will still have that U transpose V is equal to P times the norm of U. Except in this example P will be negative." "So, you know, in inner products if the angle between U and V is less than ninety degrees, then P is the positive length for that red line whereas if the angle of this angle of here is greater" "than 90 degrees then P here will be negative of the length of the super line of that little line segment right over there." "So the inner product between two vectors can also be negative if the angle between them is greater than 90 degrees." "So that's how vector inner products work." "We're going to use these properties of vector inner product to try to understand the support vector machine optimization objective over there." "Here is the optimization objective for the support vector machine that we worked out earlier." "Just for the purpose of this slide I am going to make one simplification or once just to make the objective easy to analyze and what I'm going to do is ignore the indeceptrums." "So, we'll just ignore theta 0 and set that to be equal to 0." "To make things easier to plot, I'm also going to set N the number of features to be equal to 2." "So, we have only 2 features," "X1 and X2." "Now, let's look at the objective function." "The optimization objective of the" "SVM." "What we have only two features." "When N is equal to 2." "This can be written, one half of theta one squared plus theta two squared." "Because we only have two parameters, theta one and thetaa two." "What I'm going to do is rewrite this a bit." "I'm going to write this as one half of theta one squared plus theta two squared and the square root squared." "And the reason I can do that, is because for any number, you know, W, right, the" "square roots of W and then squared, that's just equal to W. So square roots and squared should give you the same thing." "What you may notice is that this term inside is that's equal to the norm" "or the length of the vector theta and what" "I mean by that is that if we write out the vector theta like this, as you know theta one, theta two." "Then this term that I've just underlined in red, that's exactly the length, or the norm, of the vector theta." "We are calling the definition of the norm of the vector that we have on the previous line." 
"And in fact this is actually equal to the length of the vector theta, whether you write it as theta zero, theta 1, theta 2." "That is, if theta zero is equal to zero, as I assume here." "Or just the length of theta 1, theta 2; but for this line I am going to ignore theta 0." "So let me just, you know, treat theta as this, let me just write theta, the normal theta as this theta 1, theta 2 only, but the math works out either way, whether we include theta zero here or not." "So it's not going to matter for the rest of our derivation." "And so finally this means that my optimization objective is equal to one half of the norm of theta squared." "So all the support vector machine is doing in the optimization objective is it's minimizing the squared norm of the square length of the parameter vector theta." "Now what I'd like to do is look at these terms, theta transpose X and understand better what they're doing." "So given the parameter vector theta and given and example x, what is this is equal to?" "And on the previous slide, we figured out what U transpose" "V looks like, with different vectors U and V. And so we're going to take those definitions, you know, with theta and X(i) playing the roles of U and V." "And let's see what that picture looks like." "So, let's say I plot." "Let's say I look at just a single training example." "Let's say I have a positive example the drawing was across there and let's say that is my example X(i), what that really means is" "plotted on the horizontal axis some value X(i) 1 and on the vertical axis" "X(i) 2." "That's how I plot my training examples." "And although we haven't been really thinking of this as a vector, what this really is, this is a vector from the origin from 0, 0 out to" "the location of this training example." "And now let's say we have a parameter vector and" "I'm going to plot that as vector, as well." "What I mean by that is if I plot theta 1 here and theta 2 there" "so what is the inner product theta transpose X(i)." "While using our earlier method, the way we compute that is we take my example and project it onto my parameter vector theta." "And then I'm going to look at the length of this segment that I'm coloring in, in red." "And I'm going to call that P superscript I to denote that this is a projection of the i-th training example" "onto the parameter vector theta." "And so what we have is that theta transpose X(i) is equal to following what we have on the previous slide, this is going to be equal to" "P times the length of the norm of the vector theta." "And this is of course also equal to theta 1 x1" "plus theta 2 x2." "So each of these is, you know, an equally valid way of computing the inner product between theta and X(i)." "Okay." "So where does this leave us?" "What this means is that, this constrains that theta transpose X(i) be greater than or equal to one or less than minus one." "What this means is that it can replace the use of constraints that P(i) times X be greater than or equal to one." "Because theta transpose X(i) is equal to P(i) times the norm of theta." "So writing that into our optimization objective." "This is what we get where I have, instead of theta transpose X(i), I now have this P(i) times the norm of theta." "And just to remind you we worked out earlier too that this optimization objective can be written as one half times the norm of theta squared." "So, now let's consider the training example that we have at the bottom and for now, continuing to use the simplification that theta 0 is equal to 0." 
"Let's see what decision boundary the support vector machine will choose." "Here's one option, let's say the support vector machine were to choose this decision boundary." "This is not a very good choice because it has very small margins." "This decision boundary comes very close to the training examples." "Let's see why the support vector machine will not do this." "For this choice of parameters it's possible to show that the parameter vector theta is actually at 90 degrees to the decision boundary." "And so, that green decision boundary corresponds to a parameter vector theta that points in that direction." "And by the way, the simplification that theta 0 equals 0 that just means that the decision boundary must pass through the origin, (0,0) over there." "So now, let's look at what this implies for the optimization objective." "Let's say that this example here." "Let's say that's my first example, you know," "X1." "If we look at the projection of this example onto my parameters theta." "That's the projection." "And so that little red line segment." "That is equal to P1." "And that is going to be pretty small, right." "And similarly, if this example here, if this happens to be X2, that's my second example." "Then, if I look at the projection of this this example onto theta." "You know." "Then, let me draw this one in magenta." "This little magenta line segment, that's going to be P2." "That's the projection of the second example onto my, onto the direction" "of my parameter vector theta which goes like this." "And so, this little projection line segment is getting pretty small." "P2 will actually be a negative number, right so P2 is in the opposite direction." "This vector has greater than 90 degree angle with my parameter vector theta, it's going to be less than 0." "And so what we're finding is that these terms P(i) are going to be pretty small numbers." "So if we look at the optimization objective and see, well, for positive examples we need P(i) times the norm of theta to be bigger than either one." "But if P(i) over here, if P1 over here is pretty small, that means that we need the norm of theta to be pretty large, right?" "If" "P1 of theta is small and we want P1 you know times in all of theta to be bigger than either one, well the only way for that to be true for the profit that these two numbers to be large" "if P1 is small, as we said we want the norm of theta to be large." "And similarly for our negative example, we need P2" "times the norm of theta to be less than or equal to minus one." "And we saw in this example already that P2 is going pretty small negative number, and so the only way for that to happen as well is for the norm of theta to be large, but what we are doing in the optimization" "objective is we are trying to find a setting of parameters where the norm of theta is small, and so you know, so this doesn't seem like such a good direction for the parameter vector and theta." "In contrast, just look at a different decision boundary." "Here, let's say, this SVM chooses that decision boundary." "Now the is going to be very different." "If that is the decision boundary, here is the corresponding direction for theta." "So, with the direction boundary you know, that vertical line that corresponds to it is possible to show using linear algebra that the way to get that green decision boundary is have the vector of theta be at 90 degrees to it," "and now if you look at the projection of your data onto the vector" "x, lets say its before this example is my example of x1." 
"So when" "I project this on to x, or onto theta, what I find is that this is P1." "That length there is P1." "The other example, that example is and I do the same projection and what I find is that this length here is a" "P2 really that is going to be less than 0." "And you notice that now" "P1 and P2, these lengths of the projections are going to be much bigger, and so if we still need to enforce these constraints that P1 of the norm of theta is phase number one because P1 is so much bigger now." "The normal can be smaller." "And so, what this means is that by choosing the decision boundary shown on the right instead of on the left, the" "SVM can make the norm of the parameters theta much smaller." "So, if we can make the norm of theta smaller and therefore make the squared norm of theta smaller, which is why the SVM would choose this hypothesis on the right instead." "And this is how the SVM gives rise to this large margin certification effect." "Mainly, if you look at this green line, if you look at this green hypothesis we want the projections of my positive and negative examples onto theta to be large, and the only way for that to hold true this is if surrounding the green line." "There's this large margin, there's this large gap that separates" "positive and negative examples is really the magnitude of this gap." "The magnitude of this margin is exactly the values of" "P1, P2, P3 and so on." "And so by making the margin large, by these tyros P1," "P2, P3 and so on that's the SVM can end up with a smaller value for the norm of theta which is what it is trying to do in the objective." "And this is why this machine ends up with enlarge margin classifiers because itss trying to maximize the norm of these P1 which is the distance from the training examples to the decision boundary." "Finally, we did this whole derivation using this simplification that the parameter theta 0 must be equal to 0." "The effect of that as" "I mentioned briefly, is that if theta 0 is equal to" "0 what that means is that we are entertaining decision boundaries that pass through the origins of decision boundaries pass through the origin like that, if you allow theta zero to be non 0 then what that means is that you entertain the" "decision boundaries that did not cross through the origin, like that one I just drew." "And I'm not going to do the full derivation that." "It turns out that this same large margin proof works in pretty much in exactly the same way." "And there's a generalization of this argument that we just went through them long ago through that shows that even when theta" "0 is non 0, what the SVM is trying to do when you have this optimization objective." "Which again corresponds to the case of when C is very large." "But it is possible to show that, you know, when theta is not equal to 0 this support vector machine is still finding is really trying to find the large margin separator that between the positive and negative examples." "So that explains how this support vector machine is a large margin classifier." "In the next video we will start to talk about how to take some of these" "SVM ideas and start to apply them to build a complex nonlinear classifiers." "In this video, I'd like to start adapting support vector machines in order to develop complex nonlinear classifiers." "The main technique for doing that is something called kernels." "Let's see what this kernels are and how to use them." 
"If you have a training set that looks like this, and you want to find a nonlinear decision boundary to distinguish the positive and negative examples, maybe a decision boundary that looks like that." "One way to do so is to come up with a set of complex polynomial features, right?" "So, set of features that looks like this, so that you end up with a hypothesis X that predicts 1 if you know that theta 0 and plus theta 1 X1 plus dot dot dot all those polynomial features is" "greater than 0, and predict 0, otherwise." "And another way of writing this, to introduce a level of new notation that" "I'll use later, is that we can think of a hypothesis as computing a decision boundary using this." "So, theta 0 plus theta 1 f1 plus theta 2, f2 plus theta" "3, f3 plus and so on." "Where I'm going to use this new denotation f1, f2, f3 and so on to denote these new sort of features" "that I'm computing, so f1 is just X1, f2 is equal to X2, f3 is equal to this one here." "So, X1X2." "So, f4 is equal to" "X1 squared where f5 is to be x2 squared and so on and we seen previously that coming up with these high order polynomials is one way to come up with lots more features," "the question is, is there a different choice of features or is there better sort of features than this high order polynomials because you know it's not clear that this high order polynomial is what we want, and what we talked about" "computer vision talk about when the input is an image with lots of pixels." "We also saw how using high order polynomials becomes very computationally expensive because there are a lot of these higher order polynomial terms." "So, is there a different or a better choice of the features that we can use to plug into this sort of hypothesis form." "So, here is one idea for how to define new features f1, f2, f3." "On this line I am going to define only three new features, but for real problems we can get to define a much larger number." "But here's what I'm going to do in this phase of features X1, X2, and" "I'm going to leave X0 out of this, the interceptor X0, but in this phase X1 X2, I'm going to just, you know, manually pick a few points, and then call these points l1, we" "are going to pick a different point, let's call that l2 and let's pick the third one and call this one l3, and for now let's just say that I'm going to choose these three points manually." "I'm going to call these three points line ups, so line up one, two, three." "What I'm going to do is define my new features as follows, given an example X, let me define my first feature f1 to be some measure of the similarity between my training example X and my first landmark and" "this specific formula that I'm going to use to measure similarity is going to be this is E to the minus the length of" "X minus l1, squared, divided by two sigma squared." "So, depending on whether or not you watched the previous optional video, this notation, you know, this is the length of the vector" "W. And so, this thing here, this X minus l1, this is actually just the euclidean distance" "squared, is the euclidean distance between the point x and the landmark l1." "We will see more about this later." "But that's my first feature, and my second feature f2 is going to be, you know, similarity function that measures how similar X is to l2 and the game is going to be defined as the following function." 
"This is E to the minus of the square of the euclidean distance between X and the second landmark, that is what the enumerator is and then divided by 2 sigma squared and similarly f3 is, you know," "similarity between X and l3, which is equal to, again, similar formula." "And what this similarity function is, the mathematical term for this, is that this is going to be a kernel function." "And the specific kernel I'm using here, this is actually called a Gaussian kernel." "And so this formula, this particular choice of similarity function is called a Gaussian kernel." "But the way the terminology goes is that, you know, in the abstract these different similarity functions are called kernels and we can have different similarity functions" "and the specific example I'm giving here is called the Gaussian kernel." "We'll see other examples of other kernels." "But for now just think of these as similarity functions." "And so, instead of writing similarity between" "X and I, sometimes we also write this a kernel denoted you know, lower case k between x and one of my landmarks all right." "So let's see what a criminals actually do and why these sorts of similarity functions, why these expressions might make sense." "So let's take my first landmark." "My landmark l1, which is one of those points I chose on my figure just now." "So the similarity of the kernel between x and l1 is given by this expression." "Just to make sure, you know, we are on the same page about what the numerator term is, the numerator can also be written as a sum from" "J equals 1 through N on sort of the distance." "So this is the component wise distance between the vector X and the vector l." "And again for the purpose of these slides I'm ignoring X0." "So just ignoring the intercept term X0, which is always equal to 1." "So, you know, this is how you compute the kernel with similarity between X and a landmark." "So let's see what this function does." "Suppose X is close to one of the landmarks." "Then this euclidean distance formula and the numerator will be close to 0, right." "So, that is this term here, the distance was great, the distance using X and 0 will be close to zero, and so" "f1, this is a simple feature, will be approximately E to the minus 0 and then the numerator squared over 2 is equal to squared so that E to the" "0, E to minus 0," "E to 0 is going to be close to one." "And I'll put the approximation symbol here because the distance may not be exactly 0, but if X is closer to landmark this term will be close to 0 and so f1 would be close 1." "Conversely, if X is far from 01 then this first feature f1 will be E to the minus of some large number squared, divided divided by two sigma squared and E to the minus of a large number is going to be close to 0." "So what these features do is they measure how similar X is from one of your landmarks and the feature f is going to be close to one when X is close to your landmark and is going to be 0 or close" "to zero when X is far from your landmark." "Each of these landmarks." "On the previous line, I drew three landmarks, l1, l2,l3." "Each of these landmarks, defines a new feature f1, f2 and f3." "That is, given the the training example X, we can now compute three new features: f1, f2, and f3, given, you know, the three landmarks that I wrote just now." "But first, let's look at this exponentiation function, let's look at this similarity function and plot in some figures and just, you know, understand better what this really looks like." 
"For this example, let's say I have two features X1 and X2." "And let's say my first landmark, l1 is at a location, 3 5." "So and let's say I set sigma squared equals one for now." "If I plot what this feature looks like, what I get is this figure." "So the vertical axis, the height of the surface is the value" "of f1 and down here on the horizontal axis are, if" "I have some training example, and there is x1 and there is x2." "Given a certain training example, the training example here which shows the value of x1 and x2 at a height above the surface, shows the corresponding value of f1 and down below this is the same figure I had showed," "using a quantifiable plot, with x1 on horizontal axis, x2 on horizontal axis and so, this figure on the bottom is just a contour plot of the 3D surface." "You notice that when" "X is equal to 3 5 exactly, then we the f1 takes on the value 1, because that's at the maximum and X moves away as X goes further away then this feature takes on values" "that are close to 0." "And so, this is really a feature, f1 measures, you know, how close X is to the first landmark and if varies between 0 and one depending on how close X is to the first landmark l1." "Now the other was due on this slide is show the effects of varying this parameter sigma squared." "So, sigma squared is the parameter of the" "Gaussian kernel and as you vary it, you get slightly different effects." "Let's set sigma squared to be equal to 0.5 and see what we get." "We set sigma square to 0.5, what you find is that the kernel looks similar, except for the width of the bump becomes narrower." "The contours shrink a bit too." "So if sigma squared equals to 0.5 then as you start from X equals 3" "5 and as you move away, then the feature f1 falls to zero much more rapidly and conversely," "if you has increase since where three in that case and as I move away from, you know I." "So this point here is really I, right, that's l1 is at location 3 5, right." "So it's shown up here." "And if sigma squared is large, then as you move away from l1, the value of the feature falls away much more slowly." "So, given this definition of the features, let's see what source of hypothesis we can learn." "Given the training example X, we are going to compute these features" "f1, f2, f3 and a hypothesis is going to predict one when theta 0 plus theta 1 f1 plus theta 2 f2, and so on is greater than or equal to 0." "For this particular example, let's say that I've already found a learning algorithm and let's say that, you know, somehow I ended up with these values of the parameter." "So if theta 0 equals minus 0.5, theta 1 equals" "1, theta 2 equals 1, and theta 3 equals 0 And what" "I want to do is consider what happens if we have a training example that takes" "has location at this magenta dot, right where I just drew this dot over here." "So let's say I have a training example X, what would my hypothesis predict?" "Well, If I look at this formula." "Because my training example X is close to l1, we have that f1 is going to be close to 1 the because my training example X is far from l2 and l3 I have that, you know, f2 would be close to" "0 and f3 will be close to 0." "So, if I look at that formula, I have theta" "0 plus theta 1 times 1 plus theta 2 times some value." "Not exactly 0, but let's say close to 0." "Then plus theta 3 times something close to 0." "And this is going to be equal to plugging in these values now." "So, that gives minus 0.5 plus 1 times 1 which is 1, and so on." 
"Which is equal to 0.5 which is greater than or equal to 0." "So, at this point, we're going to predict Y equals" "1, because that's greater than or equal to zero." "Now let's take a different point." "Now lets' say I take a different point, I'm going to draw this one in a different color, in cyan say, for a point out there, if that were my training example X, then" "if you make a similar computation, you find that f1, f2," "Ff3 are all going to be close to 0." "And so, we have theta 0 plus theta 1, f1, plus so on and this will be about equal to minus 0.5, because theta" "0 is minus 0.5 and f1, f2, f3 are all zero." "So this will be minus 0.5, this is less than zero." "And so, at this point out there, we're going to predict Y equals zero." "And if you do this yourself for a range of different points, be sure to convince yourself that if you have a training example that's close to L2, say, then at this point we'll also predict Y equals one." "And in fact, what you end up doing is, you know, if you look around this boundary, this space, what we'll find is that for points near l1 and l2 we end up predicting positive." "And for points far away from l1 and l2, that's for points far away from these two landmarks, we end up predicting that the class is equal to 0." "As so, what we end up doing,is that the decision boundary of this hypothesis would end up looking something like this where inside this red decision boundary would predict Y equals" "1 and outside we predict" "Y equals 0." "And so this is how with this definition of the landmarks and of the kernel function." "We can learn pretty complex non-linear decision boundary, like what I just drew where we predict positive when we're close to either one of the two landmarks." "And we predict negative when we're very far away from any of the landmarks." "And so this is part of the idea of kernels of and how we use them with the support vector machine, which is that we define these extra features using landmarks and similarity functions to learn more complex nonlinear classifiers." "So hopefully that gives you a sense of the idea of kernels and how we could use it to define new features for the Support Vector Machine." "But there are a couple of questions that we haven't answered yet." "One is, how do we get these landmarks?" "How do we choose these landmarks?" "And another is, what other similarity functions, if any, can we use other than the one we talked about, which is called the Gaussian kernel." "In the next video we give answers to these questions and put everything together to show how support vector machines with kernels can be a powerful way to learn complex nonlinear functions." "In the last video, we started to talk about the kernels idea and how it can be used to define new features for the support vector machine." "In this video, I'd like to throw in some of the missing details and, also, say a few words about how to use these ideas in practice." "Such as, how they pertain to, for example, the bias variance trade-off in support vector machines." "In the last video, I talked about the process of picking a few landmarks." "You know, l1, l2, l3 and that allowed us to define the similarity function also called the kernel or in this example if you have this similarity function this is a Gaussian kernel." "And that allowed us to build this form of a hypothesis function." "But where do we get these landmarks from?" "Where do we get l1, l2, l3 from?" 
"And it seems, also, that for complex learning problems, maybe we want a lot more landmarks than just three of them that we might choose by hand." "So in practice this is how the landmarks are chosen which is that given the machine learning problem." "We have some data set of some some positive and negative examples." "So, this is the idea here which is that we're gonna take the examples and for every training example that we have, we are just going to call it." "We're just going to put landmarks as exactly the same locations as the training examples." "So if I have one training example if that is x1, well then I'm going to choose this is my first landmark to be at xactly the same location as my first training example." "And if I have a different training example x2." "Well we're going to set the second landmark to be the location of my second training example." "On the figure on the right, I used red and blue dots just as illustration, the color of this figure, the color of the dots on the figure on the right is not significant." "But what I'm going to end up with using this method is I'm going to end up with m landmarks of l1, l2" "down to I(m) if I have m training examples with one landmark per location of my per location of each of my training examples." "And this is nice because it is saying that my features are basically going to measure how close an example is to one of the things I saw in my training set." "So, just to write this outline a little more concretely, given m training examples, I'm going to choose the the location of my landmarks to be exactly near the locations of my m training examples." "When you are given example x, and in this example x can be something in the training set, it can be something in the cross validation set, or it can be something in the test set." "Given an example x we are going to compute, you know, these features as so f1, f2, and so on." "Where l1 is actually equal to x1 and so on." "And these then give me a feature vector." "So let me write f as the feature vector." "I'm going to take these f1, f2 and so on, and just group them into feature vector." "Take those down to fm." "And, you know, just by convention." "If we want, we can add an extra feature f0, which is always equal to 1." "So this plays a role similar to what we had previously." "For x0, which was our interceptor." "So, for example, if we have a training example x(i), y(i)," "the features we would compute for this training example will be as follows: given x(i), we will then map it to, you know, f1(i)." "Which is the similarity." "I'm going to abbreviate as SIM instead of writing out the whole word" "similarity, right?" "And f2(i) equals the similarity between x(i) and l2, and so on, down to fm(i) equals" "the similarity between x(i) and I(m)." "And somewhere in the middle." "Somewhere in this list, you know, at the i-th component, I will actually have one feature component which is f subscript i(i), which is going to be the similarity" "between x and I(i)." "Where I(i) is equal to x(i), and so you know fi(i) is just going to be the similarity between x and itself." "And if you're using the Gaussian kernel this is actually e to the minus 0 over 2 sigma squared and so, this will be equal to 1 and that's okay." "So one of my features for this training example is going to be equal to 1." "And then similar to what I have above." "I can take all of these m features and group them into a feature vector." 
"So instead of representing my example, using, you know, x(i) which is this what" "R(n) plus R(n) one dimensional vector." "Depending on whether you can set terms, is either R(n) or R(n) plus 1." "We can now instead represent my training example using this feature vector f." "I am going to write this f superscript i." "Which is going to be taking all of these things and stacking them into a vector." "So, f1(i) down to fm(i) and if you want and well, usually we'll also add this f0(i), where f0(i) is equal to 1." "And so this vector here gives me my new feature vector with which to represent my training example." "So given these kernels and similarity functions, here's how we use a simple vector machine." "If you already have a learning set of parameters theta, then if you given a value of x and you want to make a prediction." "What we do is we compute the features f, which is now an R(m) plus 1 dimensional feature vector." "And we have m here because we have m training examples and thus m landmarks and what we do is we predict" "1 if theta transpose f is greater than or equal to 0." "Right." "So, if theta transpose f, of course, that's just equal to theta 0, f0 plus theta 1, f1 plus dot dot dot, plus theta m f(m)." "And so my parameter vector theta is also now going to be an m plus 1 dimensional vector." "And we have m here because where the number of landmarks is equal to the training set size." "So m was the training set size and now, the parameter vector theta is going to be m plus one dimensional." "So that's how you make a prediction if you already have a setting for the parameter's theta." "How do you get the parameter's theta?" "Well you do that using the" "SVM learning algorithm, and specifically what you do is you would solve this minimization problem." "You've minimized the parameter's theta of C times this cost function which we had before." "Only now, instead of looking there instead of making predictions using theta transpose x(i) using our original features, x(i)." "Instead we've taken the features x(i) and replace them with a new features" "so we are using theta transpose f(i) to make a prediction on the i'f training examples and we see that, you know, in both places here and it's by solving this minimization problem that you get the parameters for your Support Vector Machine." "And one last detail is because this optimization problem we really have n equals m features." "That is here." "The number of features we have." "Really, the effective number of features we have is dimension of f." "So that n is actually going to be equal to m." "So, if you want to, you can think of this as a sum, this really is a sum from j equals 1 through m." "And then one way to think about this, is you can think of it as n being equal to m, because if f isn't a new feature, then we have m plus 1 features, with the plus 1 coming from the interceptor." "And here, we still do sum from j equal 1 through n, because similar to our earlier videos on regularization, we still do not regularize the parameter theta zero, which is why this is a sum for" "j equals 1 through m instead of j equals zero though m." "So that's the support vector machine learning algorithm." "That's one sort of, mathematical detail aside that I should mention, which is that in the way the support vector machine is implemented, this last term is actually done a little bit differently." 
"So you don't really need to know about this last detail in order to use support vector machines, and in fact the equations that are written down here should give you all the intuitions that should need." "But in the way the support vector machine is implemented, you know, that term, the sum of j of theta j squared right?" "Another way to write this is this can be written as theta transpose theta if we ignore the parameter theta 0." "So theta 1 down to theta m." "Ignoring theta 0." "Then this sum of j of theta j squared that this can also be written theta transpose theta." "And what most support vector machine implementations do is actually replace this theta transpose theta, will instead, theta transpose times some matrix inside, that depends on the kernel you use, times theta." "And so this gives us a slightly different distance metric." "We'll use a slightly different measure instead of minimizing exactly" "the norm of theta squared means that minimize something slightly similar to it." "That's like a rescale version of the parameter vector theta that depends on the kernel." "But this is kind of a mathematical detail." "That allows the support vector machine software to run much more efficiently." "And the reason the support vector machine does this is with this modification." "It allows it to scale to much bigger training sets." "Because for example, if you have a training set with 10,000 training examples." "Then, you know, the way we define landmarks, we end up with 10,000 landmarks." "And so theta becomes 10,000 dimensional." "And maybe that works, but when m becomes really, really big then solving for all of these parameters, you know, if m were" "50,000 or a 100,000 then solving for all of these parameters can become expensive for the support vector machine optimization software, thus solving the minimization problem that I drew here." "So kind of as mathematical detail, which again you really don't need to know about." "It actually modifies that last term a little bit to optimize something slightly different than just minimizing the norm squared of theta squared, of theta." "But if you want, you can feel free to think of this as an kind of a n implementational detail that does change the objective a bit, but is done primarily for reasons of computational efficiency, so usually you don't really have to worry about this." "And by the way, in case your wondering why we don't apply the kernel's idea to other algorithms as well like logistic regression, it turns out that if you want, you can actually apply the kernel's" "idea and define the source of features using landmarks and so on for logistic regression." "But the computational tricks that apply for support vector machines don't generalize well to other algorithms like logistic regression." "And so, using kernels with logistic regression is going too very slow, whereas, because of computational tricks, like that embodied and how it modifies this and the details of how the support vector machine software is implemented, support vector machines and" "kernels tend go particularly well together." "Whereas, logistic regression and kernels, you know, you can do it, but this would run very slowly." "And it won't be able to take advantage of advanced optimization techniques that people have figured out for the particular case of running a support vector machine with a kernel." "But all this pertains only to how you actually implement software to minimize the cost function." 
"I will say more about that in the next video, but you really don't need to know about how to write software to minimize this cost function because you can find very good off the shelf software for doing so." "And just as, you know, I wouldn't recommend writing code to invert a matrix or to compute a square root, I actually do not recommend writing software to minimize this cost function yourself, but instead to use off" "the shelf software packages that people have developed and so those software packages already embody these numerical optimization tricks," "so you don't really have to worry about them." "But one other thing that is worth knowing about is when you're applying a support vector machine, how do you choose the parameters of the support vector machine?" "And the last thing I want to do in this video is say a little word about the bias and variance trade offs when using a support vector machine." "When using an SVM, one of the things you need to choose is the parameter C which was in the optimization objective, and you recall that C played a role similar to 1 over lambda, where lambda was the regularization" "parameter we had for logistic regression." "So, if you have a large value of C, this corresponds to what we have back in logistic regression, of a small value of lambda meaning of not using much regularization." "And if you do that, you tend to have a hypothesis with lower bias and higher variance." "Whereas if you use a smaller value of C then this corresponds to when we are using logistic regression with a large value of lambda and that corresponds to a hypothesis with higher bias and lower variance." "And so, hypothesis with large" "C has a higher variance, and is more prone to overfitting, whereas hypothesis with small C has higher bias and is thus more prone to underfitting." "So this parameter C is one of the parameters we need to choose." "The other one is the parameter sigma squared, which appeared in the Gaussian kernel." "So if the Gaussian kernel sigma squared is large, then in the similarity function, which was this you know E to the minus x minus landmark" "varies squared over 2 sigma squared." "In this one of the example;" "If I have only one feature, x1, if" "I have a landmark there at that location, if sigma squared is large, then, you know, the" "Gaussian kernel would tend to fall off relatively slowly" "and so this would be my feature f(i), and so this would be smoother function that varies more smoothly, and so this will give you a hypothesis with higher bias and lower variance, because the Gaussian kernel that falls off smoothly," "you tend to get a hypothesis that varies slowly, or varies smoothly as you change the input x." "Whereas in contrast, if sigma squared was small and if that's my landmark given my 1 feature x1, you know, my Gaussian kernel, my similarity function, will vary more abruptly." "And in both cases I'd pick out 1, and so if sigma squared is small, then my features vary less smoothly." "So if it's just higher slopes or higher derivatives here." "And using this, you end up fitting hypotheses of lower bias and you can have higher variance." "And if you look at this week's points exercise, you actually get to play around with some of these ideas yourself and see these effects yourself." "So, that was the support vector machine with kernels algorithm." "And hopefully this discussion of bias and variance will give you some sense of how you can expect this algorithm to behave as well." 
"So far we've been talking about" "SVMs in a fairly abstract level." "In this video I'd like to talk about what you actually need to do in order to run or to use an SVM." "The support vector machine algorithm poses a particular optimization problem." "But as I briefly mentioned in an earlier video, I really do not recommend writing your own software to solve for the parameter's theta yourself." "So just as today, very few of us, or maybe almost essentially none of us would think of writing code ourselves to invert a matrix or take a square root of a number, and so on." "We just, you know, call some library function to do that." "In the same way, the software for solving the SVM optimization problem is very complex, and there have been researchers that have been doing essentially numerical optimization research for many years." "So you come up with good software libraries and good software packages to do this." "And then strongly recommend just using one of the highly optimized software libraries rather than trying to implement something yourself." "And there are lots of good software libraries out there." "The two that I happen to use the most often are the linear SVM but there are really lots of good software libraries for doing this that you know, you can link to many of the major programming languages that you" "may be using to code up learning algorithm." "Even though you shouldn't be writing your own SVM optimization software, there are a few things you need to do, though." "First is to come up with with some choice of the parameter's C. We talked a little bit of the bias/variance properties of this in the earlier video." "Second, you also need to choose the kernel or the similarity function that you want to use." "So one choice might be if we decide not to use any kernel." "And the idea of no kernel is also called a linear kernel." "So if someone says, I use an SVM with a linear kernel, what that means is you know, they use an SVM without using without using a kernel and it was a version of the SVM that just uses theta transpose X, right," "that predicts 1 theta 0 plus theta 1 X1 plus so on plus theta" "N, X N is greater than equals 0." "This term linear kernel, you can think of this as you know this is the version of the SVM" "that just gives you a standard linear classifier." "So that would be one reasonable choice for some problems, and you know, there would be many software libraries, like linear, was one example, out of many, one example of a software library that can train an SVM" "without using a kernel, also called a linear kernel." "So, why would you want to do this?" "If you have a large number of features, if N is large, and M the number of training examples is small, then you know you have a huge number of features that if X, this is an X is an Rn, Rn +1." "So if you have a huge number of features already, with a small training set, you know, maybe you want to just fit a linear decision boundary and not try to fit a very complicated nonlinear function, because might not have enough data." "And you might risk overfitting, if you're trying to fit a very complicated function in a very high dimensional feature space, but if your training set sample is small." "So this would be one reasonable setting where you might decide to just not use a kernel, or equivalents to use what's called a linear kernel." "A second choice for the kernel that you might make, is this Gaussian kernel, and this is what we had previously." 
"And if you do this, then the other choice you need to make is to choose this parameter sigma squared when we also talk a little bit about the bias variance tradeoffs" "of how, if sigma squared is large, then you tend to have a higher bias, lower variance classifier, but if sigma squared is small, then you have a higher variance, lower bias classifier." "So when would you choose a Gaussian kernel?" "Well, if your omission of features X, I mean" "Rn, and if N is small, and, ideally, you know," "if n is large, right, so that's if, you know, we have say, a two-dimensional training set, like the example I drew earlier." "So n is equal to 2, but we have a pretty large training set." "So, you know, I've drawn in a fairly large number of training examples, then maybe you want to use a kernel to fit a more complex nonlinear decision boundary, and the Gaussian kernel would be a fine way to do this." "I'll say more towards the end of the video, a little bit more about when you might choose a linear kernel, a Gaussian kernel and so on." "But if concretely, if you decide to use a Gaussian kernel, then here's what you need to do." "Depending on what support vector machine software package you use, it may ask you to implement a kernel function, or to implement the similarity function." "So if you're using an octave or MATLAB implementation of an SVM, it may ask you to provide a function to compute a particular feature of the kernel." "So this is really computing f subscript i for one particular value of i, where f here is just a single real number, so maybe" "I should move this better written f(i), but what you need to do is to write a kernel function that takes this input, you know," "a training example or a test example whatever it takes in some vector X and takes as input one of the landmarks and but only I've come down X1 and" "X2 here, because the landmarks are really training examples as well." "But what you need to do is write software that takes this input, you know, X1, X2 and computes this sort of similarity function between them and return a real number." "And so what some support vector machine packages do is expect you to provide this kernel function that take this input you know, X1, X2 and returns a real number." "And then it will take it from there and it will automatically generate all the features, and so automatically take X and map it to f1, f2, down to f(m) using this function that you write, and" "generate all the features and train the support vector machine from there." "But sometimes you do need to provide this function yourself." "Other if you are using the Gaussian kernel, some SVM implementations will also include the Gaussian kernel and a few other kernels as well, since the Gaussian kernel is probably the most common kernel." "Gaussian and linear kernels are really the two most popular kernels by far." "Just one implementational note." "If you have features of very different scales, it is important" "to perform feature scaling before using the Gaussian kernel." "And here's why." "If you imagine the computing the norm between X and I, right, so this term here, and the numerator term over there." "What this is doing, the norm between X and I, that's really saying, you know, let's compute the vector" "V, which is equal to" "X minus l." "And then let's compute the norm does vector V, which is the difference between X. So the norm of V is really" "equal to V1 squared plus V2 squared plus dot dot dot, plus Vn squared." 
"Because here X is in" "Rn, or Rn plus 1, but I'm going to ignore, you know, X0." "So, let's pretend X is an Rn, square on the left side is what makes this correct." "So this is equal to that, right?" "And so written differently, this is going to be X1 minus l1 squared, plus x2 minus l2 squared, plus dot dot dot plus Xn minus ln squared." "And now if your features take on very different ranges of value." "So take a housing prediction, for example, if your data is some data about houses." "And if X is in the range of thousands of square feet, for the first feature, X1." "But if your second feature, X2 is the number of bedrooms." "So if this is in the range of one to five bedrooms, then" "X1 minus l1 is going to be huge." "This could be like a thousand squared, whereas X2 minus l2 is going to be much smaller and if that's the case, then in this term," "those distances will be almost essentially dominated by the sizes of the houses" "and the number of bathrooms would be largely ignored." "As so as, to avoid this in order to make a machine work well, do perform future scaling." "And that will sure that the SVM gives, you know, comparable amount of attention to all of your different features, and not just to in this example to size of houses were big movement here the features." "When you try a support vector machines chances are by far the two most common kernels you use will be the linear kernel, meaning no kernel, or the Gaussian kernel that we talked about." "And just one note of warning which is that not all similarity functions you might come up with are valid kernels." "And the Gaussian kernel and the linear kernel and other kernels that you sometimes others will use, all of them need to satisfy a technical condition." "It's called Mercer's Theorem and the reason you need to this is because support vector machine algorithms or implementations of the" "SVM have lots of clever numerical optimization tricks." "In order to solve for the parameter's theta efficiently and in the original design envisaged, those are decision made to restrict our attention only to kernels that satisfy this technical condition called Mercer's Theorem." "And what that does is, that makes sure that all of these" "SVM packages, all of these SVM software packages can use the large class of optimizations and get the parameter theta very quickly." "So, what most people end up doing is using either the linear or Gaussian kernel, but there are a few other kernels that also satisfy Mercer's theorem and that you may run across other people using, although I personally" "end up using other kernels you know, very, very rarely, if at all." "Just to mention some of the other kernels that you may run across." "One is the polynomial kernel." "And for that the similarity between" "X and I is defined as, there are a lot of options, you can take X transpose I squared." "So, here's one measure of how similar X and I are." "If X and I are very close with each other, then the inner product will tend to be large." "And so, you know, this is a slightly unusual kernel." "That is not used that often, but you may run across some people using it." "This is one version of a polynomial kernel." "Another is X transpose I cubed." "These are all examples of the polynomial kernel." "X transpose I plus 1 cubed." "X transpose I plus maybe a number different then one 5 and, you know, to the power of 4 and" "so the polynomial kernel actually has two parameters." "One is, what number do you add over here?" "It could be 0." 
"This is really plus 0 over there, as well as what's the degree of the polynomial over there." "So the degree power and these numbers." "And the more general form of the polynomial kernel is X transpose I, plus some constant and then to some degree in the" "X1 and so both of these are parameters for the polynomial kernel." "So the polynomial kernel almost always or usually performs worse." "And the Gaussian kernel does not use that much, but this is just something that you may run across." "Usually it is used only for data where X and I are all strictly non negative, and so that ensures that these inner products are never negative." "And this captures the intuition that" "X and I are very similar to each other, then maybe the inter product between them will be large." "They have some other properties as well but people tend not to use it much." "And then, depending on what you're doing, there are other, sort of more" "esoteric kernels as well, that you may come across." "You know, there's a string kernel, this is sometimes used if your input data is text strings or other types of strings." "There are things like the chi-square kernel, the histogram intersection kernel, and so on." "There are sort of more esoteric kernels that you can use to measure similarity between different objects." "So for example, if you're trying to do some sort of text classification problem, where the input x is a string then maybe we want to find the similarity between two strings using the string kernel, but I" "personally you know end up very rarely, if at all, using these more esoteric kernels." "I think I might have use the chi-square kernel, may be once in my life and the histogram kernel, may be once or twice in my life." "I've actually never used the string kernel myself." "But in case you've run across this in other applications." "You know, if you do a quick web search we do a quick Google search or quick Bing search you should have found definitions that these are the kernels as well." "So" "just two last details I want to talk about in this video." "One in multiclass classification." "So, you have four classes or more generally 3 classes output some appropriate decision bounday between your multiple classes." "Most SVM, many SVM packages already have built-in multiclass classification functionality." "So if your using a pattern like that, you just use the both that functionality and that should work fine." "Otherwise, one way to do this is to use the one versus all method that we talked about when we are developing logistic regression." "So what you do is you trade kSVM's if you have k classes, one to distinguish each of the classes from the rest." "And this would give you k parameter vectors, so this will give you, upi lmpw. theta 1, which is trying to distinguish class y equals one from all of the other classes, then you get the second parameter, vector" "theta 2, which is what you get when you, you know, have y equals 2 as the positive class and all the others as negative class and so on up to a parameter vector theta k, which is the parameter vector for" "distinguishing the final class key from anything else, and then lastly, this is exactly the same as the one versus all method we have for logistic regression." "Where we you just predict the class i with the largest theta transpose X. So let's multiclass classification designate." 
"For the more common cases that there is a good chance that whatever software package you use, you know, there will be a reasonable chance that are already have built in multiclass classification functionality, and so you don't need to worry about this result." "Finally, we developed support vector machines starting off with logistic regression and then modifying the cost function a little bit." "The last thing we want to do in this video is, just say a little bit about." "when you will use one of these two algorithms, so let's say n is the number of features and m is the number of training examples." "So, when should we use one algorithm versus the other?" "Well, if n is larger relative to your training set size, so for example," "if you take a business with a number of features this is much larger than m and this might be, for example, if you have a text classification problem, where you know, the dimension of the feature" "vector is I don't know, maybe, 10 thousand." "And if your training set size is maybe 10 you know, maybe, up to 1000." "So, imagine a spam classification problem, where email spam, where you have 10,000 features corresponding to 10,000 words but you have, you know, maybe 10 training examples or maybe up to 1,000 examples." "So if n is large relative to m, then what I would usually do is use logistic regression or use it as the m without a kernel or use it with a linear kernel." "Because, if you have so many features with smaller training sets, you know, a linear function will probably do fine, and you don't have really enough data to fit a very complicated nonlinear function." "Now if is n is small and m is intermediate what I mean by this is n is maybe anywhere from 1 - 1000, 1 would be very small." "But maybe up to 1000 features and if the number of training examples is maybe anywhere from" "10, you know, 10 to maybe up to 10,000 examples." "Maybe up to 50,000 examples." "If m is pretty big like maybe 10,000 but not a million." "Right?" "So if m is an intermediate size then often an SVM with a linear kernel will work well." "We talked about this early as well, with the one concrete example, this would be if you have a two dimensional training set." "So, if n is equal to 2 where you have, you know, drawing in a pretty large number of training examples." "So Gaussian kernel will do a pretty good job separating positive and negative classes." "One third setting that's of interest is if n is small but m is large." "So if n is you know, again maybe 1 to 1000, could be larger." "But if m was, maybe 50,000 and greater to millions." "So, 50,000, a 100,000, million, trillion." "You have very very large training set sizes, right." "So if this is the case, then a SVM of the" "Gaussian Kernel will be somewhat slow to run." "Today's SVM packages, if you're using a Gaussian Kernel, tend to struggle a bit." "If you have, you know, maybe 50 thousands okay, but if you have a million training examples, maybe or even a 100,000 with a massive value of m." "Today's" "SVM packages are very good, but they can still struggle a little bit when you have a massive, massive trainings that size when using a Gaussian Kernel." "So in that case, what I would usually do is try to just manually create have more features and then use logistic regression or an SVM without the Kernel." "And in case you look at this slide and you see logistic regression or SVM without a kernel." "In both of these places, I kind of paired them together." 
"There's a reason for that, is that logistic regression and SVM without the kernel, those are really pretty similar algorithms and, you know, either logistic regression or SVM without a kernel will usually do pretty similar things and give" "pretty similar performance, but depending on your implementational details, one may be more efficient than the other." "But, where one of these algorithms applies, logistic regression where SVM without a kernel, the other one is to likely to work pretty well as well." "But along with the power of the SVM is when you use different kernels to learn complex nonlinear functions." "And this regime, you know, when you have maybe up to 10,000 examples, maybe up to 50,000." "And your number of features, this is reasonably large." "That's a very common regime and maybe that's a regime where a support vector machine with a kernel kernel will shine." "You can do things that are much harder to do that will need logistic regression." "And finally, where do neural networks fit in?" "Well for all of these problems, for all of these different regimes, a well designed neural network is likely to work well as well." "The one disadvantage, or the one reason that might not sometimes use the neural network is that, for some of these problems, the neural network might be slow to train." "But if you have a very good" "SVM implementation package, that could run faster, quite a bit faster than your neural network." "And, although we didn't show this earlier, it turns out that the optimization problem that the" "SVM has is a convex optimization problem and so the good SVM optimization software packages will always find the global minimum or something close to it." "And so for the SVM you don't need to worry about local optima." "In practice local optima aren't a huge problem for neural networks but they all solve, so this is one less thing to worry about if you're using an SVM." "And depending on your problem, the neural network may be slower, especially in this sort of regime than the SVM." "In case the guidelines they gave here, seem a little bit vague and if you're looking at some problems, you know," "the guidelines are a bit vague, I'm still not entirely sure, should I use this algorithm or that algorithm, that's actually okay." "When I face a machine learning problem, you know, sometimes its actually just not clear whether that's the best algorithm to use, but as you saw in the earlier videos, really, you know, the algorithm does" "matter, but what often matters even more is things like, how much data do you have." "And how skilled are you, how good are you at doing error analysis and debugging learning algorithms, figuring out how to design new features and figuring out what other features to give you learning algorithms and so on." "And often those things will matter more than what you are using logistic regression or an SVM." "But having said that, the SVM is still widely perceived as one of the most powerful learning algorithms, and there is this regime of when there's a very effective way to learn complex non linear functions." "And so I actually, together with logistic regressions, neural networks, SVM's, using those to speed learning algorithms you're I think very well positioned to build state of the art you know, machine learning systems for a wide" "region for applications and this is another very powerful tool to have in your arsenal." 
"One that is used all over the place in Silicon Valley, or in industry and in the Academia, to build many high performance machine learning system." "In this video, I'd like to start to talk about clustering." "This will be exciting because this is our first unsupervised learning algorithm where we learn from unlabeled data instead of the label data." "So, what is unsupervised learning?" "I briefly talked about unsupervised learning at the beginning of the class but it's useful to contrast it with supervised learning." "So, here's a typical supervisory problem where we are given a label training sets and the goal is to find the decision boundary that separates the positive label examples and the negative label examples." "The supervised learning problem in this case is given a set of labels to fit a hypothesis to it." "In contrast, in the unsupervised learning problem, we're given data that does not have any labels associated with it." "So we're given data that looks like this, here's a set of points and then no labels." "And so our training set is written just x1, x2 and so on up to x(m) and we don't get any labels y." "And that's why the points plotted up on the figure don't have any labels on them." "So in unsupervised learning, what we do is, we give this sort of unlabeled training set to an algorithm and we just ask the algorithm: find some structure in the data for us." "Given this data set, one type of structure we might have an algorithm find, is that it looks like this data set has points grouped into two separate clusters and so an algorithm that finds that clusters like the ones I just" "circled, is called a clustering algorithm." "And this will be our first type of unsupervised learning, although there will be other types of unsupervised learning algorithms that we'll talk about later that finds other types of structure or other types of patterns in the data other than clusters." "We'll talk about this afterwards, we will talk about clustering." "So what is clustering good for?" "Early in this class I had already mentioned a few applications." "One is market segmentation, where you may have a database of customers and want to group them into different market segments." "So, you can sell to them separately or serve your different market segments better." "Social network analysis, there are actually, you know, groups that have done this, things like looking at a group of people, social networks, so things like Facebook, Google plus or maybe information about who" "are the people that you email the most frequently and who are the people that they email the most frequently, and to find coherent groups of people." "So, this would be another maybe clustering algorithm where, you know, you'd want to find who other coherent groups of friends in a social network." "Here's something that one of my friends actually worked on, which is, use clustering to organize compute clusters or to organize data centers better because, if you know which computers in the data center are in the cluster there tend to work together." "You can use that to reorganize your resources and how you lay out its network and how design your data center and communications." "And lastly something that, actually another thing I worked on, using clustering algorithms to understand galaxy formation and using that to understand how, to understand astronomical detail." "So, that's clustering which is our first example of an unsupervised learning algorithm." 
"In the next video, we'll start to talk about a specific clustering algorithm." "In the clustering problem we are given an unlabeled data set and we would like to have an algorithm automatically group the data into coherent subsets or into coherent clusters for us." "The K Means algorithm is by far the most popular, by far the most widely used clustering algorithm, and in this video I would like to tell you what the K Means Algorithm is and how it works." "The K means clustering algorithm is best illustrated in pictures." "Let's say I want to take an unlabeled data set like the one shown here, and I want to group the data into two clusters." "If I run the K Means clustering algorithm, here is what" "I'm going to do." "The first step is to randomly initialize two points, called the cluster centroids." "So, these two crosses here, these are called the Cluster Centroids" "and I have two of them because I want to group my data into two clusters." "K Means is an iterative algorithm and it does two things." "First is a cluster assignment step, and second is a move centroid step." "So, let me tell you what those things mean." "The first of the two steps in the loop of K means, is this cluster assignment step." "What that means is that, it's going through each of the examples, each of these green dots shown here and depending on whether it's closer to the red cluster centroid or the blue cluster centroid, it is going" "to assign each of the data points to one of the two cluster centroids." "Specifically, what I mean by that, is to go through your data set and color each of the points either red or blue, depending on whether it is closer to the red cluster centroid or the blue cluster centroid, and I've done that in this diagram here." "So, that was the cluster assignment step." "The other part of K means, in the loop of K means, is the move centroid step, and what we are going to do is, we are going to take the two cluster centroids, that is, the red cross and" "the blue cross, and we are going to move them to the average of the points colored the same colour." "So what we are going to do is look at all the red points and compute the average, really the mean of the location of all the red points, and we are going to move the red cluster centroid there." "And the same things for the blue cluster centroid, look at all the blue dots and compute their mean, and then move the blue cluster centroid there." "So, let me do that now." "We're going to move the cluster centroids as follows and I've now moved them to their new means." "The red one moved like that and the blue one moved like that and the red one moved like that." "And then we go back to another cluster assignment step, so we're again going to look at all of my unlabeled examples and depending on whether it's closer the red or the blue cluster centroid," "I'm going to color them either red or blue." "I'm going to assign each point to one of the two cluster centroids, so let me do that now." "And so the colors of some of the points just changed." "And then I'm going to do another move centroid step." "So I'm going to compute the average of all the blue points, compute the average of all the red points and move my cluster centroids like this, and so, let's do that again." "Let me do one more cluster assignment step." "So colour each point red or blue, based on what it's closer to and then do another move centroid step and we're done." 
"And in fact if you keep running additional iterations of" "K means from here the cluster centroids will not change any further and the colours of the points will not change any further." "And so, this is the, at this point," "K means has converged and it's done a pretty good job finding" "the two clusters in this data." "Let's write out the K means algorithm more formally." "The K means algorithm takes two inputs." "One is a parameter K, which is the number of clusters you want to find in the data." "I'll later say how we might go about trying to choose k, but for now let's just say that we've decided we want a certain number of clusters and we're going to tell the algorithm how many" "clusters we think there are in the data set." "And then K means also takes as input this sort of unlabeled training set of just the Xs and because this is unsupervised learning, we don't have the labels Y anymore." "And for unsupervised learning of the K means I'm going to use the convention that XI is an RN dimensional vector." "And that's why my training examples are now N dimensional rather N plus one dimensional vectors." "This is what the K means algorithm does." "The first step is that it randomly initializes k cluster centroids which we will call mu 1, mu 2, up to mu k." "And so in the earlier diagram, the cluster centroids corresponded to the location of the red cross and the location of the blue cross." "So there we had two cluster centroids, so maybe the red cross was mu 1 and the blue cross was mu" "2, and more generally we would have k cluster centroids rather than just 2." "Then the inner loop of k means does the following, we're going to repeatedly do the following." "First for each of my training examples, I'm going to set this variable CI to be the index 1 through" "K of the cluster centroid closest to XI." "So this was my cluster assignment step, where we took each of my examples and coloured it either red or blue, depending on which cluster centroid it was closest to." "So CI is going to be a number from 1 to" "K that tells us, you know, is it closer to the red cross or is it closer to the blue cross," "and another way of writing this is I'm going to, to compute Ci, I'm going to take my Ith example Xi and and" "I'm going to measure it's distance to each of my cluster centroids, this is mu and then lower-case k, right, so capital K is the total number centroids and I'm going to use lower case k here" "to index into the different centroids." "But so, Ci is going to, I'm going to minimize over my values of k and find the value of K that minimizes this distance between Xi and the cluster centroid, and then, you know, the" "value of k that minimizes this, that's what gets set in" "Ci." "So, here's another way of writing out what Ci is." "If I write the norm between" "Xi minus Mu-k, then this is the distance between my ith training example" "Xi and the cluster centroid" "Mu subscript K, this is--this here, that's a lowercase K. So uppercase" "K is going to be used to denote the total number of cluster centroids, and this lowercase K's a number between one and capital K. I'm just using lower case K to index into my different cluster centroids." "Next is lower case k." 
"So that's the distance between the example and the cluster centroid and so what I'm going to do is find the value of K, of lower case k that minimizes this, and so the value of k that minimizes you know," "that's what I'm going to set as Ci, and by convention here I've written the distance between Xi and the cluster centroid, by convention people actually tend to write this as the squared distance." "So we think of Ci as picking the cluster centroid with the smallest squared distance to my training example Xi." "But of course minimizing squared distance, and minimizing distance that should give you the same value of Ci, but we usually put in the square there, just as the convention that people use for K means." "So that was the cluster assignment step." "The other in the loop of K means does the move centroid step." "And what that does is for each of my cluster centroids, so for lower case k equals 1 through" "K, it sets Mu-k equals to the average of the points assigned to cluster." "So as a concrete example, let's say that one of my cluster centroids, let's say cluster centroid two, has training examples, you know, 1, 5, 6, and 10 assigned to it." "And what this means is, really this means that C1 equals" "to C5 equals to" "C6 equals to and similarly well c10 equals, too, right?" "If we got that from the cluster assignment step, then that means examples 1,5,6 and" "10 were assigned to the cluster centroid two." "Then in this move centroid step, what I'm going to do is just compute the average of these four things." "So X1 plus X5 plus X6 plus X10." "And now I'm going to average them so here I have four points assigned to this cluster centroid, just take one quarter of that." "And now Mu2 is going to be an n-dimensional vector." "Because each of these example x1, x5, x6, x10" "each of them were an n-dimensional vector, and I'm going to add up these things and, you know, divide by four because I have four points assigned to this cluster centroid, I end up with my move centroid step," "for my cluster centroid mu-2." "This has the effect of moving mu-2 to the average of the four points listed here." "One thing that I've asked is, well here we said, let's let mu-k be the average of the points assigned to the cluster." "But what if there is a cluster centroid no points with zero points assigned to it." "In that case the more common thing to do is to just eliminate that cluster centroid." "And if you do that, you end up with K minus one clusters" "instead of k clusters." "Sometimes if you really need k clusters, then the other thing you can do if you have a cluster centroid with no points assigned to it is you can just randomly reinitialize that cluster centroid, but it's more" "common to just eliminate a cluster if somewhere during" "K means it with no points assigned to that cluster centroid, and that can happen, altthough in practice it happens not that often." "So that's the K means Algorithm." "Before wrapping up this video" "I just want to tell you about one other common application of K Means and that's to the problems with non well separated clusters." "Here's what I mean." "So far we've been picturing K Means and applying it to data sets like that shown here where we have three pretty well separated clusters, and we'd like an algorithm to find maybe the 3 clusters for us." "But it turns out that very often K Means is also applied to data sets that look like this where there may not be several very well separated clusters." "Here is an example application, to t-shirt sizing." 
"Let's say you are a t-shirt manufacturer you've done is you've gone to the population that you want to sell t-shirts to, and you've collected a number of examples of the height and weight of these people in your" "population and so, well I guess height and weight tend to be positively highlighted so maybe you end up with a data set like this, you know, with a sample or set of examples of different peoples heights and weight." "Let's say you want to size your t shirts." "Let's say I want to design and sell t shirts of three sizes, small, medium and large." "So how big should I make my small one?" "How big should I my medium?" "And how big should I make my large t-shirts." "One way to do that would to be to run my k means clustering logarithm on this data set that I have shown on the right and maybe what" "K Means will do is group all of these points into one cluster and group all of these points into a second cluster and group all of those points into a third cluster." "So, even though the data, you know, before hand it didn't seem like we had 3 well separated clusters, K Means will kind of separate out the data into multiple pluses for you." "And what you can do is then look at this first population of people and look at them and, you know, look at the height and weight, and try to design a small t-shirt so that it kind" "of fits this first population of people well and then design a medium t-shirt and design a large t-shirt." "And this is in fact kind of an example of market segmentation" "where you're using K Means to separate your market into 3 different segments." "So you can design a product separately that is a small, medium, and large t-shirts," "that tries to suit the needs of each of your" "3 separate sub-populations well." "So that's the K Means algorithm." "And by now you should know how to implement the K" "Means Algorithm and kind of get it to work for some problems." "But in the next few videos what I want to do is really get more deeply into the nuts and bolts of K means and to talk a bit about how to actually get this to work really well." "Most of the supervised learning algorithms we've seen, things like linear regression, logistic regression and so on." "All of those algorithms have an optimization objective or some cost function that the algorithm was trying to minimize." "It turns out that K-means also has an optimization objective or a cost function that is trying to minimize." "And in this video, I'd like to tell you what that optimization objective is." "And the reason I want to do so is because this will be useful to us for two purposes." "First, knowing what is the optimization objective of K-means will help us to debug the learning algorithm and just make sure that K-means is running correctly, and second, and perhaps even more importantly, in a later video we'll talk" "about how we can use this to help K-means find better clusters and avoid local optima, but we'll do that in a later video that follows this one." "Just as a quick reminder, while K-means is running we're going to be" "keeping track of two sets of variables." "First is the CI's and that keeps track of the index or the number of the cluster" "to which an example x(i) is currently assigned." "And then, the other set of variables we use as Mu subscript" "K, which is the location of cluster centroid K. And, again, for K-means we use capital K to denote the total number of clusters." 
"And here lower case K, you know, is going to be an index into the cluster centroids, and so lower case k is going to be a number between 1 and capital K. Now, here's one more bit of notation which" "is going to use Mu subscript c(i) to denote the cluster centroid of the cluster to which example x(i) has been assigned and to explain that notation a little bit more, let's say that x(i) has been" "assigned to cluster number five." "What that means is that c(i), that is the index of x(i), that that is equal to 5." "Right?" "Because you know, having c(i) equals 5, that's what it means for the example x(i) to be assigned to cluster number 5." "And so Mu subscript c(i) is going to be equal to Mu subscript" "5 because c(i) is equal to 5." "This Mu substitute c(i) is the cluster centroid of cluster number" "5, which is the cluster to which my example x(i) has been assigned." "With this notation, we're now ready to write out what is the optimization objective of the K Mu clustering algorithm." "And here it is." "The cost function that K-means is minimizing is the function J of all of these parameters c1 through cm, Mu1 through MuK, that" "K-means is varying as the algorithm runs." "And the optimization objective is shown on the right, is the average of one over M of the sum of i equals one through M of this term here" "that I've just drawn the red box around." "The squared distance between each example x(i) and the location of the cluster centroid to which x(i)" "has been assigned." "So let me just draw this in, let me explain this." "Here is the location of training example x(i), and here's the location of the cluster centroid to which example x(i) has been assigned." "So to explain this in pictures, if here is X1, X2." "And if a point here, is my example x(i), so if that is equal to my example x(i)," "and if x(i) has been assigned to some cluster centroid, and" "I'll denote my cluster centroid with a cross." "So if that's the location of, you know, Mu 5, let's say, if x(i) has been assigned to cluster centroid 5 in my example up there." "Then, the squared distance, that's the squared of the distance" "between the point x(i) and this cluster centroid, to which x(i) has been assigned." "And what K-means can be shown to be doing is that, it is trying to find parameters c(i) and Mu(i), try to find cMU to try to minimize this cost function J." "This cost function is sometimes also called the distortion cost function or the distortion of the K-means algorithm." "And, just to provide a little bit more detail, here's the K-means algorithm," "Here's exactly the algorithm as we have it, in the real form the earlier slide." "And what this first step of this algorithm is, this was the cluster assignment step" "where we assign each point to the cluster centroid, and it's possible to show mathematically that what the cluster assignment step is doing is exactly minimizing" "J with respect to the variables, C1, C2 and so on, up to C(m), while holding the closest centroids, MU1 up to" "MUK fixed." "So, what the first assignment step does is you know, it doesn't change the cluster centroids, but what it's doing is, exactly picking the values of C1, C2, up to CM" "that minimizes the cost function or the distortion function," "J. And it's possible to prove that mathematically but, I won't do so here." 
"That has a pretty intuitive meaning of just, yo know, well let's assign these points to the cluster centroid that is closest to it, because that's what minimizes the square of distance between the points and the corresponding cluster centroid." "And then the other part of the second step of K-means, this second step over here." "This second step was the move centroid step and, once again, I won't prove it, but it can be shown mathematically, that what the root centroid step does, is it chooses the values of mu that minimizes J." "So it minimizes the cost function" "J with respect to, where wrt is my abbreviation for with respect to." "But it minimizes J with respect to the locations of the cluster centroids, Mu1 through MuK." "So, what K-means really is doing is it's taking the two sets of variables and partitioning them into two halves right here." "First the C set of variables and then you have the Mu sets of variables." "And what it does is it first minimizes J with respect to the variable C, and then minimizes" "J with respect the variables" "Mu, and then it keeps on iterating." "And so that's all that K-means does." "And, now that we understand" "K-means, let's try to minimize this cost function J. We can also use this to try to debug our learning algorithm and just kind of make sure that our implementation of K-means is running correctly." "So, we now understand the" "K-means algorithm as trying to optimize this cost function J, which is also called the dispulsion function." "We can use that to debug K-means and help me show that K-means is converging, and that it's running properly, and in the next video, we'll also see how we can us this to help K-means find better clusters" "and help K-means to avoid local optima."