Netflix Weirdness

by Kieran Healy on November 23, 2008

There’s an article on the Netflix Prize in the Times today. You know, where Netflix made half of its ratings data available to people and offered a million bucks to anyone who could write a recommendation algorithm that would do some specified percent better than Netflix’s own. What tripped me up was this sentence about one of the more successful teams:

The first major breakthrough came less than a month into the competition. A team named Simon Funk vaulted from nowhere into the No. 4 position, improving upon Cinematch by 3.88 percent in one fell swoop. Its secret was a mathematical technique called singular value decomposition. It isn’t new; mathematicians have used it for years to make sense of prodigious chunks of information. But Netflix never thought to try it on movies.

Can this possibly be true? I’d have thought that just about the most obvious way to look for some kind of structure in data like this would be to do a principal components analysis, and PCA is (more or less) just the SVD of a data matrix. PCA is a quite straightforward technique (evidence for this includes the fact that I know about and use it myself). It’s powerful, but it’s not like it’s some kind of slightly obscure method that isn’t ever applied to data of this kind. And there’s a whole family of related and more sophisticated approaches you could use instead. If you’d asked me about the prize before I read this article, I would naively have said “Well, it’s this effort to get people to help Netflix do better than I guess anyone could using something like bog-standard PCA.”
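To make the PCA/SVD connection concrete, here’s a minimal numpy sketch on a made-up, fully observed ratings matrix (toy data, obviously, not the Netflix set): the principal component scores fall straight out of the SVD of the column-centered matrix.

```python
import numpy as np

# Toy fully observed ratings matrix: 5 viewers x 4 movies (made-up data).
X = np.array([
    [5.0, 4.0, 1.0, 2.0],
    [4.0, 5.0, 2.0, 1.0],
    [1.0, 2.0, 5.0, 4.0],
    [2.0, 1.0, 4.0, 5.0],
    [3.0, 3.0, 3.0, 3.0],
])

# PCA is (more or less) the SVD of the column-centered data matrix.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Principal component scores two equivalent ways: project the data onto the
# right singular vectors (the "loadings"), or scale the left singular
# vectors by the singular values. They are the same matrix.
scores_proj = Xc @ Vt.T
scores_svd = U * s
```

The variance explained by each component is `s**2 / (n - 1)`, which is exactly what a PCA routine would report.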

Maybe the article just got written up in a way that misrepresents the contribution of the team who introduced the method to the data. Or maybe I am misunderstanding something. I guess I should page Cosma and see what he thinks.

{ 28 comments }

1

Trey 11.23.08 at 7:42 pm

I had the exact same reaction when I read that article — “REALLY? No one thought to use some kind of factor analysis on what is essentially survey data?”

2

Bill Gardner 11.23.08 at 8:14 pm

Same here. I wrote to a colleague, with the subject line “wtf?”: in “…the history of the competition, the first big improvement in the modeling occurred when someone took the persons x films ratings matrix and did an SVD. Apparently, this had never occurred to Netflix!” Meaning no one ever took a course in psychometrics? No one ever looked into how PageRank works?

3

Jacob T. Levy 11.23.08 at 8:15 pm

Me too– except of course that I’m an American and so “bog-standard” didn’t enter into my mental phrasing.

4

todd. 11.23.08 at 8:25 pm

It seems most likely to me that SVD occurred to the Netflix people, but that they didn’t know how to do it on that size of a matrix. I think Simon Funk’s principal contribution was knowing of a paper that described a way to handle very large, sparse matrices.

At least, that was the impression I had at the time, but it’s possible that I overestimate the importance (or the obscurity) of the serial update method SF used.

5

JSE 11.23.08 at 8:33 pm

I wrote a feature on the Netflix Prize for Wired earlier this year and learned a bit about this. In a straight-up SVD, you have a bunch of vectors in high-dimensional space (i.e. a bunch of subjects measured on some set of variables) and then you plug and chug. In the Netflix data, you have hundreds of thousands of customers ranking tens of thousands of movies, but only a tiny proportion of the (customer, movie) pairs actually has a ranking attached to it. So instead of having a bunch of vectors, you have a bunch of “partial vectors” 99% of whose coordinates are unknown. So it’s a bit hairy.

I thought James did a nice job here, actually — this is a hard topic to write about for a non-technical audience, and I think he managed to convey a lot of the sense of what’s going on in the contest.

6

Kieran 11.23.08 at 8:42 pm

OK, right … so it’s a very large, sparse matrix, as todd said. That makes it trickier.

7

tom s. 11.23.08 at 9:12 pm

“I thought James did a nice job here, actually” – actually it was Clive Thompson.
I thought it was well written too. But I did find two things to gripe about so that’s what I’ll do.

1. “Traditional video stores depend on hits; just-out-of-the-theaters blockbusters account for 80 percent of what they rent. At Netflix, by contrast, 70 percent of what it sends out is from the backlist — older movies or small, independent ones.” – This comes from a quote by the CEO of Netflix and was debunked by Lee Gomes in the Wall Street Journal a couple of years ago.

2. The thing with all the psychology and the wonderment at our individual quirkiness at the end overlooks one far more mundane point, which is that an account may be shared among several people with different tastes (kids and parents, for example), so the diversity may not just be a result of individual unpredictability.

8

Kieran Healy 11.23.08 at 9:14 pm

Hey Tom, I finally got around to reading your book last weekend. You will not be surprised to hear that I concur with pretty much everyone else who’s read it: it’s really good.

9

tom s. 11.23.08 at 9:25 pm

Thanks! I’m actually still surprised anyone likes it. By the time I finished, everything seemed either too obvious or too full of holes to be worth saying.

10

JSE 11.23.08 at 9:44 pm

Sorry — I hear the name “Clive,” automatically type “James.”

For what it’s worth, Tom, neither the Netflix folks nor the competitors I talked to seemed to think the “multiple users on one account” issue is a substantial obstacle. Not that it’s easy to eliminate — just that its effect is small enough to ignore.

11

Henry 11.23.08 at 10:13 pm

My academic mentor told me, when I was going through a bad patch at the end of my dissertation, that every time he was on the verge of finishing a book, he spent half the time thinking that the only reason no-one had said all this before was because everyone knew it and didn’t think it was worth saying, and the other half thinking that no-one was saying it because it was worthless and wrong. He told me that the good news was that this was pretty universal among authors, and the bad news was that it would hit me again, and again, and again throughout my career. So this is a general experience rather than a specific one…

12

tom s. 11.23.08 at 10:51 pm

JSE – I now recall reading that the customer side of the picture was pretty well determined, and that the bulk of the uncertainty lies with the titles, as Clive Thompson emphasized. Thanks for the correction. Down to one gripe.

Henry – I guess when I wrote my comment I should have thought: “everyone knows this already.”

13

Matthew Kuzma 11.23.08 at 11:13 pm

I also wouldn’t be surprised if the technical team behind Netflix never bothered to do much research and were in fact happy to solve the problem from scratch. Of course, if that were true I don’t know how the company could go so far as to offer a prize for improvement before bothering to look up published solutions to similar problems.

14

John Holbo 11.24.08 at 1:53 am

The article didn’t discuss one thing I’ve wondered about – and seen discussed in other pieces on the prize. Many accounts have several users. So mom is watching “Sex and the City” and dad is watching “Hellboy 2” and the little girls are watching “Barbie Swan Lake” and the account gives them all 5 stars, and now you’ve got this bogus ‘why do Hellboy and Barbie go together?’ non-problem. Has Netflix done anything to correct for this? By, say, subdividing members of an account household?

15

JSE 11.24.08 at 2:31 am

See my #10 above.

16

nitpicking 11.24.08 at 3:27 am

Not to split hairs, but a common factor model (using something like maximum likelihood estimation) would be more appropriate here, because we can imagine that there are unique factors at play in addition to the common ones (i.e., there is variance that is unique to particular movies, beyond the common factors in the solution). PAF or ML would deal with this better.

17

TheDeadlyShoe 11.24.08 at 3:53 am

I really don’t think family accounts are a problem. You can split up queues between accounts on the fly, so there’s no reason for different people to use the same account. And if there’s just one person managing movies for the whole family, I don’t think that person would ask other people what they thought of the movie they watched and then offer up appropriate ratings. That one person would probably just offer their own thoughts in terms of ratings.

18

john holbo 11.24.08 at 4:43 am

Ah, sorry I missed that, JSE. Yes, DeadlyShoe, what you say makes sense, too.

19

Zamfir 11.24.08 at 10:36 am

The guy has a blog where he explains his methods: http://sifter.org/~simon/journal/index.html

I read only a few small bits, but based on those it seems “using SVD” is just a starting point, not the trick itself. Trick one is to use a method that can deal with those large, sparse matrices, by some iterative algorithm that approximates the SVD. This is apparently a straightforward application of existing literature.

The other part is determining what to do with the empty spots, movies that are not reviewed by all people. This is probably where the algorithm is really original, with a lot of heuristic ideas about assumed probability distributions, where a movie’s score is somehow averaged between the actually observed scores and some a priori assumption about the score. I haven’t looked into the details here.
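For flavor, here is a toy sketch (my own, not Funk’s actual code) of the core idea as I understand it: keep a small latent-factor vector per user and per movie, and do gradient descent on the squared error at the observed cells only, ignoring the empty spots entirely.

```python
import numpy as np

rng = np.random.default_rng(0)

# Observed (user, movie, rating) triples; every other cell is missing.
ratings = [(0, 0, 5.0), (0, 1, 4.0), (1, 0, 4.0), (1, 2, 1.0),
           (2, 1, 1.0), (2, 2, 5.0), (3, 0, 5.0), (3, 2, 2.0)]
n_users, n_movies, k = 4, 3, 2

# Small random latent factors: one length-k vector per user and per movie.
P = 0.1 * rng.standard_normal((n_users, k))
Q = 0.1 * rng.standard_normal((n_movies, k))

lr, reg = 0.02, 0.02  # learning rate and regularization strength
for epoch in range(500):
    for u, m, r in ratings:
        err = r - P[u] @ Q[m]          # error at this observed cell only
        pu = P[u].copy()               # update both factors from old values
        P[u] += lr * (err * Q[m] - reg * P[u])
        Q[m] += lr * (err * pu - reg * Q[m])

# Root-mean-square error over the observed cells after training.
rmse = np.sqrt(np.mean([(r - P[u] @ Q[m]) ** 2 for u, m, r in ratings]))
```

A prediction for an unobserved cell is then just `P[u] @ Q[m]`; the regularization term is what keeps those extrapolations from running wild.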

20

Ginger Yellow 11.24.08 at 1:23 pm

” You can split up queues between accounts on the fly, so there’s no reason for different people to use the same account.”

Maybe not, but they do. People are stubborn like that.

21

Kieran Healy 11.24.08 at 2:38 pm

Zamfir – Interesting. I guess I am still amazed that Netflix didn’t do this themselves. Not only are the methods in the existing literature, the software tools are, too. R, for instance, has several packages (like this one and especially this one) for handling operations on large, sparse matrices.

22

Martin James 11.24.08 at 3:05 pm

Kieran,

In the real business world almost nobody looks at the literature, and even fewer know any math. The difference between a good answer and the best answer is very rarely important enough to justify the expense of hiring people who know math.

For example, with the financial crisis, those with sophisticated models and those with no models ended up in pretty much the same place.

23

Zamfir 11.24.08 at 4:43 pm

Kieran, I do not think those standard methods would work on matrices with missing data (which is not the same as having zeros at those places), but I might be wrong. Matlab at least does not seem to have a built-in method for gappy SVD.

From googling around a bit I get the impression that the algorithm of Simon Funk, or at least its application to these kinds of problems, was in fact new.

24

Thomas 11.24.08 at 7:56 pm

It seems there are two problems:

– computationally, you need an SVD for a large matrix. This isn’t trivial: the power algorithm (as in PageRank) gets the eigenvector for the largest eigenvalue, but it gets harder for the other eigenvalues. Note that the R packages that Kieran mentioned don’t do SVD; they solve (penalized) linear systems.

– you also have missing data, which is less standard. With PageRank there is no missing data: every page either links or does not link to every other page. With Netflix you don’t even have a training sample where everyone has rated every movie. The matrix isn’t sparse in the usual linear algebra sense of being mostly zero; it is 99% missing. Funk didn’t really do this with any explicit models, he just minimized the approximation error at the observed values. This isn’t completely novel – multidimensional scaling with missing distances has been done that way – but I haven’t seen it done on this scale.
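On the first point, the power step is simple to sketch: repeatedly applying A^T A to a random vector converges to the leading right singular vector, with each step costing only two matrix-vector products. (This toy version handles only the fully observed case and only the top component; deflation or block methods are needed for the rest.)

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((200, 50))  # a dense stand-in for a ratings matrix

# Power iteration on A^T A: each step is two matrix-vector products, so it
# scales to matrices far too large for a dense SVD routine.
v = rng.standard_normal(50)
v /= np.linalg.norm(v)
for _ in range(300):
    w = A.T @ (A @ v)
    v = w / np.linalg.norm(w)

# ||A v|| now approximates the largest singular value of A.
sigma1 = np.linalg.norm(A @ v)
```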

25

Abby 11.25.08 at 3:54 pm

TheDeadlyShoe wrote: “And if there’s just one person managing movies for the whole family, I don’t think that person would ask other people what they thought of the movie they watched and then offer up appropriate ratings. That one person would probably just offer their own thoughts in terms of ratings.”

But that is exactly what I did with our family’s Netflix account. I rated the movies based on how the primary viewer perceived them so that the recommendation system would suggest more movies that each individual might like. Putting in my own rating of movies I hadn’t even wanted to watch in the first place would have made it harder for me to find more movies for the rest of my family.

26

Scott Martens 11.27.08 at 4:47 pm

I haven’t looked at any of the research, but I’ll bet the next step was to use Latent Dirichlet Allocation, which is very good at topic clustering in documents and would, therefore, be a pretty obvious candidate for the task. Then, just iterate over the standard set of co-clustering techniques. I saw one using a simulated annealing algorithm that did pretty well with topic clustering in newspapers.

Look: stats dorks are pretty thin on the ground, and machine learning junkies are freaks right up there with, say, people who understand the proof of Fermat’s Last Theorem. These are not easy algorithms. It takes either talent, hard work, or both to acquire them, and you can spend your whole life gainfully employed in statistics and not really ever run into SVD or its applications to general learning problems.

27

andrew cooke 11.28.08 at 2:37 am

You can get an intuitive feel for how missing data are a problem by considering how you might implement a naive calculation: calculate the average point; discard that dimension; repeat. Both the calculation of the average and the reduction in dimension become non-linear processes when you have missing data (the missing data have to be ignored; they cannot be simply treated as zero). That would strongly suggest that traditional efficient solutions based on linear matrix operations will not work.
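A two-line numpy illustration of that first step going wrong (toy numbers, mine):

```python
import numpy as np

# One movie's ratings across five customers, two of them missing.
r = np.array([5.0, np.nan, 4.0, np.nan, 3.0])

mean_ignoring_missing = np.nanmean(r)            # 4.0, the honest average
mean_treating_as_zero = np.nan_to_num(r).mean()  # 2.4, badly biased downward
```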

Which is what various people have said above, really.

28

Kaleberg 11.28.08 at 3:17 am

My impression is that SVD is just a starting point. There is lots of interesting stuff you can do once you have the basic decomposition. For example, you can do optimal rotations with things like the varimax algorithm. You can also do subspace selection, which, when combined with rotation, can help pick out the signal from the noise. Of course, I’ve only used SVD for remote sensing analysis, so what the hell do I know other than I’ve yet to buy something Amazon has recommended for me.

Comments on this entry are closed.