May 2008

Sun Mon Tue Wed Thu Fri Sat
        1 2 3
4 5 6 7 8 9 10
11 12 13 14 15 16 17
18 19 20 21 22 23 24
25 26 27 28 29 30 31

Blogroll

Blog powered by TypePad
Member since 07/2003

Netflix Prize: best score almost halfway there

The Netflix Prize has generated a lot of buzz in the machine learning world.  They are giving away $1 million for a 10% improvement in prediction accuracy over the Cinematch algorithm they use.  The data set is just too cool.  It has ~100 million movie ratings on 17,770 movies by 480,189 customers.  How can a geek resist playing with that?  The money is nice, but the data set is the real draw. 

As of the last few days, people are making real progress.  On Monday we were all in awe of a submission that was 1.79% better.  As of today, there have been four better submissions. The best is 4.52% better - so we're already almost half way to the million dollar level. Follow along with the current leader here. The contest has a minimum of 3 months to run, so it won't be over soon, it wouldn't surprise me if we have a qualifying entry before then.

Of course, I can't resist working on something like this.  My computer is crunching away on my solution, which will take a few more days. I'm limited most by free evening time.  I can't wait to see how well my first submission will do. 

I'm the "Craig Schmidt" team, or craigschmidt on the message board. I find it kind of strange that most people are using "team names" rather than their own names.  A big part of why you'd do this if for egoboo, so why be anonymous?

Google's MapReduce algorithm

The New York Times had a good article a while back about Google's hardware and software infrastructure.  It said that one of their big innovations was an algorithm called MapReduce:

One such program, called MapReduce, is based on ideas discussed in computer science literature for decades, according to Urs Hölzle, Google's senior vice president for operations. "What surprised us was how useful it turned out to be in our environment," he said.

MapReduce, he said, "allows Joe Schmo software engineer to process large amounts of data and take advantage of our infrastructure."

Mr. Arnold, the consultant, said these tools created a significant cost advantage. "If you talk to guys who work in massively parallel computing operations, as much as 30 percent of their coding time is spent trying to figure out how to get the thing to run," he said. Google "has figured out how they can reduce a lot of the hassle and work of creating parallel applications."

Mr. Gates acknowledged that MapReduce was a significant technology, but he asserted that Microsoft was building its own parallel processing software, opening another front in the technological war between the two companies.

"They did MapReduce; we have this thing called Dryad that's better," Mr. Gates said. "But they'll do one that's better."

Anything with such a snarky quote from Bill Gates has got to be interesting.  So here's the paper that describes it.  It is an interesting approach to parallel computing, well worth a read.

Minimum Sudoku with 17 givens

Last night on the train ride home, I was thinking about the math of Sudoku. I was wondering what the minimum number of givens or clues (i.e. squares with specified numbers) you have to specify in order to have a unique solution.  It turns out to be an open problem, with the smallest known limit as 17.  According to Gordon Royle there are 35396 solutions with 17 givens.

If you think about it much, it is a really hard problem.  What is a good way to represent it having a "unique solution"?

Is the exponent of matrix multiplication 2?

There's an interesting seminar called "Group-Theoretic Algorithms for Matrix Multiplication" coming up at MIT by Bobby Kleinberg.  He's co-authored a paper (pdf!) to be presented at the Foundations of Computer Science 2005 conference.

If you try to multiply two nxn matrices by the straightforward algorithm, it takes O(n^3) operations. For some time, there has been work on reducing that exponent of 3 down toward 2. Since matrix multiplication is used in lots of places, this is a very important theoretical problem.  In practice, n has to be quite large for one of these fancy algorithms to be faster.  The new algorithms often use Fast Fourier Transforms (FFT's), and have chased the exponent down to 2.376. 

This paper further develops a group theoretic approach with an exponent of 2.41 (the first time this approach yields an exponent below 3). More importantly, they state two conjectures that would imply an exponent of 2.  Now all they have to do is prove one of them :-).  And no, I can't follow the math in the paper, just appreciate the significance.

New Biggest Prime Number

There is a new largest known prime number.  It is the 40th Mersenne prime to be found, which are of the form 2^n -1, with n prime.  The largest known primes are usually of this form. In this case it is 2^20996011 - 1.   

Update: As of September 4, 2006, the 44th Mersenne prime was found, 232,582,657-1. 

Check here to be sure this post is up to date.