Can real-world programming problems be solved by cutting-edge AI? In February, DeepMind tackled this question, confronting the world with a new perspective on programming and the capabilities and limitations of artificial intelligence.
But equally interesting are the lessons the researchers learned along the way - about what can and can't be automated, and about the errors in our current data sets.
And while the AI-generated solutions were no better than those of human programmers, they have already raised some questions about what this means for the future.
A promising new competitor
Based in London, DeepMind, the AI subsidiary of Alphabet, Google's parent company, has already achieved historic milestones, outperforming humans in chess and Go, and proving better at predicting how proteins fold.
This month, DeepMind announced that it has also developed a system called AlphaCode to compete in programming contests, evaluating its performance in 10 different programming contests hosted by the competitive programming site CodeForces - each with at least 5,000 different entrants.
The results? AlphaCode "ranked about the same as the median competitor," reports a DeepMind blog post, "marking the first time an AI code generation system has achieved a competitive level of performance in programming contests."
DeepMind noted that real-world companies use these competitions for recruiting - and present similar problems to job applicants in coding interviews.
In the blog post, Mike Mirzayanov, founder of CodeForces, is quoted as saying that AlphaCode's results exceeded his expectations. He added: "I was skeptical, because even in simple competitive problems, it is often necessary not only to implement the algorithm, but also (and this is the hardest part) to invent it."
"AlphaCode has managed to rise to the level of a promising new competitor. I can't wait to see what's in store!"
A paper by the DeepMind researchers acknowledges that it took an enormous amount of computing power. One petaFLOPS is a whopping 1,000,000,000,000,000 floating-point operations per second. A petaflop day maintains this rate for every second of a 24-hour day, for a total of about 86,400,000,000,000,000,000 operations.
"Sampling and training our model required hundreds of petaflop days."
A footnote adds that the Google data centers that perform these operations "purchase renewable energy equal to the amount consumed."
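As a quick check on the scale involved, the petaflop-day arithmetic can be worked out in a few lines of Python. This is only an illustrative sketch: the function name is made up for this example, and the figure of 100 petaflop days is a hypothetical stand-in for "hundreds", not a number from the paper.

```python
# One petaFLOPS is 10**15 floating-point operations per second.
PETAFLOPS = 10**15
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def petaflop_days_to_operations(days):
    """Total floating-point operations performed in `days` petaflop days."""
    return days * PETAFLOPS * SECONDS_PER_DAY


# A single petaflop day.
print(f"{petaflop_days_to_operations(1):,}")

# Hypothetical: 100 petaflop days, standing in for "hundreds".
print(f"{petaflop_days_to_operations(100):,}")
```

The first line printed is the operation count for one petaflop day; the second simply scales it by 100.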
How AlphaCode works
The researchers explain their results in a 73-page paper (not yet published or peer-reviewed). The authors write that their system was first "pre-trained" on code in public GitHub repositories, much like the AI-based code suggestion tool Copilot. (To avoid some of the controversies that have arisen around Copilot's methodology, AlphaCode filtered the datasets it trained on, selecting code that was released under permissive licenses.)
The researchers then "fine-tuned" their system on a small dataset of competitive programming problems, solutions, and even test cases, many of which were pulled directly from the CodeForces platform.
One thing they discovered? There's a problem with the currently available datasets of programming competition problems and solutions. At least 30% of these programs pass all the tests, but are not actually correct.
So the researchers created a dataset with a larger number of test cases to rigorously check the accuracy of the programs. They believe this significantly reduces the number of incorrect programs that still pass all the tests - from 30% to just 4%.
When the time finally comes to compete on the programming challenges, "we create a massive amount of C++ and Python programs for each problem," the DeepMind blog post states. "Then, we filter, group and re-rank these solutions into a small set of 10 candidate programs that we submit for external evaluation."
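The filter, group and re-rank idea can be illustrated with a toy Python sketch. Everything here is an assumption made for illustration: the function names, the use of plain Python callables as "candidate programs", and the specific rules (filter on the problem's example test, group by identical outputs on extra probe inputs, rank groups by size). It is not DeepMind's implementation.

```python
from collections import defaultdict

_FAILED = object()  # sentinel for a candidate that crashed


def safe_run(candidate, value):
    """Run a candidate on one input, treating any exception as a failure."""
    try:
        return candidate(value)
    except Exception:
        return _FAILED


def select_candidates(candidates, example_tests, probe_inputs, limit=10):
    # Filter: keep only candidates that reproduce every example output.
    survivors = [
        c for c in candidates
        if all(safe_run(c, x) == y for x, y in example_tests)
    ]
    # Group: candidates with identical outputs on the probe inputs
    # are assumed to implement the same behavior.
    clusters = defaultdict(list)
    for c in survivors:
        signature = tuple(safe_run(c, x) for x in probe_inputs)
        clusters[signature].append(c)
    # Re-rank: largest groups first, one representative per group.
    ranked = sorted(clusters.values(), key=len, reverse=True)
    return [group[0] for group in ranked[:limit]]


# Toy problem: "double the input". The only example test is 2 -> 4.
def double(x): return 2 * x
def add_self(x): return x + x
def shift(x): return x << 1
def square(x): return x * x      # passes the example, wrong in general
def constant(x): return 4        # hard-codes the example answer
def crash(x): return 1 // 0      # always fails

pool = [double, add_self, square, constant, shift, crash]
chosen = select_candidates(pool, [(2, 4)], probe_inputs=[0, 3, 5], limit=2)
print([f.__name__ for f in chosen])
```

In this toy run the three correct doubling functions land in one group, so their representative is ranked first, while the wrong programs that happen to pass the single example each sit alone in a smaller group.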
"The problem-solving capabilities required to excel in these competitions exceed the capabilities of existing AI systems," the DeepMind blog post argues, crediting "advances in large-scale transformer models (which have recently shown promising capabilities for generating code)" combined with "large-scale sampling and filtering."
According to the blog post, the researchers' results demonstrate the potential of deep learning even for tasks that require critical thinking - expressing solutions to problems in code. DeepMind describes the system as part of the company's mission to "solve intelligence" (which its website describes as "developing more general and better problem-solving systems" - also known as artificial general intelligence).
The blog post adds, "We hope our results will inspire the competitive programming community."
Reaction from human programmers
DeepMind's blog post also includes comments from Petr Mitrichev, identified as both a Google software engineer and a "world-class" competitive programmer, who was impressed that AlphaCode could even make progress in this area.
"Solving competitive programming problems is a really hard thing to do, requiring both good coding skills and creative problem solving," said Mitrichev.
Mitrichev also provided feedback on six of the solutions, noting that several submissions included pieces of code that were "useless but harmless."
In one of the solutions, AlphaCode declared an integer variable named x, then never used it. In another graph traversal case, AlphaCode unnecessarily sorted all adjacent vertices first (based on the depth of the graph they lead to). For another problem (requiring a computationally intensive brute-force solution), AlphaCode's extra code made its solution 32 times slower.
In fact, AlphaCode often just implemented a massive brute-force solution, writes Mitrichev.
But the AI system even fails like a human programmer, Mitrichev notes, citing a submission where, when the solution eludes it, AlphaCode "behaves a bit like a desperate human." It actually wrote code that simply always provides the same answer as provided in the example scenario of the problem, he wrote, "hoping that it works in all other cases."
"Humans do this too, and such hope is almost always wrong - as it is in this case."
How good were AlphaCode's results? CodeForces calculates a programmer's rating (using the standard Elo rating system also used to rank chess players) - and AlphaCode got a rating of 1,238.
But what's more interesting is to see where that score appears on a graph of all programmers competing on CodeForces over the past six months. The researchers' paper notes that AlphaCode's estimated score "is in the top 28 percent among these users."
Not everyone was impressed. Dzmitry Bahdanau, an AI researcher and assistant professor at McGill University in Montreal, pointed out on Twitter that many CodeForces participants are high school or college students - and that the time constraints on their problem solving have less impact on a pre-trained AI system.
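For readers unfamiliar with Elo-style ratings, the classic two-player formula turns a rating gap into an expected score. The short Python sketch below uses that textbook formula only; how CodeForces adapts it to contests with thousands of entrants is not described here, and the opponent ratings are hypothetical.

```python
def expected_score(rating_a, rating_b):
    """Classic Elo expected score of player A against player B (0 to 1)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


ALPHACODE_RATING = 1238  # rating reported for AlphaCode

# Hypothetical opponents at a few rating levels.
for opponent in (1038, 1238, 1638):
    score = expected_score(ALPHACODE_RATING, opponent)
    print(f"vs {opponent}: expected score {score:.2f}")
```

Against an equally rated opponent the expected score is exactly 0.5, and every 400-point gap shifts the odds by a factor of ten.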
Most importantly, AlphaCode's process involves filtering through a torrent of AI-generated code to find one that actually solves the problem at hand, so that "the vast majority of the programs AlphaCode generates are wrong."
So while this is a promising direction to explore, Bahdanau doesn't think it's a programming milestone: "It's not AlphaGo in terms of beating humans, nor is it AlphaFold in terms of revolutionizing an entire field of science. We still have work to do."