Dive Brief:
- ChatGPT isn’t as accurate as Stack Overflow users when answering software engineering questions, according to Purdue University research published last week. The researchers compared ChatGPT’s generated responses with user-written answers to 517 Stack Overflow questions.
- Stack Overflow serves as a question-and-answer platform for more than 100 million developers and engineers monthly, according to the company. Users can ask a coding question and receive an answer from a human, which can mean waiting for someone knowledgeable in the subject to respond.
- More than half of the answers generated by ChatGPT were incorrect, and ChatGPT-generated code differed from human-written answers in format, semantics and syntax, according to researchers. More than 3 in 5 ChatGPT responses were more verbose than human-written answers, according to the research.
Dive Insight:
Stack Overflow was one of the first organizations to restrict the use of ChatGPT, but unlike other enterprises that cited data privacy concerns, Stack Overflow was more concerned with accuracy.
Less than a week after the generative AI chatbot launched, the company banned developers and engineers from generating answers using the tool as it feared incorrect answers would lower the credibility of the site.
But as enthusiasm over ChatGPT’s effect on coding and broader IT operations spread, there were concerns that Stack Overflow would lose users to the faster alternative.
One-third of developers believe a productivity boost is the greatest upside to enhancing the software creation process with AI, according to a Stack Overflow survey of nearly 90,000 engineers in June.
The company changed its tune on generative AI in April, announcing in a blog post that it would begin incorporating the technology into its public platform and paid service. But users of the site were still concerned about the validity of answers generated by AI, information overload and data privacy as it relates to individual contributors on the platform.
“We aren’t surprised by the research paper’s findings that AI tools can be inaccurate,” Ellen Brandenberger, director of product innovation at Stack Overflow, said via email in reference to the work by Purdue researchers. “For the last several months, our team has been outlining our vision for community and AI coming together as the inevitable next phase of growth in generative AI’s trajectory.”
The company launched OverflowAI last month, a platform that lets users check, validate and attribute answers to confirm accuracy and trustworthiness across its more than 58 million questions and answers.
Researchers from the University of California, Berkeley, found that in some cases the behavior of OpenAI’s large language models is getting significantly worse over time. When presented with 50 code-generation problems from LeetCode’s easy category, the percentage of executable GPT-4 generated code dropped from 52% in March to 10% in June. GPT-3.5’s performance decreased from 22% to 2%.
Stack Overflow has experienced a small decline in traffic this year, dipping an average of 5% compared with 2022, according to a company blog post last week.
“The future of the internet and the modern tech landscape isn’t going to be measured by web traffic alone — it’s about the quality of content, trust in the content, and the communities of experts and human beings curating the content,” the company said.
The company expects to continue to see traffic fluctuate from historical norms as first-time coders leverage generative AI tools more often and as the technology spurs new questions that will bring users to the platform.
OpenAI did not immediately respond to a request for comment.