The Hg Data team explores OpenAI’s Codex, which converts human language into code, and assesses its potential value for B2B SaaS companies
Only a few months after its trial launch, Codex is already generating excitement. So far most attention has focused on software engineering, but we in the Hg Data team wanted to understand the implications for Data use cases too. In short: we see significant potential, and it is a tool we recommend exploring. Although still at an early stage, Codex can already boost efficiency and speed for Data teams by allowing a “lower code” approach to more complex data tasks. Data value creation use cases become more accessible, accelerating the commercial impact of your data.
What is Codex and what does it mean for you?
Less than a year after commercialising GPT-3, OpenAI, the world-leading AI company co-founded by Elon Musk, Sam Altman and others, has released Codex for private beta testing. Codex is built on GPT-3, a general language generation model (which Hg previously reviewed here), and is adapted to generate code. Under the hood it has been trained on >10M public code repositories.
Codex translates human language into code. There is a range of use cases, including fixing incorrect syntax, converting between coding languages, and, most impressively, generating full blocks of code from a conversational description of the desired output. It is able to solve 37% of benchmark coding problems from a natural language prompt. A typical prompt might read: “Create a SQL request to find all customers who have declined in product usage in the last 90 days and have an annual contract value of greater than $20k.”
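To make the example prompt above concrete, here is a hand-written sketch (not actual Codex output) of the kind of query such a prompt might produce, wrapped in Python's built-in sqlite3 module so that it can be run. The table names, the columns and the interpretation of “declined” (fewer usage events in the last 90 days than in the 90 days before that) are all assumptions made for illustration.

```python
import sqlite3

# Hypothetical schema, invented purely for this illustration.
SCHEMA = """
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    annual_contract_value REAL NOT NULL
);
CREATE TABLE usage_events (
    customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
    event_date TEXT NOT NULL,  -- ISO 8601 date, e.g. 2021-10-15
    events INTEGER NOT NULL
);
"""

# Assumption: "declined" means fewer events in the last 90 days
# than in the 90 days immediately before that window.
QUERY = """
WITH usage_by_period AS (
    SELECT
        customer_id,
        SUM(CASE WHEN event_date > date(:today, '-90 days')
                 THEN events ELSE 0 END) AS recent_events,
        SUM(CASE WHEN event_date <= date(:today, '-90 days')
                 THEN events ELSE 0 END) AS prior_events
    FROM usage_events
    WHERE event_date > date(:today, '-180 days')
      AND event_date <= :today
    GROUP BY customer_id
)
SELECT c.customer_id, c.name, u.prior_events, u.recent_events
FROM customers AS c
JOIN usage_by_period AS u ON u.customer_id = c.customer_id
WHERE c.annual_contract_value > 20000
  AND u.recent_events < u.prior_events
ORDER BY c.customer_id;
"""


def find_declining_customers(conn, today):
    """Return (id, name, prior_events, recent_events) for matching customers."""
    return conn.execute(QUERY, {"today": today}).fetchall()


def build_demo_db():
    """Create an in-memory database with a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?, ?)",
        [
            (1, "Acme", 50000.0),     # high contract value, usage falls
            (2, "Globex", 30000.0),   # high contract value, usage rises
            (3, "Initech", 10000.0),  # low contract value, usage falls
        ],
    )
    conn.executemany(
        "INSERT INTO usage_events VALUES (?, ?, ?)",
        [
            (1, "2021-06-15", 100), (1, "2021-10-15", 40),
            (2, "2021-06-15", 20), (2, "2021-10-15", 80),
            (3, "2021-06-15", 90), (3, "2021-10-15", 10),
        ],
    )
    return conn
```

Calling find_declining_customers(build_demo_db(), "2021-11-30") on the demo data returns only the high-value customer whose usage fell. Note that the prompt itself leaves “declined” undefined, which is exactly the kind of detail a data engineer still has to supply and check.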
When it was released in August 2021, the initial focus was on Software Engineering use cases.
We were curious to understand Codex’s impact for Data Engineering and Data Science use cases that we commonly see add material value to B2B SaaS companies.
What does this mean for your Data team in a B2B SaaS company?
Despite recent headlines, OpenAI have been quick to reiterate that Codex will not replace software engineers. The same holds true for its usage in data – Codex will not replace your Data team. However, Codex is a powerful tool that can accelerate typical data use cases we see in B2B SaaS companies across the Hg portfolio, especially for Data teams upskilling and transitioning away from drag and drop no-code solutions as use cases become more sophisticated. Whilst upskilling is helpful, the real impact should come from enabling a “lower code” approach to the most common data coding tasks, for example by accelerating and supporting coding in SQL & Python.
In particular, GitHub Copilot, a code editor plugin powered by Codex, could be a quick win for improving efficiency. GitHub Copilot can assist data engineers and scientists with more accurate autocompletions spanning multiple lines or whole functions, helping to reduce errors and accelerate output. If the Data team can code 5-10% faster when using Codex, this could equate to one additional data business use case per year. For each data use case at Hg, we are targeting multiple percentage points of EBITDA impact.
But there is no silver bullet. Data engineers coding faster does not solve the common data challenges we see within B2B software companies: unoptimised communication between the business, BI owners and data engineers; scattered tooling choices across a complicated data landscape; poor data quality leading to a lack of trust in data and spiralling complexity; the difficulty of productionising machine learning models and integrating them into the business, which requires robust DevOps and MLOps; and challenges in delivering a self-service data culture with proper data governance. The Hg Data team have developed playbooks and productised solutions with our portfolio companies to address these common challenges, and we don’t see AI replacing us here any time soon!
The Hg Data team received privileged beta access to OpenAI Codex, and you can see examples of the outputs we achieved below.
We used Codex to perform various Data Engineering tasks, such as generating SQL queries and carrying out data transformations. The resulting code was not always right, but it was often an impressive attempt. Importantly, this output can be leveraged by data engineers to kick-start the development of their own solution. Codex can also be leveraged for various Data Science use cases. However, creating a machine learning model without understanding the underlying data and code is unlikely to be useful. Nevertheless, there are still many repetitive tasks that Codex can help expedite.
The following sections show two examples of using Codex: one for a Data Engineering use case and one for a Data Science use case.
How could this impact Data Engineering teams?
Data Engineering teams are in high demand with ongoing data requests from the business. These teams are often understaffed as it is hard to find people with the appropriate experience (e.g. SQL, Spark, Python). While low-code tools, such as Matillion or Data Factory, have attempted to reduce the coding skillset required, we are seeing more growth in tools that leverage core coding skills, such as dbt and AWS Glue.
Data Engineering tasks are often highly modular and follow a structured workflow, making it easy to break down tasks. This in turn helps Codex to output smaller blocks of correct code, which could make it a great tool to upskill, support and accelerate data engineers who are implementing a code-based solution and transitioning away from drag-and-drop tools.
Data transformations can also often be easily described in a few sentences. However, executing them may require knowledge of the nuances of the coding language, which is where Codex excels. As you can see in our experimentation, even if Codex doesn’t return perfect results, the output is a valuable starting point, and even a small efficiency gain can bring meaningful benefits.
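As a hedged illustration of a transformation that is easy to describe but fiddly to write, take the one-sentence request “keep each customer's most recent record and add the number of days since that login”. The sketch below is hand-written rather than Codex output, and the field names (customer_id, last_login) are hypothetical.

```python
from datetime import date


def latest_record_per_customer(rows, as_of):
    """Keep each customer's most recent row and add days since that login.

    rows  : iterable of dicts with 'customer_id' and 'last_login' (a date)
    as_of : the reference date used to compute 'days_since_login'
    """
    latest = {}
    for row in rows:
        current = latest.get(row["customer_id"])
        # Replace the stored row only when this one is strictly newer.
        if current is None or row["last_login"] > current["last_login"]:
            latest[row["customer_id"]] = row

    # Return new dicts, ordered by customer_id, leaving the input untouched.
    return [
        {**latest[cid], "days_since_login": (as_of - latest[cid]["last_login"]).days}
        for cid in sorted(latest)
    ]


# Made-up example rows.
EXAMPLE_ROWS = [
    {"customer_id": 2, "last_login": date(2021, 11, 1), "plan": "pro"},
    {"customer_id": 1, "last_login": date(2021, 10, 1), "plan": "basic"},
    {"customer_id": 1, "last_login": date(2021, 11, 20), "plan": "pro"},
]
```

The description fits in one sentence, but the code still has to settle details such as tie-breaking on equal dates and output ordering. Those are the points a data engineer would want to verify in any generated draft.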
How could this impact Data Science teams?
Codex can also accelerate scripted Data Science and exploratory Machine Learning (ML) prototyping, which often leverages boilerplate code. We would not ask Codex to build a full churn prediction model in one go. However, it can help accelerate writing steps such as train/test splits, hyperparameter optimisation and other subtasks. Codex is especially useful if your Data Science team works across different languages or frameworks, which is common in companies that have undergone a lot of M&A. Of course, expert input is still required, both to break down the problem and to verify the output.
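Subtasks of this kind are typically short, boilerplate-heavy functions. The following is a minimal hand-written sketch (not Codex output) of a reproducible train/validation/test split and a grid search over one hyperparameter, the k of a toy one-feature nearest-neighbour classifier. It uses only the Python standard library. The function names and the toy model are assumptions for illustration, and a real project would more likely rely on a library such as scikit-learn.

```python
import random


def split(rows, fraction, seed):
    """Shuffle a copy of rows reproducibly; return (remainder, held_out)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_held = max(1, round(len(shuffled) * fraction))
    return shuffled[n_held:], shuffled[:n_held]


def knn_predict(train_rows, x, k):
    """Majority vote among the k training points nearest to x (1-D feature)."""
    nearest = sorted(train_rows, key=lambda row: abs(row[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return votes * 2 > len(nearest)


def accuracy(train_rows, eval_rows, k):
    """Share of eval_rows whose label the k-NN model predicts correctly."""
    hits = sum(knn_predict(train_rows, x, k) == label for x, label in eval_rows)
    return hits / len(eval_rows)


def grid_search_k(train_rows, valid_rows, candidates):
    """Pick the k with the best validation accuracy (first candidate wins ties)."""
    return max(candidates, key=lambda k: accuracy(train_rows, valid_rows, k))


def make_toy_rows(n=20):
    """Made-up data: one numeric feature and a boolean label (True if low)."""
    return [(float(i), i < n // 2) for i in range(n)]
```

A typical flow would hold out a test set first, split the remainder into training and validation sets, choose k on the validation set, and only then report accuracy on the untouched test set.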
A big challenge we currently see in Data Science is moving ML models from the initial exploratory phase into production. This discipline, MLOps, is a focus area for the Hg Data team as we see companies trying to apply DevOps best practices from Software Engineering to their ML pipelines. Codex isn’t going to solve productionising ML, but it can be leveraged to accelerate the exploration phase.
Codex is not a replacement for data engineers and data scientists. However, it is a tool worth experimenting with, and it can already help your teams code more efficiently today. With rapid innovation, in a few years’ time it might be impossible to imagine your Data team without a Codex-like tool, so now is the time to start trying it out.