Are data scientists still the key to AI?


As AI systems become a bigger part of our daily lives, the demand for people with the skills to build and operate these systems will continue to grow. In the past, data scientists were essential to building and managing AI systems. But now that AI systems are easier to use and more accessible, are data scientists still key to making AI work in most organizations?

AI systems are all about data. It remains important to know how to process data to get results. Data scientists are typically tasked with developing models that transform large amounts of data into insights and patterns. These insights can be used for a variety of activities, from descriptive and diagnostic analytics to advanced machine learning models, which can be applied to all seven patterns of AI.

Data scientists have all the relevant competencies, but they are highly skilled, expensive and hard to find. The rate at which organizations are looking to implement and leverage AI capabilities is far outpacing the market's ability to supply talented and experienced data scientists.

Using and Building AI Models

When thinking about the skill sets needed now and in the future, we must first distinguish between building AI models from scratch and simply using models that have already been developed. The power of generative AI systems and large language models (LLMs) shows that AI capabilities can be accessed by anyone and produce impressive results.

You certainly don't need to be a data scientist to get a lot of value from an LLM system. AI capabilities are increasingly being built into everyday tools and applications, so you don't need data science skills just to benefit from using AI systems.

Instead, organizations need to develop prompt engineering skills to benefit from off-the-shelf LLM systems. Effective prompt engineering is more about soft skills than hard skills. You don't need expertise in math, programming, or statistical analysis to be a good prompt engineer. Prompt engineering requires knowing the appropriate prompt patterns for different situations, as well as strong critical thinking, creativity, collaboration, and communication skills. These liberal arts-focused competencies are more accessible, less expensive, and easier to cultivate in existing talent than data science skills.

Fine-Tuning and RAG: New Skill Sets

But what if you want to take it to the next level? Publicly available models may be suitable for general needs, but they fall short on tasks involving private data and domain- or context-specific requirements: the kinds of things generic generative AI models weren't built for. Of course, these public models are getting better every day, so the scope of what genAI systems can do continues to expand. But the problem of private and domain-specific needs remains. Meeting those needs requires skills beyond prompt engineering and related soft skills, though not skills as advanced as machine learning engineering or data science.

Fine-Tuning

If you want to tune a generically trained model for more domain-specific responses, fine-tuning is one approach. Fine-tuning involves collecting many examples of specific prompts and responses and providing those examples to the LLM provider's fine-tuning API. For example, to fine-tune an OpenAI GPT model with your own data, you would collect example data sets and then use a fairly basic Python script to feed them to the OpenAI API, producing a customized, fine-tuned model.
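As a minimal sketch of this workflow, the script below builds a training file in the JSONL chat format that OpenAI's fine-tuning API expects. The example prompt/response pairs are hypothetical, and the actual upload and job creation (which require an API key) are shown as commented-out calls:

```python
import json

# Hypothetical domain-specific prompt/response pairs
examples = [
    {"prompt": "What is our returns window?", "response": "30 days from delivery."},
    {"prompt": "Do you ship internationally?", "response": "Yes, to 40+ countries."},
]

# The fine-tuning API expects a JSONL file of chat-formatted records,
# one JSON object per line.
with open("train.jsonl", "w") as f:
    for ex in examples:
        record = {"messages": [
            {"role": "user", "content": ex["prompt"]},
            {"role": "assistant", "content": ex["response"]},
        ]}
        f.write(json.dumps(record) + "\n")

# Uploading the file and starting the job would look roughly like this
# (requires the openai package and an OPENAI_API_KEY):
# from openai import OpenAI
# client = OpenAI()
# training_file = client.files.create(file=open("train.jsonl", "rb"),
#                                     purpose="fine-tune")
# job = client.fine_tuning.jobs.create(training_file=training_file.id,
#                                      model="gpt-4o-mini")
```

In practice you would need many more examples than this; the data-collection step, not the script, is where most of the effort goes.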

If you want the LLM to work with your own custom data, you can use the retrieval-augmented generation (RAG) approach. In RAG, you store your custom data in a database indexed with the same vector embedding approach that underpins LLMs. Then, when a user submits a prompt, you first retrieve the relevant information from the database and then have the LLM answer the user's request against that data, supplied as part of the prompt context. The skills required to build a RAG system are primarily programming skills, to coordinate between the LLM and the database, and data skills, to collect and process the data loaded into the RAG database.
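The retrieve-then-prompt loop can be sketched in a few lines. This toy version substitutes a bag-of-words similarity for a real embedding model and an in-memory list for a vector database; the documents and queries are hypothetical:

```python
import math
from collections import Counter

def embed(text):
    # Toy stand-in for a real embedding model: word-count vectors.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# "Index" the custom documents: store each alongside its vector.
documents = [
    "Refunds are issued within 30 days of purchase.",
    "Our headquarters are located in Berlin.",
]
index = [(doc, embed(doc)) for doc in documents]

def retrieve(query, k=1):
    # Rank documents by similarity to the query and return the top k.
    query_vec = embed(query)
    ranked = sorted(index, key=lambda pair: cosine(query_vec, pair[1]),
                    reverse=True)
    return [doc for doc, _ in ranked[:k]]

def build_prompt(query):
    # Inject the retrieved data into the prompt context for the LLM.
    context = "\n".join(retrieve(query))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```

In a production system, `embed` would call an embedding model, the index would live in a vector database, and `build_prompt`'s output would be sent to the LLM; but the division of labor (retrieval logic plus data preparation) is the same.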

Data scientists are useful as part of the fine-tuning and RAG development process, but are not as necessary as they are when developing machine learning models from scratch. Because so much can be done with prompt engineering, fine-tuning, and RAG, the set of tasks that require data scientists and machine learning engineers is becoming narrower.

Do Data Engineers Get the Recognition They Deserve?

Data scientists still play a critical role in the continued development and advancement of AI, particularly in building and maintaining foundational models, as well as in the wide range of work data scientists do outside of AI. However, the common thread that ties advanced model development, prompt engineering, fine-tuning, and RAG development together is the need for high-quality, relevant data.

While data science and the role of the data scientist have been in the spotlight over the past decade, it's clear that data engineering deserves more attention. The primary role of data engineering is to make data available for AI and analytics. Data engineers move data, keep it consistent and clean, and manage the data engineering pipeline to keep it flowing to systems that rely on a continuous stream of good data. From this perspective, data engineers are even more important to AI projects than data scientists. Data engineers may be the most important hire for the next decade as organizations put AI to the test.
