Python Pandas is an open-source toolkit which provides data scientists and analysts with data manipulation and analysis capabilities using the Python programming language. The Pandas library is very popular in the preprocessing phase of machine learning and deep learning. But now you can do more with it…
Introducing a new data science library: Pandas AI, a Python library that integrates generative artificial intelligence capabilities into Pandas, making data frames conversational.
What does making data frames conversational mean?
It means exactly what it says: you can talk to your dataset and get fast responses. As a data scientist or analyst, you no longer need to stare at your dataset, skimming through rows and columns for hours on end. Pandas AI does not replace Pandas; it just gives it a big push!
Data scientists and analysts spend a lot of time cleaning data for the analysis phase. Data professionals are always looking for methods and processes that minimize the time spent on data preparation, and Pandas AI gives them one more option for taking their data analysis to the next level.
PandasAI is meant to be used hand-in-hand with Pandas, not as a replacement for it. Rather than skimming through the dataset and answering questions about it yourself, you can ask PandasAI these questions and it will return answers in the form of Pandas DataFrames.
With that being said, does this mean that people no longer need to be proficient in Python to achieve data analysis using tools such as the Pandas library?
With the help of the OpenAI API, Pandas AI aims to let you virtually talk with a machine to get the results you want rather than having to program the task yourself. The machine outputs the result in its own language: machine-interpretable code (a DataFrame).
Installing Pandas AI using pip
pip install pandasai
Importing PandasAI with OpenAI
To make use of the new Pandas AI library, you will need an OpenAI API key. Once you have opened your notebook, import the following:
import pandas as pd
from pandasai import PandasAI
from pandasai.llm.openai import OpenAI

# Replace the placeholder string with your own OpenAI API key
llm = OpenAI(api_token="YOUR_API_KEY")
If you do not have an OpenAI API key, you can create an account on the OpenAI platform and generate one there. You will receive a $5 credit that can be used towards exploring and experimenting with the API.
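Hard-coding the key in a notebook makes it easy to leak, so one option is to read it from an environment variable instead. The following is a minimal sketch and not part of Pandas AI: the helper name `load_api_key` is hypothetical, and `OPENAI_API_KEY` is assumed to be the variable where you stored your key.

```python
import os


def load_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Return the API key stored in an environment variable.

    Hypothetical helper for illustration; the variable name is an assumption.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        # Fail early with a clear message instead of sending an empty key.
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return key
```

You could then pass the result to the constructor shown above, for example `OpenAI(api_token=load_api_key())`.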
Once you are all set up, you’re ready to start using Pandas AI.
Running the Model on Your Dataframe
First, you will need to pass your OpenAI model to Pandas AI:
pandas_ai = PandasAI(llm)
You will then need to run the model on the data frame. The call takes two parameters: the data frame you're working with and the question you want to ask:
pandas_ai.run(df, prompt="the question you would like to ask?")
For example, you may be looking through your dataset and want to know which rows rank highest on a particular column. Rather than filtering and sorting the rows yourself, you can ask Pandas AI:
import pandas as pd
from pandasai import PandasAI
from pandasai.llm.openai import OpenAI

# Sample DataFrame
df = pd.DataFrame({
    "country": ["United States", "United Kingdom", "France", "Germany", "Italy", "Spain", "Canada", "Australia", "Japan", "China"],
    "gdp": [19294482071552, 2891615567872, 2411255037952, 3435817336832, 1745433788416, 1181205135360, 1607402389504, 1490967855104, 4380756541440, 14631844184064],
    "happiness_index": [6.94, 7.16, 6.66, 7.07, 6.38, 6.4, 7.23, 7.22, 5.87, 5.12],
})

# Instantiate an LLM (replace the placeholder string with your own OpenAI API key)
llm = OpenAI(api_token="YOUR_API_KEY")

pandas_ai = PandasAI(llm)
pandas_ai.run(df, prompt="Which are the 5 happiest countries?")
It will return the answer as Pandas output, in this case a Series of country names:
6            Canada
7         Australia
1    United Kingdom
3           Germany
0     United States
Name: country, dtype: object
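Because the library can make mistakes, it can help to cross-check an answer against plain Pandas. The sketch below reproduces the same question by hand with `DataFrame.nlargest`. It assumes the sample DataFrame from the example above and does not call Pandas AI or OpenAI at all.

```python
import pandas as pd

# Same sample data as in the Pandas AI example
df = pd.DataFrame({
    "country": ["United States", "United Kingdom", "France", "Germany", "Italy",
                "Spain", "Canada", "Australia", "Japan", "China"],
    "gdp": [19294482071552, 2891615567872, 2411255037952, 3435817336832,
            1745433788416, 1181205135360, 1607402389504, 1490967855104,
            4380756541440, 14631844184064],
    "happiness_index": [6.94, 7.16, 6.66, 7.07, 6.38, 6.4, 7.23, 7.22, 5.87, 5.12],
})

# Take the five rows with the highest happiness_index,
# then keep only the country names.
top5 = df.nlargest(5, "happiness_index")["country"]
print(top5)
```

The index labels (6, 7, 1, 3, 0) and the order of the countries match the output shown above, which is a quick way to confirm that the answer is what you expected.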
It also has the ability to perform more complex queries, such as mathematical calculations and data visualizations.
A data visualization example:
pandas_ai.run(
    df,
    "Plot the histogram of countries showing for each the gdp, using different colors for each bar",
)
Data visualization output:
Image by PandasAI
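For comparison, a chart like the one requested can also be drawn by hand. This is a rough sketch that assumes Matplotlib is installed; it is not a description of how Pandas AI builds its plot. The prompt says "histogram", but one bar per country is really a bar chart, so the sketch uses `Axes.bar`. The output file name is an arbitrary choice.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so the script also runs without a display

import matplotlib.pyplot as plt
import pandas as pd

# Same countries and GDP values as the sample DataFrame
df = pd.DataFrame({
    "country": ["United States", "United Kingdom", "France", "Germany", "Italy",
                "Spain", "Canada", "Australia", "Japan", "China"],
    "gdp": [19294482071552, 2891615567872, 2411255037952, 3435817336832,
            1745433788416, 1181205135360, 1607402389504, 1490967855104,
            4380756541440, 14631844184064],
})

# "C0".."C9" are Matplotlib's default color-cycle names, one per bar
colors = [f"C{i}" for i in range(len(df))]

fig, ax = plt.subplots(figsize=(10, 5))
bars = ax.bar(df["country"], df["gdp"], color=colors)
ax.set_ylabel("GDP")
ax.set_title("GDP by country")
ax.tick_params(axis="x", labelrotation=45)  # keep long country names readable
fig.tight_layout()
fig.savefig("gdp_by_country.png")  # arbitrary output file name
```

Writing this out shows what the one-line prompt saves you: choosing the chart type, the colors, and the label layout yourself.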
Pandas AI is very new, and the team is still looking at ways to improve the library. As of the 10th of May, they still have the following on their to-do list:
- Add support for more LLMs
- Make PandasAI available from a CLI
- Create a web interface for PandasAI
- Add unit tests
They welcome suggestions and contributions. If you are interested in contributing to the growth of Pandas AI, please refer to the contributing guidelines.
Although Pandas AI does not replace Pandas, it is a good tool for boosting your workflow. You can ask it questions about your dataset, but you will still need to be proficient in programming to correct and direct the library when it makes mistakes.
If you’ve had a chance to play around with Pandas AI, let us know what you think about it in the comments below!
Nisha Arya is a Data Scientist, Freelance Technical Writer and Community Manager at KDnuggets. She is particularly interested in providing Data Science career advice, tutorials and theory-based knowledge around Data Science. She also wishes to explore the different ways Artificial Intelligence benefits, or can benefit, the longevity of human life. She is a keen learner, seeking to broaden her tech knowledge and writing skills whilst helping guide others.