

Image by Author | ChatGPT
# Introduction
Getting real-world data for a data science project is often the hardest part. Toy datasets are easy to find, but for high-quality or real-time data you usually need to use APIs or build custom scraping pipelines to extract information from the web.
In this article, I share 10 of my favorite free APIs, which I use every day to collect data, integrate data, and build AI agents. They are organized into five categories, ranging from trusted data repositories to web scraping and web search, so you can quickly pick the right tool and move from data to insights.
# Trusted Data Repositories
Trusted data repositories are community-based platforms where organizations and open-source contributors share datasets with the wider world. You can access these datasets in your project with simple commands.
// 1. Kaggle API
Kaggle datasets are extremely popular for data science projects. Instead of downloading them manually, you can create a data pipeline that automatically downloads a dataset, unzips it, and loads it into your workspace.
These datasets are shared by the open-source community for everyone to use. To get started, generate an API key from your Kaggle account and set it as an environment variable. Kaggle also provides a Python SDK, which makes it easy to integrate with your code. You can then run the following command in your terminal:
kaggle datasets download -d kingabzpro/world-vaccine-progress -p data --unzip
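To turn that command into the automated pipeline described above, here is a minimal Python sketch. It assumes the `kaggle` CLI is installed and your credentials are already configured. The function names and the default `data` folder are my own illustrative choices, not part of Kaggle's SDK.

```python
import csv
import subprocess
from pathlib import Path


def build_download_command(dataset: str, dest: str = "data") -> list:
    """Return the Kaggle CLI invocation shown above as an argument list."""
    return ["kaggle", "datasets", "download", "-d", dataset, "-p", dest, "--unzip"]


def download_dataset(dataset: str, dest: str = "data") -> None:
    """Run the Kaggle CLI (assumes it is installed and authenticated)."""
    subprocess.run(build_download_command(dataset, dest), check=True)


def load_csv_files(dest: str = "data") -> dict:
    """Read every CSV found under dest into a list of row dictionaries, keyed by file name."""
    tables = {}
    for path in sorted(Path(dest).rglob("*.csv")):
        with path.open(newline="", encoding="utf-8") as handle:
            tables[path.name] = list(csv.DictReader(handle))
    return tables
```

Calling `download_dataset("kingabzpro/world-vaccine-progress")` and then `load_csv_files()` should leave you with the dataset's CSV files as plain Python rows, ready to hand to whatever analysis library you prefer.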
// 2. Hugging Face CLI
Similar to Kaggle, Hugging Face is a data science and machine learning community that shares datasets, models, and demos. The Hugging Face CLI is easy to install and integrate into your workflow, using either CLI commands or Python code. Both options let you download datasets without an API key.
An access token is only required if the dataset is gated.
hf download kingabzpro/dermatology-qa-firecrawl-dataset --repo-type dataset
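For the Python route, a minimal sketch could look like the following. It assumes the third-party `huggingface_hub` package is installed; the wrapper name `download_dataset` is hypothetical and only there for illustration.

```python
from typing import Optional


def download_dataset(repo_id: str, local_dir: Optional[str] = None) -> str:
    """Download every file in a Hugging Face dataset repo and return the local path."""
    # Imported lazily so this module still loads where huggingface_hub is missing.
    from huggingface_hub import snapshot_download

    # repo_type="dataset" tells the Hub to look for a dataset repo, not a model repo.
    return snapshot_download(repo_id=repo_id, repo_type="dataset", local_dir=local_dir)
```

For example, `download_dataset("kingabzpro/dermatology-qa-firecrawl-dataset")` would fetch the dataset into the local Hugging Face cache. A gated dataset would additionally need an access token configured beforehand.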
# Web and Crawling APIs
The web contains a huge variety of data. If you can't find the information you need on the platforms above, you may need to either scrape the web or curate your own data using a web search API.
// 3. Firecrawl
Firecrawl provides an API for extracting content from a website and converting it to markdown, which makes AI integration easier. It also offers scraping and extraction APIs integrated with large language models (LLMs) for more advanced web scraping.
This API is a must-have. I use it every day for data creation and in AI projects.
curl -s -X POST "https://api.firecrawl.dev/v2/scrape" \
-H "Authorization: Bearer $FIRECRAWL_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://abid.work",
"formats": ["markdown", "html"]
}'
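Here is a rough Python equivalent of the request above, using only the standard library. The helper names are illustrative, and the `data.markdown` path in the response is my assumption about the payload shape, so check it against an actual response before relying on it.

```python
import json
import os
import urllib.request

FIRECRAWL_SCRAPE_URL = "https://api.firecrawl.dev/v2/scrape"


def build_scrape_request(url: str, api_key: str) -> urllib.request.Request:
    """Build the same POST request as the curl example."""
    body = json.dumps({"url": url, "formats": ["markdown", "html"]}).encode("utf-8")
    return urllib.request.Request(
        FIRECRAWL_SCRAPE_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def scrape_markdown(url: str) -> str:
    """Send the request and return the page content as markdown.

    Assumption: the JSON response nests the content under data.markdown.
    """
    request = build_scrape_request(url, os.environ["FIRECRAWL_API_KEY"])
    with urllib.request.urlopen(request, timeout=60) as response:
        payload = json.load(response)
    return payload["data"]["markdown"]
```

With `FIRECRAWL_API_KEY` set in your environment, `scrape_markdown("https://abid.work")` should return the page as a markdown string that you can save to disk or pass to an LLM.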
// 4. Tavily
Tavily is a fast web search API that offers 1,000 search requests per month for free. It is accurate and quick. You can use it to create datasets, integrate it into AI projects, or use it as a simple search API for your development needs.
curl --request POST \
--url https://api.tavily.com/search \
--header "Authorization: Bearer $TAVILY_API_KEY" \
--header "Content-Type: application/json" \
--data '{
"query": "who is Leo Messi?",
"auto_parameters": false,
"topic": "general",
"search_depth": "basic",
"chunks_per_source": 3,
"max_results": 1,
"days": 7,
"include_answer": true,
"include_raw_content": true,
"include_images": false,
"include_image_descriptions": false,
"include_favicon": false,
"include_domains": [],
"exclude_domains": [],
"country": null
}'
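Since search results are a handy starting point for a dataset, here is a small sketch that flattens a search response into CSV rows. It assumes the response carries a `results` list whose items have `title`, `url`, and `content` fields. The sample payload in the usage note is hand-written for illustration, not real API output.

```python
import csv
import io

FIELDS = ["title", "url", "content"]


def results_to_rows(payload: dict) -> list:
    """Keep only the fields we want from each search result."""
    return [
        {field: item.get(field, "") for field in FIELDS}
        for item in payload.get("results", [])
    ]


def rows_to_csv(rows: list) -> str:
    """Serialize the rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

For instance, `rows_to_csv(results_to_rows({"results": [{"title": "Example", "url": "https://example.com", "content": "text", "score": 0.9}]}))` drops the extra `score` field and returns a two-line CSV string.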
# Geospatial and Weather APIs
Weather and geospatial data change constantly, so you need real-time access to these datasets through APIs.
// 5. OpenWeatherMap
OpenWeatherMap is a service that provides global weather data via APIs, including current conditions, forecasts, nowcasts, historical records, and even hyperlocal minute-by-minute precipitation forecasts.
curl "https://api.openweathermap.org/data/2.5/weather?q=London&appid=YOUR_API_KEY&units=metric"
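The same call is easy to wrap in Python. The sketch below builds the URL from the curl example and pulls a few fields out of the JSON response. The helper names are mine, and the fields I read (`name`, `main.temp`, `main.humidity`, `weather[0].description`) are the ones I expect in a current-weather response, so treat them as assumptions.

```python
import json
import urllib.request
from urllib.parse import urlencode

BASE_URL = "https://api.openweathermap.org/data/2.5/weather"


def build_weather_url(city: str, api_key: str, units: str = "metric") -> str:
    """Build the current-weather URL with properly encoded query parameters."""
    return f"{BASE_URL}?{urlencode({'q': city, 'appid': api_key, 'units': units})}"


def summarize_weather(payload: dict) -> dict:
    """Reduce the full response to a handful of fields for a dataset row."""
    return {
        "city": payload["name"],
        "temp": payload["main"]["temp"],
        "humidity": payload["main"]["humidity"],
        "description": payload["weather"][0]["description"],
    }


def fetch_weather(city: str, api_key: str) -> dict:
    """Call the API and return the summarized current conditions."""
    with urllib.request.urlopen(build_weather_url(city, api_key), timeout=30) as response:
        return summarize_weather(json.load(response))
```

Calling `fetch_weather("London", "YOUR_API_KEY")` on a schedule and appending each result to a file is a simple way to build your own weather time series.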
// 6. OpenStreetMap
OpenStreetMap (OSM) provides open map data for the whole world. The Overpass API is a read-only web database API that serves custom-selected parts of the OSM data, which you query using Overpass QL. The example below fetches cafe nodes in a small London bounding box.
curl -G "https://overpass-api.de/api/interpreter" \
--data-urlencode 'data=[out:json];node["amenity"="cafe"](51.50,-0.15,51.52,-0.10);out;'
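To reuse that query for other areas, you can generate the Overpass QL from a bounding box and then pick out the named places. This is a minimal sketch with illustrative function names. It assumes the JSON response has an `elements` list in which each node carries `lat`, `lon`, and optional `tags`.

```python
import json
import urllib.parse
import urllib.request

OVERPASS_URL = "https://overpass-api.de/api/interpreter"


def build_cafe_query(south, west, north, east) -> str:
    """Overpass QL for cafe nodes inside a (south, west, north, east) bounding box."""
    return f'[out:json];node["amenity"="cafe"]({south},{west},{north},{east});out;'


def fetch_elements(query: str) -> list:
    """POST the query as the form field `data` and return the matched elements."""
    body = urllib.parse.urlencode({"data": query}).encode("utf-8")
    with urllib.request.urlopen(OVERPASS_URL, data=body, timeout=60) as response:
        return json.load(response)["elements"]


def named_places(elements: list) -> list:
    """Return (name, lat, lon) tuples, skipping nodes without a name tag."""
    places = []
    for element in elements:
        name = element.get("tags", {}).get("name")
        if name:
            places.append((name, element["lat"], element["lon"]))
    return places
```

`named_places(fetch_elements(build_cafe_query(51.50, -0.15, 51.52, -0.10)))` covers the same London bounding box as the curl example.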
# Financial Market Data APIs
Financial market data APIs are highly recommended if you are working on financial projects and need real-time data on stocks, crypto, and other financial information and news.
// 7. Alpha Vantage
Alpha Vantage is a financial data platform that offers free APIs for real-time and historical market data across stocks, forex, cryptocurrencies, commodities, and options, with JSON or CSV output. It also provides chart-ready time series at intraday, daily, weekly, and monthly intervals, along with over 50 technical indicators for analysis.
curl "https://www.alphavantage.co/query?function=TIME_SERIES_DAILY&symbol=IBM&apikey=YOUR_API_KEY"
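The JSON that comes back is keyed by date, with numbered string fields, so a little reshaping helps before analysis. The sketch below assumes the daily series sits under a `Time Series (Daily)` key with fields such as `1. open` and `4. close`. The function name is my own, and the sample values in the usage note are made up.

```python
def parse_daily_series(payload: dict) -> list:
    """Convert the daily time series into date-sorted rows with numeric values."""
    key = "Time Series (Daily)"
    if key not in payload:
        # A missing series usually means the API returned a message instead of data.
        raise ValueError(f"No daily series in response: {payload}")
    rows = []
    for day in sorted(payload[key]):
        bar = payload[key][day]
        rows.append(
            {
                "date": day,
                "open": float(bar["1. open"]),
                "high": float(bar["2. high"]),
                "low": float(bar["3. low"]),
                "close": float(bar["4. close"]),
                "volume": int(bar["5. volume"]),
            }
        )
    return rows
```

Passing in a response with two trading days returns two dictionaries, oldest first, which you can load straight into a data frame.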
// 8. Yahoo Finance
yfinance is a Python library, used by many beginners and practitioners, that accesses stock quotes, historical time series data, dividends and splits, and basic metadata from Yahoo Finance. It lets you build rapid prototypes and analysis-ready data frames for classroom projects.
Yahoo Finance itself offers free stock quotes, news, portfolio tools, and international market coverage, so you can explore a wide range of market data at no direct cost.
import yfinance as yf
print(yf.download("AAPL", period="1y").head())
# Social and Community Data APIs
If you are working on a project that analyses text and community conversations from top social media platforms, these APIs provide easy access to real social media data.
// 9. Reddit
Reddit provides a rich, community-driven data source. The Python Reddit API Wrapper (PRAW) gives you easy access to the official Reddit API from Python for tasks like retrieving posts, comments, subreddit metadata, and more.
PRAW works by sending requests to Reddit's API under the hood and is commonly used in education and research to collect discussion threads for analysis.
import praw
r = praw.Reddit(
client_id="ID",
client_secret="SECRET",
user_agent="myapp:ds-project:v1 (by u/yourname)"
)
print([s.title for s in r.subreddit("Python").hot(limit=5)])
// 10. X
X (formerly known as Twitter) offers a developer platform with REST endpoints for user and content retrieval, as well as real-time data streaming options. Access typically requires authentication, compliance with rate limits and policies, and choosing an access tier that is appropriate for your volume and use case.
curl -H "Authorization: Bearer YOUR_BEARER_TOKEN" \
"https://api.x.com/2/users/by/username/jack"
# Final Thoughts
These APIs provide free access to data that is often difficult to retrieve. They extend your ability to collect web data and improve your web scraping efforts, allowing you to create customized datasets.
I highly recommend bookmarking this article and revisiting it whenever you need high-quality, real-time data from the web. By leveraging these APIs, you can unlock valuable insights to support your research and analysis.
Abid Ali Awan (@1abidaliawan) is a certified data scientist who loves building machine learning models. Currently, he focuses on content creation and writes technical blogs on machine learning and data science technologies. Abid holds a Master's degree in technology management and a bachelor's degree in telecommunication engineering. His vision is to build an AI product using a graph neural network for students struggling with mental illness.
