Today, when we talk about artificial intelligence (AI) in business and society, what we really mean is machine learning (ML). ML refers to applications that use algorithms (sets of instructions) to get better and better at a particular task as they are exposed to more data related to that task.
These tasks range from answering questions and creating text and images (as demonstrated by apps like ChatGPT and DALL-E) to recognizing images (computer vision) and navigating from point A to point B in self-driving cars.
All these tasks require data, and companies that want to train their own ML algorithms to automate routine tasks need data sources.
What kinds of data are there?
Business data typically falls into one of two categories: internal data and external data.
Internal data is data collected by the organization itself from within its own operations. This typically includes financial data, customer feedback data, human resources data, operational data, and many other sources. Data collected by an organization by monitoring its own operations is called proprietary data and is valuable because it provides information specific to that business.
External data comes from sources outside your organization, typically third-party data providers. When data is freely available to everyone, it is called open data.
In addition to this, data can also be classified as either structured data, unstructured data, or semi-structured data.
Structured data is information that fits neatly into a table. For example, sales data showing which products a company sold, when, where, and for how much is internal structured data. Alternatively, a company might analyze historical market data and economic indicators to predict future movements in the markets in which it operates (external structured data).
Unstructured data includes photos, videos, text, social media posts, and anything else that does not fit into a table. It can certainly contain valuable insights but is more difficult to analyze. However, AI has proven particularly useful in extracting meaning from unstructured data. For example, an image recognition algorithm could tell a company useful facts about customer behavior by analyzing CCTV footage of customers in its stores (internal unstructured data). Analyzing business-related images posted on social media (external unstructured data) can also yield valuable insights.
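To make the distinction concrete, here is a minimal sketch using only Python's standard library. The sales table, the feedback record, and the review text are all made-up illustrations, and the helper names are hypothetical. The point is that structured data can be aggregated directly, semi-structured data needs handling for missing fields, and unstructured text needs some processing before it yields anything countable.

```python
import csv
import io
import json
from collections import Counter

# Structured data: rows and columns (a tiny, made-up sales table).
SALES_CSV = """product,region,units,unit_price
widget,north,10,2.50
widget,south,4,2.50
gadget,north,3,10.00
"""


def revenue_by_product(csv_text):
    """Sum units * unit_price per product from CSV text."""
    totals = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        revenue = int(row["units"]) * float(row["unit_price"])
        totals[row["product"]] = totals.get(row["product"], 0.0) + revenue
    return totals


# Semi-structured data: keyed fields, but they may be nested or absent.
FEEDBACK_JSON = '{"customer": "c-001", "rating": 4, "tags": ["delivery", "price"]}'


def feedback_tags(json_text):
    """Return the list of tags, or an empty list if the field is absent."""
    return json.loads(json_text).get("tags", [])


# Unstructured data: free text with no schema (a made-up review).
REVIEW_TEXT = "Great widget. The widget arrived late, but the widget works."


def word_counts(text):
    """Count lowercase words after stripping basic punctuation."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    return Counter(w for w in words if w)


if __name__ == "__main__":
    print(revenue_by_product(SALES_CSV))
    print(feedback_tags(FEEDBACK_JSON))
    print(word_counts(REVIEW_TEXT).most_common(1))
```

A word count is, of course, far simpler than the image or language models discussed in this article; it only stands in for the extra processing step that unstructured data requires.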
Fortunately, data is everywhere. No matter what you do, if you need external data, chances are that the source is online. Governments, research institutions, private companies, and non-governmental organizations all routinely make data freely available for research and even commercial purposes. So here are some of the best sources of free online data available in 2023.
Data search engines and repositories
Google Dataset Search – This is actually a search engine for datasets cataloged by Google. Use it to find almost any data you need.
AWS Open Data Search – Another dataset search engine, this one powered by Amazon’s AWS service.
Microsoft Research Open Data – Free, open datasets collected by Microsoft, focused primarily on science.
UCI Machine Learning Repository – A repository of over 600 open datasets curated and maintained by the University of California, Irvine, available for training machine learning algorithms.
Kaggle Datasets – The online data science platform Kaggle also offers a curated catalog of datasets covering everything from university rankings to Google search trends, retail sales, online movie reviews and crime statistics.
Reddit r/datasets – A huge collection of datasets submitted by users of the online community site Reddit, covering hundreds of subjects.
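Whichever repository a dataset comes from, it is worth checking its shape and completeness before training anything on it. Here is a minimal sketch, assuming the dataset is a CSV file. The `summarize_csv` helper and the sample contents are hypothetical and stand in for a real download.

```python
import csv
import io


def summarize_csv(file_obj):
    """Return column names, row count, and the number of empty cells per column."""
    reader = csv.DictReader(file_obj)
    columns = list(reader.fieldnames or [])
    missing = {name: 0 for name in columns}
    rows = 0
    for row in reader:
        rows += 1
        for name in columns:
            value = row.get(name)
            # DictReader yields None for fields missing from a short row.
            if value is None or value.strip() == "":
                missing[name] += 1
    return {"columns": columns, "rows": rows, "missing": missing}


# Made-up sample standing in for a downloaded file.
SAMPLE = """university,country,rank
Example University,US,1
Sample Institute,,2
Demo College,UK,
"""

if __name__ == "__main__":
    print(summarize_csv(io.StringIO(SAMPLE)))
    # For a real download, pass an open file instead, for example:
    # with open("dataset.csv", newline="", encoding="utf-8") as f:
    #     print(summarize_csv(f))
```

Columns with many empty cells may need cleaning, or may be best dropped, before the data is used for training.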
Government and intergovernmental organization datasets
Data.Gov – An open data portal provided by the US government. Hosts nearly 250,000 datasets published by all government agencies.
Data.Census.Gov – This is a good place to start if you are specifically looking for US demographic data.
Data.EU – The European Union’s open data portal contains data from EU organizations and data from member state governments.
Data.gov.uk – Open data sets published by UK government agencies.
World Health Organization Data – Datasets related to global health and well-being.
World Bank Open Data – Datasets related to economic development, international financial markets, social indicators and environmental issues.
Image data
Google Open Images – Millions of images categorized and labeled in various ways. Suitable for training many kinds of computer vision algorithms.
ImageNet Open Dataset – Another dataset consisting of labeled images that can be used free of charge for non-commercial machine learning applications.
COCO Dataset – Common Objects in Context (COCO) is a dataset consisting of over 200,000 images selected for training object detection and captioning algorithms.
Audio data
Mozilla Common Voice – An open dataset of voice recordings that can be used to train AI applications that include voice.
AudioSet – Another dataset curated by Google. It focuses on sound and contains hundreds of thousands of 10-second sound samples sorted into categories such as instruments, vehicles, and vocals.
Million Song Dataset – Audio features and metadata for one million contemporary popular music tracks.
Text data
Wikipedia Database Download – Downloadable copies of Wikipedia articles in various formats.
Common Crawl – An open repository of data collected from the World Wide Web. It has been used to train large language models, including the GPT models that power ChatGPT, as well as many other chatbots.
Miscellaneous and other datasets
Amazon Reviews – A database of approximately 35 million reviews for Amazon products, including product information and ratings.
Waymo Open Dataset – Waymo, Alphabet’s self-driving subsidiary, has made publicly accessible a large amount of data collected by its self-driving cars, including sensor data from cameras and LiDAR.
Apolloscape Dataset – More self-driving data, this time provided by Baidu’s open-source Apollo platform.
