Actionable data provides a path forward for AI

You may have heard the buzzword “big data” in the context of AI, but what about small data? Whether you realize it or not, small data is all around you, influencing online shopping experiences, airline recommendations, weather forecasts, and more. Small data is data that is accessible, actionable, and easily understood by humans. Data scientists often leverage small data to analyze current situations. The growth of small data in machine learning (ML) can generally be attributed to increased data availability and experimentation with new data mining techniques. As the AI industry evolves, data scientists are increasingly turning to small data because it requires less computing power and is easier to use.

Small data and big data

How exactly is big data different from small data? Big data consists of large volumes of both structured and unstructured data. Given its size, it is much more difficult to understand and analyze than small data, and it requires more computing power to interpret. Small data allows businesses to gain actionable insights without the complex algorithms required for big data analysis. As a result, companies do not need to invest as much in data mining processes.

Big data can be transformed into small data by applying algorithms that break it into small, actionable chunks, each representing a component of the larger dataset. One example is social media monitoring during a brand launch. An enormous number of social media posts is created at any given time, so data scientists must filter for the data they need by platform, time period, keywords, and other relevant features. This process transforms big data into smaller, more manageable chunks from which insights can be extracted.
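The filtering step in the brand launch example can be sketched in a few lines of Python. This is only an illustration: the `Post` record, the `filter_posts` helper, and the sample posts are hypothetical and stand in for whatever a real monitoring pipeline would provide.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Post:
    platform: str
    created_at: datetime
    text: str


def filter_posts(posts, platform, start, end, keywords):
    """Keep posts from one platform, inside [start, end], that mention any keyword."""
    lowered = [k.lower() for k in keywords]
    return [
        p for p in posts
        if p.platform == platform
        and start <= p.created_at <= end
        and any(k in p.text.lower() for k in lowered)
    ]


# Hypothetical sample data standing in for a large social media stream.
posts = [
    Post("twitter", datetime(2023, 5, 1, 9, 0), "Loving the new AcmePhone launch!"),
    Post("twitter", datetime(2023, 4, 1, 9, 0), "AcmePhone rumors are everywhere"),
    Post("instagram", datetime(2023, 5, 1, 10, 0), "AcmePhone unboxing"),
    Post("twitter", datetime(2023, 5, 2, 8, 30), "Weather is nice today"),
]

launch_window = filter_posts(
    posts,
    platform="twitter",
    start=datetime(2023, 5, 1),
    end=datetime(2023, 5, 7),
    keywords=["acmephone"],
)
print(len(launch_window))
```

Only the posts that match the platform, the launch window, and a keyword survive, which is the "small data" a team would then analyze.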

Advantages of small data

We have already hinted at the advantages of using small data compared to big data, but a few are worth highlighting.

Big data is harder to manage. Using big data at scale is labor-intensive, and analysis requires significant computing power.

Small data is easier to work with. Small chunks of data can be analyzed efficiently without much time and effort, which makes small data more actionable than big data.

Small data exists everywhere. It is already widely used in many industries. For example, social media provides a large amount of actionable data that can be used for purposes such as marketing.

Small data focuses on the end user. With small data, researchers can target end users and their needs first. Small data helps explain the reasons behind end-user behavior.

For many use cases, small data is a fast and efficient approach to analysis that can help you gain powerful insights about your customers across industries.

Approaching small data in ML

In supervised learning, the most traditional machine learning method, models are trained on large amounts of labeled training data. However, there are many other methods for training models, many of which are becoming more popular because they save cost and time. These techniques often rely on small data, where data quality is paramount. Data scientists use small data when a model requires only a small amount of data, or when not enough data is available to train it conventionally. In such cases, a data scientist can use one of the following ML techniques:

Few-shot learning

With few-shot learning, data scientists provide an ML model with only a small amount of training data. This approach is common in computer vision, where the model may not need many samples to identify an object. For example, if you have a facial recognition algorithm that unlocks your smartphone, you don't need to take thousands of photos of yourself to enable it. A few photos are enough to set up the security feature. This technique is low-cost and low-effort, making it attractive when you don't have enough data to train a model with fully supervised learning.
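One simple way to picture few-shot classification is a nearest-prototype classifier: average the handful of labeled examples per class, then assign a new sample to the closest average. The sketch below assumes the faces have already been turned into small numeric feature vectors; the 2-D vectors, labels, and function names are hypothetical, and a real system would rely on learned embeddings rather than hand-written numbers.

```python
import math


def centroid(vectors):
    """Average a list of equal-length feature vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def build_prototypes(support_set):
    """Map each label to the centroid of its few labeled examples."""
    return {label: centroid(examples) for label, examples in support_set.items()}


def classify(prototypes, query):
    """Return the label whose prototype is closest to the query vector."""
    return min(prototypes, key=lambda label: math.dist(prototypes[label], query))


# Hypothetical 2-D feature vectors: only three examples per class.
support_set = {
    "owner": [[0.9, 0.1], [1.0, 0.2], [0.8, 0.0]],
    "stranger": [[0.1, 0.9], [0.0, 1.0], [0.2, 0.8]],
}
prototypes = build_prototypes(support_set)
print(classify(prototypes, [0.85, 0.15]))
```

Three examples per class are enough here because the classes are well separated in feature space, which is the condition under which few-shot approaches tend to work.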

Knowledge graphs

Knowledge graphs are secondary datasets because they are formed by filtering an original, larger dataset. They consist of a set of data points or labels that have a defined meaning and describe a particular domain. For example, a knowledge graph can include data points (nodes) for the names of famous actresses and lines (known as edges) that connect each actress to the other actresses with whom she has worked. Knowledge graphs are extremely useful tools for organizing knowledge in a highly explainable and reusable way.
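The actress example maps naturally onto a small graph structure. The following is a minimal sketch, assuming an undirected "worked_with" relation; the `KnowledgeGraph` class and the placeholder names are hypothetical and not taken from any particular graph library.

```python
from collections import defaultdict


class KnowledgeGraph:
    """Undirected graph: nodes are entities, edges carry a relation label."""

    def __init__(self):
        self.edges = defaultdict(set)

    def add_edge(self, subject, relation, obj):
        # Store the edge in both directions because "worked_with" is symmetric.
        self.edges[subject].add((relation, obj))
        self.edges[obj].add((relation, subject))

    def neighbors(self, node, relation):
        """Return the sorted entities linked to node by the given relation."""
        return sorted(o for r, o in self.edges.get(node, ()) if r == relation)


graph = KnowledgeGraph()
graph.add_edge("Actress A", "worked_with", "Actress B")
graph.add_edge("Actress A", "worked_with", "Actress C")
graph.add_edge("Actress B", "worked_with", "Actress D")

print(graph.neighbors("Actress A", "worked_with"))
```

Because every edge has an explicit label, a query such as "who has Actress A worked with?" can be answered and explained by pointing at the edges themselves.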

Transfer learning

Transfer learning uses an ML model trained for one task as the starting point for another model that needs to perform a related task. In essence, knowledge is transferred from one model to another. Starting with the original model, you use additional data to further train it to handle the new task. You can also remove components of the original model if they are not needed for the new task. Transfer learning is particularly useful in fields such as natural language processing and computer vision, where training from scratch requires large amounts of computational power and data. Where this method is feasible, it is a shortcut to achieving results with less effort.
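A common form of transfer learning keeps the original model's layers frozen and trains only a small new "head" on the new task. The toy sketch below illustrates that structure under strong assumptions: `base_features` uses fixed, hand-picked weights that merely stand in for a model trained earlier, and the four training samples are hypothetical. It is not a substitute for fine-tuning a real pretrained network.

```python
def base_features(x):
    """Frozen 'pretrained' layer. The fixed weights are an assumption that
    stands in for knowledge learned on an earlier, related task."""
    frozen_weights = [[1.0, -1.0], [0.5, 0.5]]
    return [max(0.0, sum(w * v for w, v in zip(row, x))) for row in frozen_weights]


def score(weights, bias, features):
    return sum(w * f for w, f in zip(weights, features)) + bias


def train_head(samples, labels, epochs=20, lr=0.1):
    """Train a perceptron head on top of the frozen features.
    Only the head's weights and bias change; base_features is never updated."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            features = base_features(x)
            pred = 1 if score(weights, bias, features) > 0 else 0
            error = y - pred
            weights = [w + lr * error * f for w, f in zip(weights, features)]
            bias += lr * error
    return weights, bias


def predict(head, x):
    weights, bias = head
    return 1 if score(weights, bias, base_features(x)) > 0 else 0


# Hypothetical new task with only four labeled samples.
samples = [[2, 0], [3, 1], [0, 2], [1, 3]]
labels = [1, 1, 0, 0]
head = train_head(samples, labels)
print(predict(head, [5, 1]), predict(head, [1, 4]))
```

The design point is that the new task needs very little data because most of the work is done by features that already exist; only the last step is learned.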

Self-supervised learning

The idea behind self-supervised learning is that the model derives its supervisory signals from the available data itself. The model uses the data it can observe to predict data that is unobserved or hidden. For example, in natural language processing, a data scientist can feed a model a sentence with missing words and have the model predict which words are missing. Given enough context clues from the visible words, the model learns to identify the hidden ones.
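The missing-word example can be sketched with a deliberately tiny model that counts which word appears between a given left and right neighbor. Everything here is an illustrative assumption: the three-sentence corpus, the `[MASK]` token, and the counting approach, which is far simpler than the neural models used in practice. What it does show is that the training labels come from the text itself, with no human annotation.

```python
from collections import Counter, defaultdict

MASK = "[MASK]"


def context_of(words, i):
    """Return the (left, right) neighbors of position i, with boundary markers."""
    left = words[i - 1] if i > 0 else "<s>"
    right = words[i + 1] if i < len(words) - 1 else "</s>"
    return left, right


def make_masked_examples(sentence):
    """Turn one unlabeled sentence into (context, target) pairs by hiding each word."""
    words = sentence.split()
    return [(context_of(words, i), target) for i, target in enumerate(words)]


def train(corpus):
    """Count which word fills each context; the labels come from the text itself."""
    model = defaultdict(Counter)
    for sentence in corpus:
        for context, target in make_masked_examples(sentence):
            model[context][target] += 1
    return model


def fill_mask(model, sentence):
    """Predict the word hidden behind MASK, or None if the context is unseen."""
    words = sentence.split()
    candidates = model.get(context_of(words, words.index(MASK)))
    return candidates.most_common(1)[0][0] if candidates else None


# Hypothetical unlabeled corpus (lowercase, whitespace-tokenized).
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "a cat sat on a chair",
]
model = train(corpus)
print(fill_mask(model, "the cat [MASK] on the sofa"))
```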

Synthetic data

Synthetic data can be leveraged when a particular dataset has gaps that are difficult to fill with existing data. A common example is facial recognition. These models require facial image data that covers the full range of human skin tones. The problem is that images of people with darker skin tones are rarer in available datasets than images of people with lighter skin tones. Rather than creating models that have difficulty identifying dark-skinned people, data scientists can artificially create data for darker-skinned faces to achieve equal representation. However, machine learning professionals should test these models more thoroughly in the real world and plan to add more training data when computer-generated datasets prove insufficient.

The approaches listed here are not exhaustive, but they give a promising overview. Machine learning is moving in many directions. Data scientists are generally moving away from purely supervised learning and instead experimenting with approaches that rely on small data.
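As a concrete illustration of the synthetic data technique described above, the sketch below balances an under-represented class by interpolating between pairs of real samples. This is a simplified stand-in: it works on hypothetical numeric feature vectors rather than images, the `synthesize` helper is an assumption for illustration, and generating realistic face images would require a dedicated generative model.

```python
import random


def synthesize(minority, n_new, seed=0):
    """Create n_new points by interpolating between random pairs of real samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(minority, 2)
        t = rng.random()  # position along the line segment from a to b
        synthetic.append([x + t * (y - x) for x, y in zip(a, b)])
    return synthetic


# Hypothetical feature vectors: six majority samples, only two minority samples.
majority = [[0.1, 0.2], [0.2, 0.1], [0.15, 0.25], [0.3, 0.2], [0.25, 0.1], [0.2, 0.3]]
minority = [[0.8, 0.9], [0.9, 0.8]]

new_points = synthesize(minority, len(majority) - len(minority))
balanced_minority = minority + new_points
print(len(majority), len(balanced_minority))
```

Every synthetic point lies between two real ones, so the added data can never cover cases the real samples do not span. That limitation is one reason real-world testing and additional real data remain necessary.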

Expert insights from Rahul Parundekar, Director of Data Science

It's important to clarify that by “small” data we don't mean a small amount of data. We mean the right kind of data needed to create models that generate business insights or automate decisions. I often see people who have been over-promised on what AI can deliver share a few images and expect a production-quality model, but that's not what we're talking about here. We're talking about finding the best data to create a model that gives you the output you need when you actually deploy it. There are several things to keep in mind when creating a “small” dataset.

Data relevance

Make conscious choices about what data you include in your dataset. Make sure it contains only the type of data the model will see when it is actually in use (i.e., in production). For example, if you are performing defect detection on a manufacturing conveyor line that carries one type of part at a time, your dataset should consist of images taken by the camera attached to that line: the part without defects, the part with defects, and the empty conveyor.

Data diversity and repetition

It is important to cover all the different cases the model will actually encounter, and to maintain a good balance of diversity within those cases. Avoid overcrowding the dataset with data that is already covered. In the defect detection example, this means reliably capturing objects without defects, objects with different types of defects, different lighting conditions on the factory floor, different rotations and positions on the belt, and even a few examples of maintenance mode. One defect-free product looks much like any other, so there is no need to overfill the dataset with them. Another example of unnecessary repetition is video frames with little or no change.
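Removing video frames with little or no change can be sketched as a simple filter that keeps a frame only when it differs enough from the last frame kept. The frames below are hypothetical flattened grayscale pixel lists, and the threshold of 5.0 is an arbitrary assumption that would need tuning on real footage.

```python
def mean_abs_diff(frame_a, frame_b):
    """Average absolute per-pixel difference between two equal-sized frames."""
    return sum(abs(a - b) for a, b in zip(frame_a, frame_b)) / len(frame_a)


def drop_near_duplicates(frames, threshold=5.0):
    """Keep a frame only if it differs enough from the last frame that was kept."""
    kept = []
    for frame in frames:
        if not kept or mean_abs_diff(kept[-1], frame) >= threshold:
            kept.append(frame)
    return kept


# Hypothetical 4-pixel grayscale frames from a conveyor camera.
frames = [
    [10, 10, 10, 10],    # empty belt
    [10, 11, 10, 10],    # almost identical
    [10, 10, 12, 10],    # almost identical
    [200, 180, 10, 10],  # a part enters the view
    [201, 180, 10, 10],  # almost identical to the previous frame
]
kept = drop_near_duplicates(frames)
print(len(kept))
```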

Build with robust technology

The small data approaches described above are a great place to start. You may benefit from applying transfer learning to a model from a similar domain that you have already trained with good results, then fine-tuning it on your smaller dataset. In the defect detection example, a good starting point would be another previously trained defect detection model rather than a model trained on the MS COCO dataset, which covers a very different use case from defect detection on a conveyor line.

Data-centric AI vs. model-centric AI

The latest learnings from the AI industry show that finding the right data to train on has a much greater impact on model performance than changes to the model itself. You can gain more by finding edge cases and variations than by trying multiple hyperparameters or different model architectures, or by simply assuming that a competent data scientist will “figure out” how to get good results. If your defect detection model is having trouble detecting a particular type of defect, invest in acquiring more images of that type instead of trying a different model architecture or more hyperparameter tuning.
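Finding the defect type that needs more data usually starts with a per-class breakdown of the model's errors. The sketch below computes recall per class from hypothetical ground-truth and predicted labels; the label names and the `recall_by_class` helper are assumptions for illustration.

```python
from collections import Counter


def recall_by_class(y_true, y_pred):
    """For each true class, the fraction of its samples the model got right."""
    total = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / total[label] for label in total}


# Hypothetical evaluation results from a defect detection model.
y_true = ["scratch", "scratch", "dent", "dent", "dent", "crack", "crack", "none", "none"]
y_pred = ["scratch", "scratch", "dent", "dent", "none", "none", "none", "none", "none"]

recalls = recall_by_class(y_true, y_pred)
weakest = min(recalls, key=recalls.get)
print(weakest, recalls[weakest])
```

In this made-up example the model misses every crack, which points to collecting more crack images before touching the architecture or hyperparameters.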

Work with training data experts

Data-centric AI also means focusing debugging efforts on the data, where domain experts excel, rather than on the model, where data scientists excel. Work with domain experts to identify patterns in the cases where your model fails and to hypothesize why it fails. This will help you determine which data you need to acquire. For example, an engineering expert who specializes in object defects can help prioritize the right data for a model, clean out the noisy or unnecessary data mentioned above, or point out nuances that help the data scientist choose a better model architecture.

Think of small data as being more “dense” than big data. You want the highest-quality data in the smallest possible dataset, so that you stay cost-effective and can readily create a “champion” model with any of the above approaches.

What we can do for you

Appen provides data collection and annotation services on its platform to improve machine learning at scale. As a global leader in our field, we give clients the ability to quickly obtain large amounts of high-quality data across multiple data types, including images, video, voice, audio, and text, tailored to their specific AI program needs. We offer multiple data solutions and services to best fit your needs. With over 25 years of expertise, we work with you to optimize the efficiency of your data pipeline. If you would like to discuss your training data needs, please contact us.
