Federated learning is a term coined by Google researchers in 2016 to describe a distributed methodology for training an AI or machine learning (ML) model. Federated learning doesn't apply to a particular type of AI or ML model, nor does it define a specific technical implementation. Rather, federated learning is a conceptual framework for distributing the training of a machine learning model across many endpoints and data sets (versus centralized training on one massive data set) in a way that inherently creates strong data privacy protections, because each participant's raw data never leaves its control.
How Does Federated Learning Work?
In federated learning, many individual devices (e.g., smartphones) or entities, like a bank branch, hospital, or factory, contribute to the training of a "central" ML model, but they do so without sharing any of the data used for training with that central model. Instead, each device or entity downloads the central ML model from a shared repository, then uses its own data locally to train that model incrementally. Once a round of incremental training is completed, the device or entity uploads the now-retrained model to the shared repository. The central model is then updated by aggregating the retrained models uploaded by the individual devices or entities. The devices or entities receive updated versions of the central model as it evolves, and the cycle repeats.
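To make that download/train/upload cycle concrete, here is a minimal sketch in Python using plain NumPy. The linear model, the gradient-descent update, and every name in it are illustrative assumptions rather than a prescribed implementation; the point is that only model weights move between "clients" and "server."

```python
# Minimal simulation of the federated learning cycle described above.
# A linear model stands in for any ML model; only model weights ever
# travel between the "clients" and the "server"; the raw data stays put.
import numpy as np

def local_train(weights, X, y, lr=0.1, epochs=5):
    """Client side: refine the downloaded central model on local data only."""
    w = weights.copy()
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)  # gradient of mean squared error
        w -= lr * grad
    return w  # only the retrained weights are uploaded, never X or y

def federated_round(central_weights, clients):
    """Server side: aggregate the retrained models into a new central model."""
    updates = [local_train(central_weights, X, y) for X, y in clients]
    return np.mean(updates, axis=0)

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
# Each "device" holds its own private data set that is never uploaded.
clients = []
for _ in range(10):
    X = rng.normal(size=(50, 2))
    clients.append((X, X @ true_w + rng.normal(scale=0.1, size=50)))

central = np.zeros(2)
for _ in range(20):  # the download/train/upload cycle repeats round after round
    central = federated_round(central, clients)
print(central)  # converges toward true_w without raw data leaving any client
```

A production deployment would sample a subset of devices each round and weight their contributions, but the defining property is the same: only model weights cross the wire.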
In traditional ML training, end devices or entities upload their data, which is then added to the master data set used to train an ML model. In this scenario, the raw data must be shared with whichever entity is responsible for training the model all devices use. This raises privacy and security concerns and may be illegal for certain use cases, like those involving medical data or financial transaction data.
You could think of federated learning a bit like licensed goods. If you license production of your breakfast cereal "AlgorithmOs" to many client distributors with their own local factories, you probably don't know who's buying from them, what their profit margins are, or how they manage inventory; that's the proprietary business information your clients wish to keep private. However, you may ask that your clients send you production samples on a regular basis, which you then analyze. Using the aggregate data from all clients, you then publish an updated recipe and manufacturing process, resulting in increased quality, efficiency, and customer satisfaction for all clients, without handling or exposing the sensitive business data of any one client to another. In this sense, federated learning produces a better outcome for all users of a given ML model, by aggregating training from all of those users, without exposing any of the sensitive data used for training.
What are Examples of Federated Learning?
One clear use case for federated learning is in medical research, particularly for diagnostic medicine. The identification of a pathology that can lead to the diagnosis of a condition (for example, a cancerous tumor) generally requires a trained physician and possibly the concurring opinion of several doctors. Doctors with skill in diagnosing cancer are, on the grand scale, very rare; there is far more capacity to treat cancers than there is capacity to diagnose the cancers to be treated. Training a machine learning model to identify tumors is therefore a critical concern for scaling the availability of cancer diagnosis. It would require training a model on many, many images of suspected tumors and non-tumors, far more than any single hospital or hospital system is likely to have. Sharing such sensitive medical data outside a healthcare provider's network is ethically and legally challenging. Federated learning offers a potential pathway: by training a model inside a healthcare provider's own IT infrastructure, no patient data ever leaves that infrastructure. The only thing transmitted is the retrained model, which contains no actual data, just the improvements in accuracy and precision the training has yielded. Scaled to a large number of hospitals and other medical providers, this could produce a central model performant enough to diagnose certain cancerous conditions, an outcome that would otherwise be impossible.
Another use case could be fraud detection in credit card transactions. Card issuers and banks must protect the privacy of their transaction data, but if they're able to train a model for fraud detection on their own systems, that transaction data is kept secure. In aggregate, their retraining of a central model could lead to a lower incidence of fraud across all credit card transactions, increasing profitability and (hopefully) customer satisfaction for all vendors. Federated learning, in its best form, enables these kinds of "win-win" outcomes.
Types of Federated Learning
Federated learning itself comes in several forms: there are three main architectures a federated learning system can be built on, each with its own benefits.
In horizontal federation, which is used in the tumor and credit card examples above, many entities train a central model across data sets of the same type to refine that model's performance on a specific workload. Each entity trains the central model on its own data set and sends the refined model up to the central server, where the refined models' changes are averaged into a new version of the central model. This new version is then sent back down to the individual entities, who start the training process over. Horizontal federation is all about getting the model training benefits of a much larger shared data set without actually sharing the data used for training (and without the security and legal concerns that come with such sharing).
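The averaging step is the heart of horizontal federation. A common scheme, weighted federated averaging in the style of Google's FedAvg algorithm, counts each client's update in proportion to how much data it trained on. The sketch below shows just that step; the numbers and names are illustrative.

```python
# Sketch of the server-side averaging step in horizontal federation,
# weighting each client's retrained model by its local data set size
# (the scheme popularized by the FedAvg algorithm).
import numpy as np

def aggregate(client_weights, client_sizes):
    """Combine retrained client models into the next central model."""
    total = sum(client_sizes)
    return sum(w * (n / total) for w, n in zip(client_weights, client_sizes))

# Three client updates of a two-parameter model, trained on unequal data sets.
updates = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([0.5, 0.5])]
sizes = [100, 300, 600]
print(aggregate(updates, sizes))  # [0.4, 0.6]: bigger data sets pull harder
```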
In vertical federation, a machine learning model is trained across two or more data sets that cover the same users but hold different, complementary data types. For example, a vertically federated model could be trained on an online retailer's customer purchase history together with an advertiser's data set of social media and browsing activity, joined on the anonymized IDs of the users the two data sets share. Together, the store and the advertiser could predict which products to advertise to those users while excluding products they've already purchased, without ever exposing any customer data directly to one another. Vertical federation is, in effect, about leveraging complementary data sets with large user overlap to achieve an outcome neither data set could on its own.
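The first step in vertical federation is discovering which users the two data sets share without revealing the rest. Production systems use cryptographic protocols such as private set intersection for this; the Python sketch below only illustrates the overlap-and-join idea with hashed IDs, and every name and record in it is a made-up assumption.

```python
# Illustrative entity alignment for vertical federation. Hashing IDs alone is
# NOT a privacy guarantee (small ID spaces can be brute-forced); real systems
# use private set intersection. This only shows the overlap-and-join idea.
import hashlib

retailer = {"u1": {"purchases": ["shoes"]}, "u2": {"purchases": ["tent"]}}
advertiser = {"u2": {"interests": ["camping"]}, "u3": {"interests": ["golf"]}}

def anonymize(user_id: str) -> str:
    """Both parties hash IDs the same way so matches line up without raw IDs."""
    return hashlib.sha256(user_id.encode()).hexdigest()

retailer_by_hash = {anonymize(uid): uid for uid in retailer}
advertiser_by_hash = {anonymize(uid): uid for uid in advertiser}

# Only overlapping users can contribute joint training examples.
for h in retailer_by_hash.keys() & advertiser_by_hash.keys():
    joint = {**retailer[retailer_by_hash[h]], **advertiser[advertiser_by_hash[h]]}
    print(joint)  # {'purchases': ['tent'], 'interests': ['camping']}
```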
Finally, federated transfer learning (FTL) is a concept for taking a model trained on a large data set and user base and applying it to a second, smaller data set and user base (or even a single user's data). Conceptually, FTL is a good fit when one entity with a large, well-trained model wants to share that model with an entity that has a much smaller data set of a similar type (that is, a data set not large enough to train a model on its own). FTL can be useful in scenarios where training a model to a single individual's data is necessary; for example, training a fitness device to distinguish a user's typical vital signs from abnormal ones. The device needs some kind of "baseline" model to work from, but that model must then be trained against the individual using the device for the best performance.
Horizontal federation would actually work against this outcome: the model retrained on the individual's data would be averaged against the many other retrained models contributing to the central model, resulting in a model still very close to the "baseline" rather than personalized to a specific user. Federated transfer learning also preserves the privacy of both parties: the parent model's data is never exposed, nor is the second party's individualized data.
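Here is a minimal sketch of that personalization step under stated assumptions: a shared linear "baseline" model is fine-tuned on one user's small, local set of readings, and nothing is uploaded back. The heart-rate framing and every number are illustrative, not a real device's model.

```python
# Sketch of federated transfer learning's personalization step: start from a
# shared baseline model, fine-tune locally, and keep the result on the device.
import numpy as np

def fine_tune(baseline, X, y, lr=0.2, epochs=200):
    """Adapt the shared baseline to one user's private data; nothing is uploaded."""
    w = baseline.copy()
    for _ in range(epochs):
        w -= lr * 2 * X.T @ (X @ w - y) / len(y)  # mean-squared-error gradient
    return w

baseline = np.array([60.0, 10.0])  # hypothetical population-level model:
                                   # resting rate plus an activity coefficient
rng = np.random.default_rng(1)
activity = rng.uniform(0, 1, 30)              # 30 readings on this one device
X = np.column_stack([np.ones(30), activity])
y = X @ np.array([72.0, 25.0]) + rng.normal(scale=1.0, size=30)  # this user

personal = fine_tune(baseline, X, y)
print(personal)  # moves from the baseline toward this user's own pattern
```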
Why Isn’t Federated Learning More Popular?
In theory, federated learning would allow many corporations or other large entities to work together on ML model training without exposing data to one another—something of a “win-win” scenario. In practice, however, a number of challenges with federated learning make this outcome difficult to realize.
The biggest issue is trust. Federated learning, and especially horizontal federation, requires all participants in model training to trust that no one is going to "poison" the model with bad data, causing model regression. Malicious intent isn't very likely when every participant starts from a high level of trust (as in our hospital and tumor identification use case); what's harder is establishing the ongoing quality of every participant's data. If one contributor to the training has an extremely skewed or flawed data set, it could poison the model for everyone.
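One common class of mitigations for exactly this risk, though not one the examples above prescribe, is robust aggregation: combining client updates with a statistic that a single bad contributor can't drag arbitrarily far. The sketch below contrasts a plain mean with a coordinate-wise median under that assumption.

```python
# How a single skewed or malicious update affects two aggregation rules.
import numpy as np

honest = [np.array([1.0, 1.1]), np.array([0.9, 1.0]), np.array([1.1, 0.9])]
poisoned = [np.array([100.0, -100.0])]  # one bad contributor
updates = honest + poisoned

print(np.mean(updates, axis=0))    # [25.75, -24.25]: the mean is ruined
print(np.median(updates, axis=0))  # [1.05, 0.95]: median stays near consensus
```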
The next biggest challenge is data sharing. Even sharing the result of model training outside an organization often runs afoul of corporate data privacy policies, requiring lengthy and complex reviews before federated learning can be implemented as a practice. It also gets back to trust: most entities are unlikely to trust other model contributors unless they can examine the quality of the data on which those contributors are training the model. While federated learning by design eliminates the need to transfer the training data itself outside an organization's or entity's control, that doesn't mean other contributors won't want to validate and audit that data as part of the model training process. Getting corporations and other large organizations to agree to such reciprocal data visibility may be possible (whereas outright data sharing may be impossible), but it's still a huge practical barrier to overcome without a major incentive.
These concerns have limited the use of federated learning to date, though horizontal federation has been deployed in some well-known cases, like Google's Gboard keyboard app for Android smartphones. As the need for new training data rapidly grows and publicly available data sets no longer meaningfully improve models, the market for private training data will expand, and federated learning could conceivably become more attractive. Federation offers organizations a path to train on their private data without exposing the data itself, though, as discussed above, significant practical challenges remain in implementation to ensure trust and quality in the outcome.