Federated learning

From Wikipedia, the free encyclopedia

Federated learning designates a set of techniques for training a single machine learning model across multiple decentralized servers holding local data samples, without exchanging those samples. This approach stands in contrast to traditional centralized machine learning techniques, where all data samples are uploaded to one server, as well as to more classical decentralized approaches, which assume that local data samples are identically distributed.

Federated learning enables multiple actors to build a common, robust machine learning model without sharing data samples, thus addressing critical issues such as data privacy, data security, data access rights and access to heterogeneous data. Its applications span a number of industries, including defense, telecommunications, IoT and pharmaceutics.


Federated learning aims at training a machine learning algorithm, for instance deep neural networks, on multiple local datasets contained in local nodes without exchanging data samples. The general principle consists in training local models on local data samples and exchanging parameters (e.g. the weights of a deep neural network) between these local models at some frequency to generate a global model.

Federated learning algorithms may use a central server that orchestrates the different steps of the algorithm and acts as a reference clock, or they may be peer-to-peer, where no such central server exists. In the non-peer-to-peer case, a federated learning process can be broken down into multiple rounds, each consisting of four general steps.

Federated learning general process in central orchestrator setup

The main difference between federated learning and distributed learning lies in the assumptions made about the properties of the local datasets[1]: distributed learning originally aims at parallelizing computing power, whereas federated learning originally aims at training on heterogeneous datasets. While distributed learning also aims at training a single model on multiple servers, a common underlying assumption is that the local datasets are identically distributed and have roughly the same size. Neither of these hypotheses is made for federated learning; instead, the datasets are typically heterogeneous and their sizes may span several orders of magnitude.

Federated learning important features

Iterative learning

To ensure good task performance of a final, central machine learning model, federated learning relies on an iterative process broken up into an atomic set of client-server interactions known as a federated learning round. Each round of this process consists in transmitting the current global model state to participating nodes, training local models on these local nodes to produce a set of potential model updates at each node, and then aggregating and processing these local updates into a single global update and applying it to the global model.

In the methodology below, we use a central server for this aggregation, while local nodes perform local training depending on the central server’s orders. However, other strategies lead to the same results without central servers, in a peer-to-peer approach, using gossip methodologies[2].


A statistical model (e.g., linear regression, neural network, boosting) is chosen to be trained on local nodes and initialized. Nodes are activated and wait for the central server to give calculation tasks.

Iterative training

For multiple iterations of so-called federated learning rounds, the following steps are performed[3]:


A fraction of local nodes are selected to start training on local data. They all acquire the same current statistical model from the central server. Other nodes wait for the next federated round.


The central server orders selected nodes to undergo training of the model on their local data in a pre-specified fashion (e.g. for some batch updates of gradient descent).


Each node returns the locally-learned incremental model updates to the central server. The central server aggregates all results and stores the new model. It also handles failures (e.g., connection lost with a node while training). The system returns to the selection phase.

Federated Learning Protocol (Towards federated learning at scale: system design, Keith Bonawitz, Hubert Eichner et al., 2019)


When a pre-specified termination criterion (e.g. maximal number of rounds or local accuracies higher than some target) has been met, the central server orders the end of the iterative training process. The central server contains a robust model which was trained on multiple heterogeneous data sources.
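The round structure described above (selection, local training, reporting with failure handling, termination) can be sketched as follows. This is only an illustrative toy under stated assumptions: the `Node` class, the scalar model, the plain averaging of updates and the fixed number of rounds are all hypothetical choices made for the example, not part of any specific system.

```python
import random


class Node:
    """Hypothetical local node. Its data never leaves the object."""

    def __init__(self, data, fail=False):
        self.data = data
        self.fail = fail  # simulate a node that loses its connection

    def train(self, model, lr=0.1, steps=5):
        if self.fail:
            raise ConnectionError("node unreachable")
        # Toy local training: a few gradient steps on the squared error
        # between a scalar model and the local samples.
        w = model
        for _ in range(steps):
            grad = sum(2 * (w - x) for x in self.data) / len(self.data)
            w -= lr * grad
        return w - model  # incremental update only, never the raw data


def run_rounds(nodes, model=0.0, fraction=0.5, max_rounds=20, seed=0):
    rng = random.Random(seed)
    for _ in range(max_rounds):            # termination: maximal number of rounds
        m = max(1, int(fraction * len(nodes)))
        selected = rng.sample(nodes, m)    # selection of a fraction of the nodes
        updates = []
        for node in selected:              # local training and reporting
            try:
                updates.append(node.train(model))
            except ConnectionError:
                continue                   # failure handling: skip the lost node
        if updates:                        # aggregation: plain average of updates
            model += sum(updates) / len(updates)
    return model
```

For example, `run_rounds([Node([1.0, 1.0]), Node([3.0, 3.0]), Node([2.0], fail=True)])` runs twenty rounds while silently skipping the unreachable node.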

Algorithmic hyper-parameters

Network topology

The way the statistical local outputs are pooled and the way the nodes communicate with each other can change from the centralized model explained in the previous section. This leads to a variety of federated learning approaches: for instance no central orchestrating server, or stochastic communication[4].

In particular, orchestrator-less distributed networks are one important variation. In this case, there is no central server dispatching queries to local nodes and aggregating local models. Each local node sends its outputs to several randomly selected others[5], which aggregate their results locally. This limits the number of exchanges, thereby sometimes reducing training time and computing cost.
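A minimal sketch of such a server-less exchange is given below. It assumes scalar models, a synchronous round and a hypothetical `fanout` parameter for the number of randomly selected peers; real gossip schemes are typically asynchronous and operate on full weight vectors.

```python
import random


def gossip_round(models, fanout, rng):
    """One synchronous gossip round over scalar models.

    Each node pushes its model to `fanout` randomly selected peers, then
    every node averages its own model with the models it received.
    """
    n = len(models)
    inbox = [[] for _ in range(n)]
    for i, w in enumerate(models):
        peers = rng.sample([j for j in range(n) if j != i], fanout)
        for j in peers:
            inbox[j].append(w)
    # Local aggregation: no central server is involved.
    return [
        (models[i] + sum(inbox[i])) / (1 + len(inbox[i]))
        for i in range(n)
    ]
```

Because every new value is an average of existing ones, the models can only move closer together from one round to the next.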

Federated learning parameters

Once the topology of the node network is chosen, one can control different parameters of the federated learning process (as opposed to the machine learning model’s own hyperparameters) to optimize learning:

  • Number of federated learning rounds: T
  • Total number of nodes used in the process: K
  • Fraction of nodes used at each iteration: C
  • Local batch size used at each learning iteration: B

Other model-dependent parameters can also be tuned, such as:

  • Number of iterations for local training before pooling: N
  • Local learning rate: η

Those parameters have to be optimized depending on the constraints of the machine learning application (e.g., available computing power, available memory, bandwidth). For instance, stochastically choosing a limited fraction C of nodes for each iteration diminishes computing cost and may prevent overfitting, in the same way that stochastic gradient descent can reduce overfitting.
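As an illustration, these parameters can be grouped in a small configuration object, with the fraction C driving the random selection of nodes at each round. The class name, the default values and the helper `select_nodes` are hypothetical and chosen only for this sketch.

```python
import random
from dataclasses import dataclass


@dataclass
class FederatedConfig:
    # Default values are arbitrary placeholders, not recommendations.
    T: int = 100       # number of federated learning rounds
    K: int = 50        # total number of nodes used in the process
    C: float = 0.1     # fraction of nodes used at each iteration
    B: int = 32        # local batch size
    N: int = 5         # local training iterations before pooling
    eta: float = 0.01  # local learning rate


def select_nodes(cfg, rng):
    """Randomly pick a fraction C of the K nodes (at least one)."""
    m = max(1, round(cfg.C * cfg.K))
    return rng.sample(range(cfg.K), m)
```

With `K=50` and `C=0.1`, five distinct node indices are drawn per round, so only a tenth of the nodes spend computing power on any given iteration.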

Federated learning variations

In this section, we follow the exposition of Communication-Efficient Learning of Deep Networks from Decentralized Data, H. Brendan McMahan et al., 2017.

To describe the federated strategies, let us introduce some notation:

  • n_k: number of data samples available during training for client k;
  • w_t^k: model’s weight vector on client k, at the federated round t;
  • l(w, b): loss function for weights w and batch b;
  • K: total number of clients;
  • k: index of clients;
  • E: number of local epochs.

Federated Stochastic Gradient Descent (FedSGD)

Deep learning training mainly relies on variants of stochastic gradient descent, where gradients are computed on a random subset of the total dataset and then used to make one step of the gradient descent.

Federated stochastic gradient descent[6] is the direct transposition of this algorithm to the federated setting, using a random fraction C of the nodes and all the data on each selected node. The gradients are averaged by the server proportionally to the number of training samples on each node, and used to make a gradient descent step.
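The server-side step can be sketched as below, assuming that each selected node has already computed its gradient on all of its local data. The function name and the representation of weights and gradients as plain lists of floats are choices made for this example.

```python
def fedsgd_step(w, client_grads, client_sizes, lr):
    """One FedSGD step on the server.

    w            : current global weights (list of floats)
    client_grads : one gradient vector per selected client
    client_sizes : number of training samples n_k of each selected client
    lr           : learning rate
    """
    n = sum(client_sizes)
    # Average the gradients proportionally to the number of samples per client.
    avg_grad = [
        sum(n_k * g[d] for g, n_k in zip(client_grads, client_sizes)) / n
        for d in range(len(w))
    ]
    # Single gradient descent step on the global model.
    return [w_d - lr * g_d for w_d, g_d in zip(w, avg_grad)]
```

A client holding three times as many samples as another therefore weighs three times as much in the averaged gradient.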

Federated averaging

Federated averaging (FedAvg)[7] is a generalization of FedSGD that allows local nodes to perform more than one batch update on local data and exchanges the updated weights rather than the gradients. The rationale behind this generalization is that in FedSGD, if all local nodes start from the same initialization, averaging the gradients is strictly equivalent to averaging the weights themselves. Further, averaging tuned weights coming from the same initialization does not necessarily hurt the resulting averaged model’s performance.
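The equivalence for a single local step can be checked directly. In the derivation below, g_k denotes the gradient computed by client k at the shared weights w_t, η is the local learning rate, and n is the total number of samples over the participating clients (n and g_k are notations introduced here for the example; the sum is taken over all K clients for simplicity).

```latex
w_{t+1}^k = w_t - \eta\, g_k
\quad\Longrightarrow\quad
\sum_{k=1}^{K} \frac{n_k}{n}\, w_{t+1}^k
= w_t \sum_{k=1}^{K} \frac{n_k}{n} - \eta \sum_{k=1}^{K} \frac{n_k}{n}\, g_k
= w_t - \eta \sum_{k=1}^{K} \frac{n_k}{n}\, g_k ,
\qquad \text{since } \sum_{k=1}^{K} \frac{n_k}{n} = 1 .
```

The right-hand side is exactly the FedSGD update, so the two procedures coincide as long as each client performs one step from the same starting point. With several local steps the equality no longer holds in general.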

The algorithm can be written as follows:

Federated Averaging algorithm (Communication-Efficient Learning of Deep Networks from Decentralized Data, H. Brendan McMahan et al., 2017)
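A minimal Python sketch in the spirit of this algorithm is given here; it is not a reproduction of the original figure. It assumes a one-parameter linear model y ≈ w·x trained with a squared loss, and it normalizes the weighted average over the clients selected in the round. The function names and the toy model are assumptions made for the example.

```python
import random


def client_update(w, data, epochs, batch_size, lr, rng):
    """Run E local epochs of mini-batch SGD on one client's (x, y) pairs."""
    data = list(data)
    for _ in range(epochs):
        rng.shuffle(data)
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Gradient of the mean squared error of y ≈ w * x on the batch.
            grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
            w -= lr * grad
    return w


def fedavg(clients, rounds, fraction, epochs, batch_size, lr, seed=0):
    """clients: list of local datasets, each a list of (x, y) pairs."""
    rng = random.Random(seed)
    w = 0.0  # global model, shared initialization
    for _ in range(rounds):
        m = max(1, int(fraction * len(clients)))
        selected = rng.sample(clients, m)
        n = sum(len(c) for c in selected)
        # Each selected client trains locally from the current global weights;
        # the server averages the returned weights, weighted by n_k / n.
        w = sum(
            len(c) / n * client_update(w, c, epochs, batch_size, lr, rng)
            for c in selected
        )
    return w
```

Setting `epochs=1` and a batch size equal to the full local dataset gives each client a single full-batch step, which corresponds to the FedSGD case.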

Technical limitations

Federated learning requires frequent communication between nodes during the learning process. Thus, it requires not only enough local computing power and memory, but also high-bandwidth connections to be able to exchange parameters of the machine learning model. However, the technology also avoids transferring the data itself, which can require significant resources before centralized machine learning can start.

Federated learning raises several statistical challenges:

  • Heterogeneity between the different local datasets: each node may have some bias with respect to the general population, and the size of the datasets may vary significantly;
  • Temporal heterogeneity: each local dataset’s distribution may vary with time;
  • Each node’s dataset may require regular curation.

Properties of federated learning

Privacy by design

The main advantage of using federated approaches to machine learning is to ensure data privacy or data secrecy. Indeed, no local data is uploaded externally, concatenated or exchanged. Since the entire database is segmented into local pieces, it is more difficult to hack into.

With federated learning, only machine learning parameters are exchanged. In addition, such parameters can be encrypted before sharing between learning rounds to extend privacy. Despite such protective measures, these parameters may still leak information about the underlying data samples, for instance when an attacker makes multiple targeted queries on specific datasets. The querying capability of nodes is thus a major point of attention, which can be addressed using differential privacy or secure aggregation[8].
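The idea behind secure aggregation can be illustrated with pairwise masks that cancel out in the sum, so that the server learns the aggregate but not the individual contributions. The sketch below is a deliberately simplified toy and not the protocol of [8]: it assumes integer-valued updates, masks agreed in advance between each pair of nodes (simulated here with a single seeded generator), and no dropouts. Practical schemes work in modular arithmetic and handle key agreement and node failures.

```python
import random


def masked_updates(updates, seed=0):
    """Hide individual integer updates behind pairwise masks.

    For every pair (i, j) with i < j, node i adds a random mask r and
    node j subtracts the same r, so all masks cancel in the total.
    """
    rng = random.Random(seed)  # stands in for secrets shared by each pair
    masked = list(updates)
    for i in range(len(updates)):
        for j in range(i + 1, len(updates)):
            r = rng.randint(-10**6, 10**6)
            masked[i] += r
            masked[j] -= r
    return masked
```

The server only ever sees the masked values, yet `sum(masked_updates(u))` equals `sum(u)`, which is all it needs in order to average the updates.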


The generated model delivers insights based on the global patterns of nodes. However, if a participating node wishes to learn from global patterns but also adapt outcomes to its peculiar status, the federated learning methodology can be adapted to generate two models at once in a multi-task learning framework.

In the case of deep neural networks, it is possible to share some layers across the different nodes and keep others on each local node. Typically, the first layers, which perform general pattern recognition, are shared and trained on all datasets. The last layers remain on each local node and are only trained on the local node’s dataset.
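A minimal sketch of this layer-wise sharing is shown below. It assumes each node’s model is a dictionary mapping a layer name to a flat list of weights, and that shared layers are combined by a plain average; the layer names used in the usage note are hypothetical.

```python
def aggregate_shared_layers(node_models, shared_names):
    """Average the shared layers across nodes and keep the others local.

    node_models  : list of dicts, layer name -> list of float weights
    shared_names : names of the layers that are pooled across nodes
    """
    n = len(node_models)
    averaged = {}
    for name in shared_names:
        layers = [model[name] for model in node_models]
        averaged[name] = [sum(vals) / n for vals in zip(*layers)]
    # Each node receives a copy of the averaged shared layers and keeps
    # its own personal layers untouched.
    return [
        {**model, **{name: list(ws) for name, ws in averaged.items()}}
        for model in node_models
    ]
```

For instance, with models of the form `{"features": [...], "head": [...]}`, calling `aggregate_shared_layers(models, ["features"])` pools the feature layers while every node’s `"head"` stays specific to its own dataset.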

Legal upsides of federated learning

Western legal frameworks increasingly emphasize data protection and data traceability. The 2012 White House report[9] recommended the application of a data minimization principle, which is also mentioned in the European GDPR[10]. In some cases, it is impossible to transfer data from one country to another (e.g., genomic data); however, international consortia are sometimes necessary for scientific advances. In such cases, federated learning offers a way to train a global model while respecting security constraints.

Current research topics

Federated learning started to emerge as an important research topic in 2015[11] and 2016[12], with the first publications on federated averaging in telecommunication settings. Recent publications have emphasized the development of resource allocation strategies, especially to reduce communication[13] requirements[14] between nodes with gossip algorithms[15]. In addition, recent publications continue to work on the robustness of federated algorithms to privacy attacks, notably through differential privacy[16].

Use cases

Federated learning typically applies when individual actors need to train models on larger datasets than their own, but cannot afford to share the data itself with others (e.g., for legal, strategic or economic reasons). The technology nevertheless requires good connections between local servers and a minimum of computational power on each node.

Telecom: Predictive keyboard (Google’s G-board)

One of the historical use cases of federated learning has been implemented by Google[3] for predictive keyboards. Under high regulatory pressure, it proved impossible to upload every user’s text messages to train the predictive algorithm for word guessing. Besides, such a process would take up too much of the user’s data. Despite the sometimes limited memory and computing power of smartphones, Google has made a compelling use case out of its G-board, as presented during the Google I/O 2019 event.

Healthcare: Federated datasets from hospitals

Pharmaceutical research is pivoting towards a new paradigm: the use of real-world data for generating drug leads and synthetic control arms. Generating knowledge on complex biological problems requires gathering a lot of data from diverse medical institutions, which are eager to maintain control of their sensitive patient data. Federated learning, especially when assisted by high-traceability technologies (distributed ledgers), enables researchers to train predictive models on large amounts of sensitive data in a transparent way without uploading them. As of 2019, the French start-up Owkin is pioneering the development of biomedical machine learning models based on such algorithms to capture heterogeneous data from both pharmaceutical companies and medical institutions.

Transport industry: Self-driving cars

Self-driving cars rely on many machine learning technologies to function: computer vision for analyzing obstacles, and machine learning for adapting their pace to the environment (e.g., bumpiness of the road). Due to the potentially high number of self-driving cars and the need for them to respond quickly to real-world situations, a traditional cloud approach may generate safety risks. Federated learning can be a solution for limiting the volume of data transfer and accelerating learning processes.


  1. Federated Optimization: Distributed Optimization Beyond the Datacenter, Jakub Konečný, H. Brendan McMahan, Daniel Ramage, 2015
  2. Decentralized Collaborative Learning of Personalized Models over Networks, Paul Vanhaesebrouck, Aurélien Bellet, Marc Tommasi, 2017
  3. Towards federated learning at scale: system design, Keith Bonawitz, Hubert Eichner et al., 2019
  4. Collaborative Deep Learning in Fixed Topology Networks, Zhanhong Jiang, Aditya Balu, Chinmay Hegde, Soumik Sarkar, 2017
  5. GossipGraD: Scalable Deep Learning using Gossip Communication based Asynchronous Gradient Descent, Jeff Daily, Abhinav Vishnu, Charles Siegel, Thomas Warfel, Vinay Amatya, 2018
  6. Communication-Efficient Learning of Deep Networks from Decentralized Data, H. Brendan McMahan et al., 2017
  7. Communication-Efficient Learning of Deep Networks from Decentralized Data, H. Brendan McMahan et al., 2017
  8. Practical Secure Aggregation for Privacy Preserving Machine Learning, Keith Bonawitz, 2018
  9. Consumer data privacy in a networked world: A framework for protecting privacy and promoting innovation in the global digital economy, Journal of Privacy and Confidentiality, 2013
  10. Recital 39 of the Regulation (EU) 2016/679 (General Data Protection Regulation)
  11. Federated Optimization: Distributed Optimization Beyond the Datacenter, Jakub Konečný, H. Brendan McMahan, Daniel Ramage, 2015
  12. Federated Optimization: Distributed Machine Learning for On-Device Intelligence, Jakub Konečný et al., 2016
  13. Communication-Efficient Learning of Deep Networks from Decentralized Data, H. Brendan McMahan et al., 2017
  14. Federated Learning: Strategies for Improving Communication Efficiency, Jakub Konečný, H. Brendan McMahan et al., 2016
  15. Gossip training for deep learning, Michael Blot et al., 2017
  16. Differentially Private Federated Learning: A Client Level Perspective, Robin C. Geyer et al., 2018