Nirupam Gupta
{first_name}.{last_name}@georgetown.edu

Home Page

Google Scholar profile

Fault-Tolerance in Distributed Optimization and Machine Learning

The problem of distributed (or collaborative) optimization in multi-agent systems has gained significant attention in recent years. In this problem, there are multiple agents in the system, and each agent knows its own local cost (or loss) function. In the fault-free setting, all the agents are non-faulty (or honest), and the goal is to design a distributed algorithm that enables the agents to compute a minimum of the aggregate of their local cost functions.
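
To make the fault-free problem precise, suppose there are n agents and agent i knows a local cost function f_i. The goal is then to compute a common minimizer of the aggregate cost. The notation below is our own shorthand, chosen only for illustration:

```latex
% Fault-free distributed optimization: n agents, agent i holds a local
% cost function f_i : R^d -> R. The agents jointly seek a minimizer of
% the aggregate cost (notation chosen here for illustration).
\[
    x^{*} \in \arg\min_{x \in \mathbb{R}^{d}} \; \sum_{i=1}^{n} f_{i}(x)
\]
```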

As a simple example, the local cost function of an agent (which may be a robot or a person) could be the cost of travelling to a desired location from its current location. In that case, the minimum of the aggregate of the agents' cost functions is a location that minimizes the total travel cost over all the agents. Such multi-agent distributed optimization is of interest in many practical applications, including distributed (or federated) machine learning, swarm robotics, and distributed control systems such as SCADA. Most prior work assumes that all the agents are non-faulty, i.e., that they follow the specified algorithm correctly. In practice, however, some of the agents in the system may be faulty and may behave incorrectly.
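
As a rough numerical illustration of the travel example, suppose each agent's local cost is the squared Euclidean distance from its current position to a common meeting point (this quadratic cost model is an assumption made here for simplicity, not taken from the papers listed further down). Under that assumption, the aggregate minimizer is simply the centroid of the agents' positions:

```python
import numpy as np

# Illustrative (assumed) cost model: agent i pays the squared Euclidean
# distance from its current position p_i to a common meeting point x.
positions = np.array([[0.0, 0.0],
                      [4.0, 0.0],
                      [2.0, 6.0]])   # current positions of three agents

def aggregate_cost(x, positions):
    """Sum of the agents' local travel costs at a candidate meeting point x."""
    return np.sum(np.linalg.norm(positions - x, axis=1) ** 2)

# For squared-distance costs, the minimizer of the aggregate cost is the centroid.
x_star = positions.mean(axis=0)
print(x_star, aggregate_cost(x_star, positions))
```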

We consider, in particular, a scenario where some of the agents in the system are Byzantine faulty. The concept of Byzantine faults was introduced by Leslie Lamport and colleagues in their seminal 1982 work on the Byzantine Generals Problem. Byzantine faulty agents may send incorrect and inconsistent information in order to bias the output of a distributed optimization algorithm, and they may collude to maximize the damage to the system. For example, consider an application of distributed optimization to distributed sensing, where agents acting as sensors observe a common object in order to collaboratively identify it; the faulty agents may send arbitrary observations concocted to prevent the honest agents from making the correct identification. Similarly, in distributed machine learning, which is another application of distributed optimization, the faulty agents may send incorrect information, based on mislabelled or arbitrarily concocted data points, to prevent the honest agents from learning a good classifier.
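
To make the distributed-learning threat concrete, the sketch below shows a norm-based gradient filter in the spirit of the Comparative Gradient Elimination (CGE) rule studied in the first paper listed below: the aggregator sorts the received gradients by norm and averages only the smallest ones, so that a bounded number of arbitrarily large Byzantine gradients cannot dominate the update. This is a minimal sketch under our own simplifying assumptions (a single aggregation step, a known bound on the number of faulty agents), not a faithful reproduction of the algorithm or analysis in the paper.

```python
import numpy as np

def norm_filtered_aggregate(gradients, num_faulty):
    """Norm-based filtering sketch: drop the num_faulty gradients with the
    largest Euclidean norms and average the rest. A simplified illustration
    of a comparative gradient-elimination style rule, not the exact method
    from the paper listed below."""
    gradients = np.asarray(gradients)
    norms = np.linalg.norm(gradients, axis=1)
    keep = np.argsort(norms)[: len(gradients) - num_faulty]
    return gradients[keep].mean(axis=0)

# Example: four honest gradients plus one Byzantine gradient with a huge norm.
honest = [np.array([1.0, 1.0]) for _ in range(4)]
byzantine = [np.array([1e6, -1e6])]
print(norm_filtered_aggregate(honest + byzantine, num_faulty=1))  # ~[1., 1.]
```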

The following is a list of our preprints and publications.

  1. Byzantine Fault-Tolerant Distributed Machine Learning Using Stochastic Gradient Descent (SGD) and Norm-Based Comparative Gradient Elimination (CGE) (arXiv, Aug'20), co-authored with Shuo Liu and Nitin Vaidya.
  2. Fault-tolerance in Distributed Optimization: The Case of Redundancy (pdf), co-authored with Nitin Vaidya. Presented at the 39th ACM Symposium on Principles of Distributed Computing (PODC'20): YouTube Video.
  3. Resilience in Collaborative Optimization: Redundant and Independent Cost Functions (arXiv, March'20), co-authored with Nitin Vaidya.
  4. Randomized Reactive Redundancy for Byzantine Fault-Tolerance in Parallelized Learning (arXiv, Dec'19), co-authored with Nitin Vaidya.
  5. Byzantine Fault-Tolerant Parallelized Stochastic Gradient Descent for Linear Regression (pdf), co-authored with Nitin Vaidya. Presented at the 2019 Allerton Conference at the University of Illinois at Urbana-Champaign.
  6. Byzantine Fault Tolerant Distributed Linear Regression (arXiv, March'19), co-authored with Nitin Vaidya.