MultiRobustBench is a standardized benchmark for evaluating adversarial robustness against multiple attacks.
MultiRobustBench currently evaluates and ranks models based on performance on a set of 9 different attacks (L1, L2, Linf,
Elastic, L1 JPEG, Linf JPEG, ReColor, StAdv, and LPIPS) at 20 different attack strengths.
We provide two leaderboards for the CIFAR-10 dataset: one ranked by average competitiveness ratio (CRind-avg in the paper), which measures average multiattack robustness, and one ranked by worst-case competitiveness ratio (CRind-worst in the paper), which measures worst-case multiattack robustness. Users can toggle between these two leaderboards via the "Leaderboard selection" menu. Our leaderboards also report the stability constant (SC) computed on this set of attacks. Note that higher CR indicates better performance, while lower SC indicates better performance (although SC is best used only when comparing defenses that use the same training threat model).
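For intuition, here is a minimal sketch of how per-attack scores could be aggregated into these two rankings. It is not the benchmark's actual implementation: the function names are invented, and it assumes that the CR for each (attack, strength) pair is the defense's robust accuracy relative to that of the strongest individually (adversarially) trained baseline on the same pair; see the paper for the precise definitions.

```python
# Hypothetical illustration of CR aggregation; names and the exact per-pair
# CR definition are assumptions -- see the MultiRobustBench paper.
import numpy as np

def competitiveness_ratios(defense_acc, baseline_acc):
    """Per-(attack, strength) CR: defense accuracy relative to the best
    individually trained baseline's accuracy on the same (attack, strength).

    Both arguments: dict mapping (attack_name, strength) -> accuracy in [0, 1].
    """
    return {
        key: defense_acc[key] / baseline_acc[key]
        for key in defense_acc
        if baseline_acc[key] > 0
    }

def cr_avg(crs):
    """Average CR across all (attack, strength) pairs (CRind-avg-style score)."""
    return float(np.mean(list(crs.values())))

def cr_worst(crs):
    """Worst-case CR across all (attack, strength) pairs (CRind-worst-style score)."""
    return float(np.min(list(crs.values())))

# Toy example with two attacks at two strengths each (values invented).
defense = {("Linf", 4/255): 0.55, ("Linf", 8/255): 0.42,
           ("StAdv", 0.03): 0.50, ("StAdv", 0.05): 0.38}
baseline = {("Linf", 4/255): 0.62, ("Linf", 8/255): 0.50,
            ("StAdv", 0.03): 0.58, ("StAdv", 0.05): 0.47}

crs = competitiveness_ratios(defense, baseline)
print(f"CR-avg:   {cr_avg(crs):.3f}")   # higher is better
print(f"CR-worst: {cr_worst(crs):.3f}")  # higher is better
```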
MultiRobustBench offers the following additional features:
- User control of attack evaluation set: Under the "Attacks to use for metric computation" menu, users can select the set of attacks for which they want to see performance metrics. The attacks used for leaderboard rankings are selected by default. Changing the set of selected attacks and pressing the "Refresh Leaderboard + Graphs" button updates the scores shown on the leaderboard to those computed on the selected set. Note that leaderboard rankings are not affected by this change.
- Individual performance visualizations: While aggregate metrics such as CR and SC are useful for ranking performance, it is difficult to understand the weaknesses of a specific defense from these metrics alone. To this end, we allow users to see several per-defense visualizations by pressing the button next to the entry of the defense of interest. These visualizations include a plot of the defense's accuracy compared to adversarially training on each attack individually, defense accuracy as perturbation size increases for a selected attack type (see the plotting sketch after this list), a comparison of CR-in (CR computed on seen attacks) and CR-out (CR computed on unseen attacks) scores, and CR computed across each individual attack type.
- Performance comparison visualizations: We also allow users to see the above performance visualizations for up to 5 different defenses on the same graph, making direct comparison easy. To do this, users can select defense entries by checking the box in the rightmost column of the leaderboard and then press the "Compare Selected Defenses" button to see these graphs.
- Estimation of training complexity: Existing defenses against multiple attacks are generally more computationally expensive than defenses against individual attacks. Ideally, we would like to achieve multiattack robustness without significantly increasing the cost of training. In our leaderboard, we provide an estimate of training complexity in the PetaFLOPs (10^15 FLOPs) column of the table. For each defense, we compute this value via:
(number of FLOPs for a forward pass through the architecture) * ((number of forward passes per training example) + 2 * (number of backward passes per training example)) * (number of training data points per epoch) * (number of epochs)
The factor of 2 on backward passes reflects the common approximation that a backward pass costs roughly twice as many FLOPs as a forward pass (a worked example follows this list).
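As a concrete illustration of this formula, consider standard 10-step PGD adversarial training: each example incurs 10 forward and 10 backward passes to craft the perturbation, plus 1 of each for the weight update. All numbers below are hypothetical assumptions, not values from any leaderboard entry.

```python
# Illustrative training-cost estimate; every number here is a hypothetical
# assumption (e.g., the forward-pass FLOPs), not a leaderboard value.
def training_petaflops(fwd_flops, n_fwd, n_bwd, n_train, n_epochs):
    """Estimate total training cost in PetaFLOPs (10^15 FLOPs).

    fwd_flops: FLOPs for one forward pass through the architecture
    n_fwd:     forward passes per training example (attack steps + update)
    n_bwd:     backward passes per training example (attack steps + update)
    n_train:   training data points per epoch
    n_epochs:  number of training epochs
    """
    total = fwd_flops * (n_fwd + 2 * n_bwd) * n_train * n_epochs
    return total / 1e15

# 10-step PGD adversarial training of a hypothetical ~0.5 GFLOP model
# on CIFAR-10 (50,000 training images) for 100 epochs.
print(training_petaflops(
    fwd_flops=0.5e9,   # assumed forward-pass cost of the architecture
    n_fwd=11,          # 10 PGD steps + 1 training forward pass
    n_bwd=11,          # 10 PGD gradient steps + 1 training backward pass
    n_train=50_000,
    n_epochs=100,
))  # ~82.5 PetaFLOPs under these assumptions
```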
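The accuracy-versus-perturbation-size plots mentioned under the visualization features above can also be reproduced offline from per-strength accuracies. The following matplotlib sketch mimics such a comparison for two defenses on a single attack type; the defense names and accuracy values are invented for illustration.

```python
# Hypothetical re-creation of the "accuracy vs. perturbation size" comparison
# plot; the defense names and accuracy values below are invented.
import matplotlib.pyplot as plt

strengths = [1/255, 2/255, 4/255, 8/255, 16/255]  # Linf perturbation budgets
defenses = {
    "Defense A": [0.85, 0.78, 0.66, 0.48, 0.25],
    "Defense B": [0.82, 0.76, 0.68, 0.55, 0.33],
}

for name, accs in defenses.items():
    plt.plot(strengths, accs, marker="o", label=name)

plt.xlabel("Linf perturbation size (epsilon)")
plt.ylabel("Robust accuracy")
plt.title("Accuracy vs. attack strength (illustrative data)")
plt.legend()
plt.show()
```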
Contribute to MultiRobustBench: To add a new defense or attack to the MultiRobustBench leaderboard, please follow the steps described here.