Moving Lands
This gallery showcases a few examples of the video pieces produced by the LL project. Phase 2 is currently ongoing. In-depth writings about these and other visualizations are being prepared. Read one of the latest published articles, “Loss landscapes and the blessing of dimensionality”, on Medium.
DROP visualizes the changes produced in the loss landscape as the dropout hyperparameter is gradually increased. Loss landscape generated with real data: convnet, imagenette dataset, sgd-adam, bs=16, bn, lr sched, train mode, 250k pts, 20p-interp, log scaled (orig loss nums) & vis-adapted. As dropout increases, we see a noise layer gradually taking over the landscape: a layer disruptive enough to help prevent overfitting and the memorization of paths and routes across the landscape, yet not disruptive enough to prevent convergence to a good minimum (unless dropout is taken to extreme values).
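As a rough sketch of the kind of sweep behind DROP (the architecture, data pipeline and training loop below are simplified stand-ins, not the project's actual code), one could train the same convnet several times while varying only the dropout probability and compare the resulting losses:

```python
# Minimal sketch: train the same small convnet with increasing dropout
# probabilities and record the final training loss for each run.
# The network and data loader here are illustrative assumptions.
import torch
import torch.nn as nn

def make_convnet(p_drop: float) -> nn.Module:
    return nn.Sequential(
        nn.Conv2d(3, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
        nn.Dropout2d(p_drop),
        nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Dropout(p_drop),
        nn.Linear(64, 10),
    )

def train_once(p_drop, loader, epochs=1):
    model = make_convnet(p_drop)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            opt.step()
    return loss.item()

# Hypothetical usage, assuming a `train_loader` of labeled images:
# for p in [0.0, 0.1, 0.3, 0.5, 0.7, 0.9]:
#     print(p, train_once(p, train_loader))
```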
CROWN | Comparison study between the loss landscapes of the ReLU, Mish and Swish activation functions during the 200th epoch of training of a ResNet-20 network. ResNet-20 | BS=128 | LR sched | Mom=0.9 | WD=1e-4 | Eval mode. A collaboration by Diganta Misra, Ajay Uppili Arasanipalai, Trikay Nalamada and Javier Ideami, as part of the Landskape deep learning research group projects.
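For reference, the three activations compared in CROWN can be written out from their definitions; this is a minimal sketch (PyTorch already ships ReLU and SiLU/Swish, and recent versions include Mish as well):

```python
# The three activations compared in CROWN, written out explicitly.
import torch
import torch.nn.functional as F

def relu(x):
    # ReLU: max(0, x)
    return torch.clamp(x, min=0.0)

def swish(x):
    # Swish (also known as SiLU): x * sigmoid(x)
    return x * torch.sigmoid(x)

def mish(x):
    # Mish: x * tanh(softplus(x)); smooth and non-monotonic like Swish
    return x * torch.tanh(F.softplus(x))

x = torch.linspace(-4, 4, 9)
print(relu(x), swish(x), mish(x), sep="\n")
```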
LL Library visualizes a concept prototype for a library of loss landscapes. The loss landscapes featured are created with real data, using ResNet-20 architectures, with batch sizes of 16 and 128 and the Adam optimizer. This is part of an ongoing project.
LOTTERY visualizes the performance of a ResNet-18 (MNIST dataset) as the weights of the network are gradually pruned (based on arXiv:1803.03635 by Jonathan Frankle and Michael Carbin). Up to roughly 80% pruning, we can observe in this specific network that the retrained networks with pruned weights can equal or exceed the performance of the original network on the test dataset. Loss landscape generated with real data: ResNet-18 / MNIST, sgd-adam, bs=60, lr sched, eval mode, log scaled (orig loss nums) & vis-adapted.
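A minimal sketch of the pruning step in this kind of experiment, assuming an already-trained model (the lottery ticket procedure then resets and retrains the surviving weights, which is not shown here):

```python
# Global magnitude pruning sketch: remove the smallest-magnitude weights
# across all conv and linear layers. `model` is assumed to be a trained
# ResNet-18; the 0.8 amount mirrors the ~80% pruning mentioned above.
import torch.nn as nn
import torch.nn.utils.prune as prune

def prune_global(model: nn.Module, amount: float) -> nn.Module:
    params = [
        (m, "weight")
        for m in model.modules()
        if isinstance(m, (nn.Conv2d, nn.Linear))
    ]
    prune.global_unstructured(
        params, pruning_method=prune.L1Unstructured, amount=amount
    )
    return model

# Hypothetical usage:
# model = prune_global(model, amount=0.8)
```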
ICARUS | Mode connectivity: optima of complex loss functions connected by simple curves over which training and test accuracy are nearly constant. This visualization uses real data and shows the training process that connects two optima through a pathway generated with a Bézier curve. To create ICARUS, 15 GPUs were used over more than 2 weeks to produce over 50 million loss values. The entire process, end to end, took over 4 weeks of work.
Icarus takes its name from Greek mythology. Wikipedia puts it this way: “In Greek mythology, Icarus is the son of the master craftsman Daedalus, the creator of the Labyrinth. Icarus and his father attempt to escape from Crete by means of wings that his father constructed from feathers and wax.” Now we can establish an analogy. The loss landscape is another labyrinth, where our objective, our escape, is to find a low enough valley, the target of our optimizer. Yet this is no ordinary labyrinth. The loss landscape is high-dimensional and, unlike in typical labyrinths, it is indeed possible to find other ways, shortcuts that can link some of those optima. So just as Daedalus and Icarus use special wings to escape Crete, the authors of the paper combine simple curves (in this specific video, a Bézier curve) with their custom training process to escape the isolation between the optima, demonstrating that even though straight lines connecting the optima must cross hills with very high loss values, there are other pathways through which training and test accuracy remain nearly constant. In addition, the morphology of the connected optima represented in this video resembles a set of wings. These wings come to life within the different strategies applied by these modern, Icarus-like scientists as they pursue new ways to escape the isolation of the optima found in these kinds of loss landscapes.
Visualization data generated through a collaboration between Pavel Izmailov (@Pavel_Izmailov), Timur Garipov (@tim_garipov) and Javier Ideami (@ideami). Based on the NeurIPS 2018 paper by Timur Garipov, Pavel Izmailov, Dmitrii Podoprikhin, Dmitry Vetrov, Andrew Gordon Wilson: https://arxiv.org/abs/1802.10026 | Creative visualization and artwork produced by Javier Ideami.
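As a sketch of the idea behind ICARUS (not the authors' actual training code, which also learns the bend point of the curve), the loss can be evaluated along a quadratic Bézier curve between two trained solutions; `loss_of_weights` is a hypothetical helper that loads a flat weight vector into the network and returns the loss on real data:

```python
# Quadratic Bézier curve between two optima w0 and w2 with bend point
# theta, following the mode connectivity idea of Garipov et al.
import torch

def bezier_point(w0, theta, w2, t):
    # (1 - t)^2 * w0 + 2 t (1 - t) * theta + t^2 * w2
    return (1 - t) ** 2 * w0 + 2 * t * (1 - t) * theta + t ** 2 * w2

def loss_along_curve(loss_of_weights, w0, theta, w2, steps=21):
    losses = []
    for t in torch.linspace(0, 1, steps):
        w_t = bezier_point(w0, theta, w2, t)
        losses.append(loss_of_weights(w_t))
    return losses
```

If the curve is well chosen, these losses stay low along the whole path, whereas the straight line between w0 and w2 typically crosses regions of very high loss.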
DRONE. This project is ongoing and new updates will be posted later on. Real-data visualization of a geometric convnet | Created together with Neural Concept SA | Some of the parameters of the project: L2 loss, Adam, BS=1, LR=0.0001 / 24588 vertices. The video shows the activations of the convolutional layers inside a geometric CNN extracting features from the surface of a drone, while the network is being trained to predict aerodynamic properties of the aircraft. The numeric field values are what the network is being trained to predict: they represent the pressure exerted by the air on the drone. The colors over the drone are the features that the geometric convnet is extracting in order to predict those air pressure values.
LATENT visualizes the initial stages of the training of a Wasserstein GP GAN network, trained on the CelebA dataset. The first part of the video shows the first 1K steps of the training, and the final part shows steps 10K to 11K. The middle part shows part of the loss landscape of the generator after the first 820 training steps. The morphology and dynamics of the generator’s landscape around the minimizer are very diverse and change quickly, expressing the complexity of the generator’s task and the array of possible routes ahead. Loss landscape generated with real data: Wasserstein GP GAN, CelebA dataset, sgd-adam, bs=64, train mode, 300k pts, 1 w range, latent space dimensions: 200, generator is sometimes reversed for visual purposes, critic is log scaled (orig loss nums) & vis-adapted.
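For context on the critic landscape shown in LATENT, here is a minimal sketch of the standard WGAN-GP critic loss (Wasserstein term plus gradient penalty); `critic`, `real` and `fake` are assumed to be the critic network and batches of real and generated images:

```python
# WGAN-GP critic loss sketch: Wasserstein loss plus a gradient penalty
# computed on random interpolations between real and generated samples.
import torch

def critic_loss_wgan_gp(critic, real, fake, lambda_gp=10.0):
    # Wasserstein term: the critic should score real high and fake low
    w_loss = critic(fake).mean() - critic(real).mean()

    # Gradient penalty on interpolated samples
    eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    mixed = (eps * real + (1 - eps) * fake).requires_grad_(True)
    scores = critic(mixed)
    grads = torch.autograd.grad(
        outputs=scores.sum(), inputs=mixed, create_graph=True
    )[0]
    gp = ((grads.view(grads.size(0), -1).norm(2, dim=1) - 1) ** 2).mean()
    return w_loss + lambda_gp * gp
```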
LR COASTER visualizes a learning rate stress test during the training of a convnet. We ride along the minimizer while exploring its nearby surroundings. I use extreme changes in the learning rate to illustrate how the morphology and dynamics of the loss landscape respond to them. The resolution (300K loss values calculated per frame) allows us to explore the change in morphology. More details and related analysis about this and other visualizations will be published in the future.
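A minimal sketch of this kind of stress test, assuming a hypothetical `model`, `optimizer`, `loss_fn` and data `loader`: the learning rate is changed abruptly at chosen steps and the loss is watched in between.

```python
# Abruptly change the optimizer's learning rate during training to
# observe how the loss behaves. The schedule values are illustrative.
import torch

def set_lr(optimizer, lr):
    for group in optimizer.param_groups:
        group["lr"] = lr

# Hypothetical usage:
# stress_schedule = {0: 1e-3, 200: 1e-1, 400: 1e-5, 600: 1e-2}
# for step, (x, y) in enumerate(loader):
#     if step in stress_schedule:
#         set_lr(optimizer, stress_schedule[step])
#     loss = loss_fn(model(x), y)
#     optimizer.zero_grad(); loss.backward(); optimizer.step()
```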
Real Data visualizations using PCA directions. From the authors of the paper: “Machine learning models are used to make decisions, and representing uncertainty is crucial for decision making, especially in safety-critical applications. Deep learning models trained by minimizing the loss on the train dataset tend to provide overconfident and miscalibrated predictions because they ignore uncertainty over the parameters of the model. In Bayesian machine learning we account for this uncertainty: we form a distribution over the weights of the model, known as posterior. This distribution captures different models that all explain train data well, but provide different predictions on the test data. For Neural networks the posterior distribution is very complex: there is no way to compute it exactly and we have to approximate it. A key challenge for approximate inference methods is to capture the geometry of the posterior distribution or, equivalently, the loss landscape.
The idea of our SWAG is to extract the information about the posterior geometry from the SGD trajectory. We start by pre-training a Neural Network with SGD, Adam or any other optimizer, to get a good initial solution. This part is the same as the standard training of the model. Starting from the pre-trained solution, we run SGD with a high constant learning rate. In this setting instead of converging to a single solution, SGD would bounce around different solutions that all explain the train data well. We then construct a Gaussian distribution that captures these different solutions traversed by SGD, and use it as our approximation to the posterior. It turns out that this simple procedure captures the local Geometry of the posterior remarkably well.”
Based on the paper by Wesley Maddox, Timur Garipov, Pavel Izmailov, Dmitry Vetrov and Andrew Gordon Wilson. Visualization is a collaboration between Pavel Izmailov, Timur Garipov and Javier Ideami (ideami@losslandscape.com). NeurIPS 2019, arXiv:1902.02476 | losslandscape.com.
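A minimal sketch of the idea described in the quote above: collect weight snapshots along a high-learning-rate SGD run and fit a Gaussian to them. This diagonal version is a simplification; the full SWAG of Maddox et al. also maintains a low-rank covariance component.

```python
# Diagonal SWAG sketch: running mean and variance of flattened weight
# snapshots traversed by SGD, used as a Gaussian posterior approximation.
import torch

def collect_swag_diag(flat_weights_iter):
    """`flat_weights_iter` yields flattened weight snapshots along the SGD run."""
    mean, sq_mean, n = None, None, 0
    for w in flat_weights_iter:
        n += 1
        if mean is None:
            mean, sq_mean = w.clone(), w.clone() ** 2
        else:
            mean += (w - mean) / n          # incremental mean of weights
            sq_mean += (w ** 2 - sq_mean) / n  # incremental mean of squares
    var = torch.clamp(sq_mean - mean ** 2, min=1e-30)
    return mean, var

def sample_swag_diag(mean, var):
    # Draw one set of weights from the fitted Gaussian
    return mean + var.sqrt() * torch.randn_like(mean)
```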
SENTINEL visualizes the optimization process of a convnet during training mode, moving from a high loss value through the creation of an edge horizon to the final convexity and minimum. We ride along the minimizer while exploring its nearby surroundings. More details and related analysis about this and other visualizations will be published in the future.
WALTZ-RES visualizes the difference in morphology and dynamics between two ResNet-25 networks, one with skip connections and one without. In this fragment of the visualization, we can see the first two and a half epochs of the training process. We ride along the minimizer while exploring its nearby surroundings. More details and related analysis about this and other visualizations will be published in the future.
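For readers unfamiliar with the comparison, this is a minimal sketch of the two block variants involved (channel counts and layer choices are illustrative, not the project's exact architecture):

```python
# A residual block with an identity skip connection, and the same block
# with the skip removed, controlled by `use_skip`.
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, channels, use_skip=True):
        super().__init__()
        self.use_skip = use_skip
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.body(x)
        if self.use_skip:
            out = out + x  # identity skip connection
        return self.relu(out)
```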
EDGE HORIZON visualizes a loss landscape in extreme resolution, using 1 million loss points captured during the training of a convnet. The morphology of the landscape during the training phase is influenced by the parameters of the network. More details and related analysis about this and other visualizations will be published in the future.
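As a sketch of how a surface like this can be sampled (under simplified assumptions; `loss_of_weights`, `w_center`, and the two direction vectors are hypothetical helpers), the loss is evaluated on a dense 2D grid of perturbations around the current weights; a 1000 × 1000 grid gives the one million loss points mentioned above.

```python
# Evaluate the loss on a 2D grid around w_center along directions d1, d2.
import torch

def loss_grid(loss_of_weights, w_center, d1, d2, span=1.0, steps=1000):
    alphas = torch.linspace(-span, span, steps)
    betas = torch.linspace(-span, span, steps)
    grid = torch.empty(steps, steps)
    for i, a in enumerate(alphas):
        for j, b in enumerate(betas):
            grid[i, j] = loss_of_weights(w_center + a * d1 + b * d2)
    return grid  # steps * steps loss values, e.g. 1,000,000 for steps=1000
```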
GOBLIN takes us on a journey from above the edge horizon of the loss landscape of a convnet, during its training process, through the edge horizon (laterally) and to the perspective from below its dynamic convexity. We ride along the minimizer while exploring its nearby surroundings. More details and related analysis about this and other visualizations will be published in the future.
DOWN UNDER goes deep below the loss landscape of the training process of a convnet (while training mode is active), giving us a perspective from below, as the minimizer’s dynamics transform the nearby surroundings during its journey towards its final destination. We ride along the minimizer while exploring its nearby surroundings. More details and related analysis about this and other visualizations will be published in the future.
GENTLY follows the gentle change in the surroundings of the minimizer as we follow its gradual descent. We ride along the minimizer while exploring its nearby surroundings. More details and related analysis about this and other visualizations will be published in the future.
CONVEXITY DYNAMICS | Loss landscape convexity dynamics. Perspective from below.
MORPHOLOGY STUDY | Loss landscape morphology study.
Preparation phase
The gallery above will be expanded with more creations and associated writings over time. Before I began creating my own landscapes, there was a preparation phase in which I worked with existing data from other sources. An example of that phase is the landscape right below (which uses data from the excellent paper “Visualizing the Loss Landscape of Neural Nets” by Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer and Tom Goldstein). I also produced simulations, like the last video of the gallery, which was created before the Loss Landscape project began. All the Loss Landscape videos use real data and real networks except the very last one on this page, which was also the very first loss landscape video I created.
LL is led by Javier Ideami, AI researcher, multidisciplinary creative director, engineer and entrepreneur. Contact Ideami at ideami@ideami.com.