DEDALE: Mathematical Tools to Help Navigate the Big Data Maze

Managing the huge volumes and varying streams of Big Data digital information presents formidable analytical challenges to anyone wanting to make sense of it. Consider the mapping of space, where scientists collect, process and transmit giga-scale data sets to generate accurate visual representations of millions of galaxies. Or consider the vast information being generated by genomics and bioinformatics as genomes are mapped and new drugs discovered. And soon the Internet of Things will bring millions of interconnected information-sensing and transmitting devices.

Improving Weak Lensing Mass Map Reconstructions using Gaussian and Sparsity Priors: Application to DES SV


Authors: N. Jeffrey, F. B. Abdalla, O. Lahav, F. Lanusse, J.-L. Starck, et al.
Year: 01/2018
Download: ADS | arXiv


Mapping the underlying density field, including non-visible dark matter, using weak gravitational lensing measurements is now a standard tool in cosmology. Due to its importance to the science results of current and upcoming surveys, the quality of the convergence reconstruction methods should be well understood. We compare three different mass map reconstruction methods: Kaiser-Squires (KS), Wiener filter, and GLIMPSE. KS is a direct inversion method, taking no account of survey masks or noise. The Wiener filter is well motivated for Gaussian density fields in a Bayesian framework. The GLIMPSE method uses sparsity, with the aim of reconstructing non-linearities in the density field. We compare these methods with a series of tests on the public Dark Energy Survey (DES) Science Verification (SV) data and on realistic DES simulations. The Wiener filter and GLIMPSE methods offer substantial improvement on the standard smoothed KS with a range of metrics. For both the Wiener filter and GLIMPSE convergence reconstructions we present a 12% improvement in Pearson correlation with the underlying truth from simulations. To compare the mapping methods' abilities to find mass peaks, we measure the difference between peak counts from simulated ΛCDM shear catalogues and catalogues with no mass fluctuations. This is a standard data vector when inferring cosmology from peak statistics. The maximum signal-to-noise value of these peak statistic data vectors was increased by a factor of 3.5 for the Wiener filter and by a factor of 9 using GLIMPSE. With simulations we measure the reconstruction of the harmonic phases, showing that the concentration of the phase residuals is improved 17% by GLIMPSE and 18% by the Wiener filter. We show that the correlation between the reconstructions from data and the foreground redMaPPer clusters is increased 18% by the Wiener filter and 32% by GLIMPSE.
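To make the direct inversion performed by KS concrete, here is a minimal sketch under strong simplifying assumptions that are ours, not the paper's: a flat-sky, periodic, unmasked grid and a synthetic white-noise convergence field. The function names (`ks_kernel`, `kappa_to_shear`, `kaiser_squires`) are hypothetical and do not come from the paper's code. The sketch also computes the Pearson correlation with the input map, the kind of metric quoted in the abstract.

```python
import numpy as np

def ks_kernel(shape):
    """Flat-sky kernel D(k) = (k1^2 - k2^2 + 2i k1 k2) / |k|^2, with D(0) = 0."""
    k1 = np.fft.fftfreq(shape[1])[np.newaxis, :]
    k2 = np.fft.fftfreq(shape[0])[:, np.newaxis]
    k_sq = k1**2 + k2**2
    k_sq[0, 0] = 1.0  # avoid 0/0; the numerator is zero there, so D(0) = 0
    return (k1**2 - k2**2 + 2j * k1 * k2) / k_sq

def kappa_to_shear(kappa):
    """Forward model: convergence map -> shear components (gamma1, gamma2)."""
    gamma = np.fft.ifft2(ks_kernel(kappa.shape) * np.fft.fft2(kappa))
    return gamma.real, gamma.imag

def kaiser_squires(gamma1, gamma2):
    """Direct inversion: shear -> convergence (E-mode), ignoring masks and noise."""
    gamma_hat = np.fft.fft2(gamma1 + 1j * gamma2)
    kappa = np.fft.ifft2(np.conj(ks_kernel(gamma1.shape)) * gamma_hat)
    return kappa.real

def pearson(a, b):
    """Pearson correlation coefficient between two maps."""
    return np.corrcoef(a.ravel(), b.ravel())[0, 1]

# Synthetic test field (assumption: white noise stands in for a real mass map).
rng = np.random.default_rng(0)
kappa_true = rng.normal(size=(64, 64))
g1, g2 = kappa_to_shear(kappa_true)

# Noise-free inversion recovers the map up to its mean (the k = 0 mode is lost).
kappa_rec = kaiser_squires(g1, g2)
pearson_clean = pearson(kappa_true, kappa_rec)

# Adding shape noise to the shear degrades the unfiltered KS map.
noise1 = rng.normal(scale=0.5, size=g1.shape)
noise2 = rng.normal(scale=0.5, size=g2.shape)
kappa_noisy = kaiser_squires(g1 + noise1, g2 + noise2)
pearson_noisy = pearson(kappa_true, kappa_noisy)
```

Because KS passes the shear noise straight into the map, it is normally smoothed afterwards. The Wiener filter and GLIMPSE instead build a prior into the reconstruction itself.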

Cosmostat Day on Machine Learning in Astrophysics

Date: January the 26th, 2018

Organizer: Joana Frontera-Pons


Local information

CEA Saclay is around 23 km South of Paris. The astrophysics division (DAp) is located at the CEA site at Orme des Merisiers, which is around 1 km South of the main CEA campus. See for detailed information on how to arrive.

On January the 26th, 2018, we are organizing the third day on machine learning in astrophysics at DAp, CEA Saclay.


All talks are taking place at DAp, Salle Galilée (Building 713)

10:00 - 10:45h. Artificial Intelligence: Past, present and future - Marc Duranton (CEA Saclay)
10:45 - 11:15h. Astronomical image reconstruction with convolutional neural networks - Rémi Flamary (Université Nice-Sophia Antipolis)
11:15 - 11:45h. CNN based strong gravitational Lens finder for the Euclid pipeline - Christoph Ernst René Schäfer (Laboratory of Astrophysics, EPFL)

12:00 - 13:30h. Lunch

13:30 - 14:00h. Optimize training samples for future supernova surveys using Active Learning - Emille Ishida (Laboratoire de Physique de Clermont)
14:00 - 14:30h. Regularization via proximal methods - Silvia Villa (Politecnico di Milano)
14:30 - 15:00h. Deep Learning for Physical Processes: Incorporating Prior Scientific Knowledge - Arthur Pajot (LIP6)
15:00 - 15:30h. Wasserstein dictionary Learning - Morgan Schmitz (CEA Saclay - CosmoStat)

15:30 - 16:00h. Coffee break

16:00 - 17:00h. Round table


Artificial Intelligence: Past, present and future

Marc Duranton (CEA Saclay)

There is a great deal of hype today about Deep Learning and its applications. This technology originated in the 1950s from a simplification of the observations made by neurophysiologists and vision specialists who tried to understand how neurons interact with each other and how the brain is structured for vision. This talk will revisit the history of the connectionist approach and will give a quick overview of how it works and of its current applications in various domains. It will also open discussions on how bio-inspiration could lead to a new approach in computing science.

Astronomical image reconstruction with convolutional neural networks

Rémi Flamary (Université Nice-Sophia Antipolis)

State-of-the-art methods in astronomical image reconstruction rely on the resolution of a regularized or constrained optimization problem. Solving this problem can be computationally intensive, especially with large images. In this work we investigate the use of convolutional neural networks for image reconstruction in astronomy. With neural networks, the computationally intensive task is the training step, but the prediction step has a fixed complexity per pixel, i.e. a linear complexity. Numerical experiments for fixed PSF and varying PSF in large fields of view show that CNNs are computationally efficient and competitive with optimization-based methods, in addition to being interpretable.

CNN based strong gravitational Lens finder for the Euclid pipeline

Christoph Ernst René Schäfer (Laboratory of Astrophysics, EPFL) 

Within the Euclid survey, 10^5 new strong gravitational lenses are expected to be found within 35% of the observable sky. Identifying these objects in a reasonable amount of time necessitates the development of powerful machine-learning-based classifiers. One option for the Euclid pipeline is CNN-based classifiers, which performed admirably during the last Galaxy-Galaxy Strong Lensing Challenge. This talk will first showcase the potential of CNNs for this particular task and then expose some of the issues that CNNs still have to overcome.

Optimize training samples for future supernova surveys using Active Learning

 Emille Ishida (Laboratoire de Physique de Clermont)

The full exploitation of the next generation of large-scale photometric supernova surveys depends heavily on our ability to provide a reliable early-epoch classification based solely on photometric data. In preparation for this scenario, there have been many attempts to apply different machine learning algorithms to the supernova photometric classification problem. Although different methods achieve different degrees of success, textbook machine learning methods fail to address the crucial issue of the lack of representativeness between spectroscopic (training) and photometric (target) samples. In this talk I will show how Active Learning (or optimal experiment design) can be used as a tool for optimizing the construction of spectroscopic samples for classification purposes. I will present results on how designing spectroscopic samples from the beginning of the survey can achieve optimal classification results with a much lower number of spectra than the currently adopted strategy.
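One common Active Learning query strategy is uncertainty sampling: ask for labels (here, spectra) for the objects the current classifier is least sure about. The sketch below illustrates only that idea. The abstract does not say which strategy the talk uses, and the function name `most_uncertain` and the toy probabilities are hypothetical.

```python
import numpy as np

def most_uncertain(probabilities, n_queries=1):
    """Return indices of the pool objects whose predicted class is least certain.

    probabilities: array of shape (n_objects, n_classes), each row summing to 1,
    as produced by any probabilistic classifier trained on the labelled sample.
    """
    probabilities = np.asarray(probabilities, dtype=float)
    top_class_prob = probabilities.max(axis=1)
    # The lower the probability of the most likely class, the less certain
    # the classifier is about that object.
    return np.argsort(top_class_prob, kind="stable")[:n_queries]

# Toy pool of four photometric objects with two classes (made-up numbers).
pool_probs = [[0.95, 0.05], [0.55, 0.45], [0.20, 0.80], [0.51, 0.49]]
queries = most_uncertain(pool_probs, n_queries=2)
```

In an Active Learning loop, the queried objects would be labelled, added to the training sample, and the classifier retrained before the next query.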

Regularization via proximal methods

Silvia Villa (Politecnico di Milano) 

In the context of linear inverse problems, I will discuss iterative regularization methods that can handle large classes of data-fit terms and regularizers. In particular, I will investigate the regularization properties of first-order proximal splitting optimization techniques. Such methods are appealing since their computational complexity is tailored to the estimation accuracy allowed by the data, as I will show theoretically and numerically.
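As a concrete instance of a first-order proximal splitting method, the sketch below runs forward-backward iterations (ISTA) for a least-squares data-fit term with an l1 regularizer. This is a textbook example chosen for illustration and is not necessarily one of the schemes analysed in the talk. The problem sizes, the value of `lam` and the function names are assumptions.

```python
import numpy as np

def soft_threshold(x, threshold):
    """Proximal operator of threshold * ||x||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - threshold, 0.0)

def ista(A, y, lam, n_iter=500):
    """Minimise 0.5 * ||A x - y||^2 + lam * ||x||_1 by forward-backward splitting."""
    # Step size 1/L, where L = ||A||_2^2 is the Lipschitz constant of the gradient.
    step = 1.0 / np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - y)                          # forward (gradient) step
        x = soft_threshold(x - step * grad, step * lam)   # backward (proximal) step
    return x

# Toy underdetermined linear inverse problem with a sparse solution.
rng = np.random.default_rng(0)
A = rng.normal(size=(30, 60))
x_true = np.zeros(60)
x_true[[5, 20, 41]] = [1.5, -2.0, 1.0]
y = A @ x_true
lam = 0.1
x_hat = ista(A, y, lam)
```

Each iteration costs two matrix-vector products, so the number of iterations directly sets the computational budget.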

Deep Learning for Physical Processes:  Incorporating Prior Scientific Knowledge 

Arthur Pajot (LIP6)

We consider the use of Deep Learning methods for modeling complex phenomena like those occurring in natural physical processes. With the large amount of data gathered on these phenomena the data intensive paradigm could begin to challenge more traditional approaches elaborated over the years in fields like maths or physics. However, despite considerable successes in a variety of application domains, the machine learning field is not yet ready to handle the level of complexity required by such problems. Using an example application, namely Sea Surface Temperature Prediction, we show how general background knowledge gained from physics could be used as a guideline for designing efficient Deep Learning models.

Wasserstein dictionary Learning

Morgan Schmitz (CEA Saclay - CosmoStat)

Optimal Transport theory enables the definition of a distance across the set of measures on any given space. This Wasserstein distance naturally accounts for geometric warping between measures (including, but not limited to, images). We introduce a new, Optimal Transport-based representation learning method in close analogy with the usual Dictionary Learning problem. Usual Dictionary Learning typically relies on a matrix dot-product between the learned dictionary and the codes making up the new representation. The relationship between atoms and data is thus ultimately linear.

We instead use automatic differentiation to derive gradients of the Wasserstein barycenter operator, and we learn a set of atoms and barycentric weights from the data in an unsupervised fashion. Since our data is reconstructed as Wasserstein barycenters of our learned atoms, we can make full use of the attractive properties of the Optimal Transport geometry. In particular, our representation allows for non-linear relationships between atoms and data.
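The difference between a linear combination and a Wasserstein barycenter is easiest to see in one dimension, where the barycenter has a closed form: average the quantile functions. The sketch below uses that closed form on equal-size samples. It is only an illustration of the geometry, not the method of the talk, which differentiates through the barycenter operator. The function name and the toy distributions are assumptions.

```python
import numpy as np

def wasserstein_barycenter_1d(samples, weights):
    """Barycenter of 1D empirical measures with equal sample counts.

    samples: sequence of 1D arrays of identical length.
    weights: non-negative barycentric weights summing to 1.
    Sorting each sample gives its empirical quantile function; the
    barycenter's quantile function is their weighted average.
    """
    sorted_samples = np.sort(np.asarray(samples, dtype=float), axis=1)
    return np.asarray(weights, dtype=float) @ sorted_samples

rng = np.random.default_rng(0)
a = rng.normal(loc=-2.0, scale=0.5, size=1000)  # bump on the left
b = rng.normal(loc=3.0, scale=0.5, size=1000)   # bump on the right

# Wasserstein barycenter: a single bump displaced to an intermediate position.
bary = wasserstein_barycenter_1d([a, b], [0.5, 0.5])

# Linear (mixture) average of the two measures: both bumps are kept.
mixture = np.concatenate([a, b])
```

The barycenter stays as narrow as its inputs while the mixture spreads over both bumps. This is the warping behaviour that a linear dictionary cannot reproduce.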


 Previous Cosmostat Days on Machine Learning in Astrophysics :


Big Bang and Big Data

New international projects, such as the Euclid space telescope, are ushering in the era of Big Data for cosmologists. Our questions about dark matter and dark energy, which on their own account for 95% of the content of our Universe, throw up new algorithmic, computational and theoretical challenges. A fourth challenge concerns reproducible research, a fundamental concept for the verification and credibility of published results.

Astrophysics and MRI, a marriage that makes sense

The CEA's Fundamental Research Division is launching the COSMIC project, born from bringing together two areas of data-processing expertise, located at the Institut des sciences du vivant Frédéric-Joliot (NeuroSpin) and at CEA-Irfu (CosmoStat). The data acquisition mechanisms in radio astronomy and in MRI have similarities. Indeed, the mathematical models used are based on the principles of sparsity and compressed sensing, derived from harmonic analysis.

Unsupervised feature learning for galaxy SEDs with denoising autoencoders


Authors: Frontera-Pons, J., Sureau, F., Bobin, J. and Le Floc'h E.
Journal: Astronomy & Astrophysics
Year: 2017
Download: ADS | arXiv


With the increasing number of deep multi-wavelength galaxy surveys, the spectral energy distribution (SED) of galaxies has become an invaluable tool for studying the formation of their structures and their evolution. In this context, standard analysis relies on simple spectro-photometric selection criteria based on a few SED colors. Although this fully supervised classification has already yielded clear achievements, it is not optimal for extracting relevant information from the data. In this article, we propose to employ very recent advances in machine learning, and more precisely in feature learning, to derive a data-driven diagram. We show that the proposed approach based on denoising autoencoders recovers the bi-modality in the galaxy population in an unsupervised manner, without using any prior knowledge on galaxy SED classification. This technique has been compared to principal component analysis (PCA) and to standard color/color representations. In addition, preliminary results illustrate that this enables the capturing of extra physically meaningful information, such as redshift dependence, galaxy mass evolution and variation over the specific star formation rate. PCA also results in an unsupervised representation with physical properties, such as mass and sSFR, although this representation separates out other characteristics (bimodality, redshift evolution) less well than denoising autoencoders.
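For readers unfamiliar with the PCA baseline mentioned above, here is a minimal sketch of how a linear, unsupervised two-dimensional representation can be derived from a table of SEDs. The synthetic data (two populations differing in spectral slope across 10 bands) and the function name `pca_project` are assumptions for illustration only. The sketch does not reproduce the paper's data, preprocessing or autoencoder.

```python
import numpy as np

def pca_project(data, n_components=2):
    """Project rows of `data` onto their leading principal components.

    Returns the low-dimensional coordinates and the fraction of variance
    explained by each retained component.
    """
    centered = data - data.mean(axis=0)
    _, singular_values, components = np.linalg.svd(centered, full_matrices=False)
    explained = singular_values**2 / np.sum(singular_values**2)
    return centered @ components[:n_components].T, explained[:n_components]

# Synthetic stand-in for galaxy SEDs: 200 objects, 10 bands, two populations
# with opposite spectral slopes plus a little noise (made-up numbers).
rng = np.random.default_rng(0)
n_bands = 10
slopes = np.concatenate([rng.normal(-1.0, 0.1, 100), rng.normal(1.0, 0.1, 100)])
bands = np.linspace(-1.0, 1.0, n_bands)
seds = slopes[:, None] * bands[None, :] + rng.normal(0.0, 0.05, (200, n_bands))

coords, explained = pca_project(seds)
```

On this toy data the first component separates the two populations because their difference is linear in the inputs. A denoising autoencoder replaces the linear projection with a learned non-linear encoder.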