Coding in Julia vs. Python (when using existing Python Packages)

At this week’s meeting, I asked the group for some advice about coding in Julia vs. in Python. I am starting work on a project with David Spergel and Ruth Angus, simulating the exoplanet astrometry yield from the Nancy Grace Roman Space Telescope. We are going to look at synergies between Roman exoplanet astrometry and radial velocity (RV) observations from the Terra Hunting Experiment. To simulate the joint RV and astrometry signals, we are planning on using exoplanet or The Joker, as well as some other Python-only packages. So, I asked for advice on whether it would still be “worth it”, in terms of computational speed, to write the code in Julia if my code will primarily be a wrapper around other Python packages. I was also curious whether calling Python packages from Julia (using PyCall) would introduce a slowdown.

Eric Ford said that there is no speed penalty to using PyCall to run Python packages from Julia. He also mentioned that Python actually works quite well as a scripting language, so if I will primarily be calling Python packages for the meat of the code, then it would work well to do so in Python. The conversation then turned into a deeper discussion of the possible benefits and downsides of re-implementing the existing Python packages in Julia. Dan Foreman-Mackey suggested that the codes themselves are quite replicable and it wouldn’t be too difficult to re-implement them in Julia. David Hogg brought up the benefit of writing my own code from scratch in terms of developing a deeper understanding of how the code works. He also suggested that I “think about your goals and do what makes sense within those goals.” Megan Bedell and Adrian Price-Whelan then suggested that an alternative approach could be to implement a few toy or simplified cases to get a feel for what’s going on, but ultimately to use the existing packages for production work. Dan Foreman-Mackey also mentioned that the Julia code wouldn’t be quicker than the existing Python packages, as these packages already use C or C++ for the computationally expensive portions of the code.

The main takeaway I got from the conversation is that it would be possible to do the project in Julia by rewriting the Python packages that I would need, but that this is only really the right approach if it is something I am specifically interested in doing as a project in itself. Otherwise, if the primary goal is to quickly produce a functioning simulation of Roman astrometry + Terra Hunting RVs, then it is likely better to use the existing Python codebases and to write the wrapping/scripting code in Python as well.

Figure credit

Evolution of the exoplanet radius gap

At this week’s meeting I updated the group on an ongoing project examining the influence of age on the exoplanet radius gap, a relative scarcity of close-in exoplanets with sizes between 1.5 and 2 Earth radii. The feature is widely believed to be a signature of atmospheric loss, with planets below the gap having rocky compositions and planets above the gap requiring atmospheres of a few percent by mass to explain their low bulk densities. In the atmospheric loss paradigm, some fraction of the rocky planets are the remnant cores of planets that lost the entirety of their atmospheres.

At the core of this project is the observation of a more sparsely populated radius gap at younger system ages. My question to the group concerned estimating the statistical significance of the feature in 1D (i.e. just the radius distribution) and in 2D (i.e. the orbital period-radius plane). While the gap appears remarkably empty in 2D, the feature is somewhat washed out (or, less statistically significant) in 1D, partly because the radius at which the gap sits decreases with increasing orbital period. In the 1D case, David W. Hogg proposed that the data could be projected orthogonally to an assumed functional form of the gap. A grid search over the gap function parameters could be performed to test different projections and find the “best” (i.e. emptiest) gap, which would also allow us to estimate the feature’s significance through simulations.
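
To make that concrete, here is a minimal sketch of the projection-plus-grid-search idea, assuming a linear gap in log period-log radius space (the function names, bin width, and parameter grids are all placeholders):

```python
import numpy as np

def gap_count(logP, logR, m, b, width=0.03):
    """Count points within `width` dex of the line logR = m * logP + b,
    measured perpendicular to the line (i.e. after projecting the data
    orthogonally to the assumed gap)."""
    d = (logR - m * logP - b) / np.sqrt(1.0 + m**2)
    return np.sum(np.abs(d) < width)

def find_emptiest_gap(logP, logR, slopes, intercepts, width=0.03):
    """Grid search over assumed gap parameters; the emptiest gap minimizes
    the count of points near the line."""
    counts = np.array([[gap_count(logP, logR, m, b, width)
                        for b in intercepts] for m in slopes])
    i, j = np.unravel_index(np.argmin(counts), counts.shape)
    return slopes[i], intercepts[j], counts[i, j]

# Significance could then be estimated by running the same search on
# simulated, gap-free catalogs and comparing the minimum counts.
```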

Christina Hedges brought up Hough transforms for the 2D case. Since there also appears to be a stellar mass dependence to the location of the radius gap (and conceivably a metallicity dependence), we would be interested in exploring the location of the gap in higher dimensions. Generalizing the Hough transform to higher dimensions was discussed, and Christina threw together a lovely notebook for the 2D case.
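
For reference, a bare-bones 2D Hough accumulator looks something like the sketch below (this is not Christina’s notebook; the binning choices are arbitrary). For a gap, you would look for a pronounced minimum in the accumulator rather than the usual peak.

```python
import numpy as np

def hough_accumulator(x, y, n_theta=180, n_rho=200):
    """Accumulate votes for lines rho = x*cos(theta) + y*sin(theta)
    passing through each data point (x, y)."""
    thetas = np.linspace(0.0, np.pi, n_theta, endpoint=False)
    rho_max = np.hypot(np.abs(x).max(), np.abs(y).max())
    rho_edges = np.linspace(-rho_max, rho_max, n_rho + 1)
    acc = np.zeros((n_theta, n_rho))
    for xi, yi in zip(x, y):
        rho = xi * np.cos(thetas) + yi * np.sin(thetas)
        idx = np.clip(np.digitize(rho, rho_edges) - 1, 0, n_rho - 1)
        acc[np.arange(n_theta), idx] += 1
    return thetas, rho_edges, acc
```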

Finding Unresolved Stellar Companions in Kepler Planet Hosts with Gaia

At this week’s meeting, I updated the group about my project with D F-M, Andy Casey (Monash University), and Tim Morton (USC, the west coast one). We are looking into how to determine how the binarity of Kepler planet hosts influences planet formation. That sounds really exciting, but “determine” is doing a lot of work in that sentence. First, we have to figure out which stars are (unresolved) binaries. There are a couple of different ways in which we’ll go about doing this, but the one highlighted this week is the fruit of Andy’s work.

Alongside the astrometric and RV measurements themselves, Gaia also reports the errors associated with its single-star fits. For an unresolved binary, trying to fit the astrometric or RV curve with a single-star model yields uncharacteristically large errors. Through Andy’s work (which I can’t claim to understand beyond the surface level), we can turn this excess error into a likelihood that the star is not a single star.
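
Andy’s model is more sophisticated than anything I could sketch here, but as a cartoon of the idea, assuming we have some measure of excess noise for each star and rough guesses for the single-star and non-single populations (every number below is a made-up placeholder):

```python
import numpy as np
from scipy.stats import norm

def prob_not_single(log_excess, mu_single=-1.0, sig_single=0.3,
                    mu_binary=0.5, sig_binary=0.5, f_binary=0.3):
    """Toy two-component mixture in log(excess noise): the posterior
    probability that a star belongs to the 'not single' component."""
    like_s = norm.pdf(log_excess, mu_single, sig_single)
    like_b = norm.pdf(log_excess, mu_binary, sig_binary)
    return f_binary * like_b / (f_binary * like_b + (1 - f_binary) * like_s)
```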

Incredibly, this seems to work really well! In the color-magnitude diagram above of the KIC stars with astrometric measurements, we can pick out the stars with a low probability of being single sitting above the main sequence, right where we would expect binaries to be.

Interestingly, the same plot for KIC stars with RV measurements looks very different. There are a few selection effects here we’ll have to figure out: not all stars have both astrometric and RV measurements and (as pointed out by Trevor) the population of binaries detected by astrometry is different from that of RV binaries. Known spectroscopic and eclipsing binaries are picked up through RV jitter but not astrometric jitter!

Figuring out exactly how accurate this is will be key. Because neither method is batting 1.000, exactly how to quantify “I’m pretty sure this star is not a single star” represents an interesting puzzle for our near-future selves.

GPU-accelerated difference imaging

At this week’s meeting I shared an update about a project called PyTorchDIA led by James Hitchcock (St. Andrews) where we are looking into how numerical tools developed for machine learning can be applied to astronomy. Unlike many applications of PyTorch, TensorFlow, etc. in astronomy, I’m not talking about using neural networks. Instead, I’m talking about using PyTorch to compute fast convolutions and matrix-vector products to accelerate more “traditional” astronomy problems. This is somewhat similar in spirit to the wobble project led by Megan Bedell (that strangely has never appeared on this blog, Megan!).

For the PyTorchDIA project, James and I realized that we might be able to easily port existing difference imaging methods to run on the GPU using PyTorch since the bottleneck operation is solving a large linear problem where the design matrix is a convolution. This is exactly the type of operation encountered in deep learning applications and exactly what frameworks like PyTorch are designed for. James took this one step further, and instead of directly solving this linear system, he uses an iterative solver that only requires matrix-vector products (rather than a full Cholesky factorization). This allows us to relax some of the assumptions (like Gaussianity of the observation model) of the standard methods so that we can use a robust loss function instead of sigma clipping.
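
To give a flavor of what this looks like in practice, here is a stripped-down sketch (this is not PyTorchDIA itself: it uses a generic gradient-based optimizer rather than James’s iterative solver, ignores the differential background, and the kernel size and iteration counts are made up):

```python
import torch
import torch.nn.functional as F

def fit_dia_kernel(ref, target, ksize=19, n_iter=500, lr=0.1):
    """Fit a convolution kernel K such that ref * K approximates target,
    using a robust (Huber) loss instead of sigma clipping. `ref` and
    `target` are 2D image tensors; runs on the GPU if one is available."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    ref = ref.to(device)[None, None]        # shape (1, 1, H, W)
    target = target.to(device)[None, None]
    kernel = torch.zeros(1, 1, ksize, ksize, device=device, requires_grad=True)
    with torch.no_grad():
        kernel[0, 0, ksize // 2, ksize // 2] = 1.0  # start from a delta function
    opt = torch.optim.Adam([kernel], lr=lr)
    for _ in range(n_iter):
        opt.zero_grad()
        model = F.conv2d(ref, kernel, padding=ksize // 2)
        loss = F.huber_loss(model, target)
        loss.backward()
        opt.step()
    return kernel.detach().cpu()[0, 0]
```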

Using the GPUs available on Google Colab, James finds that his method is at least an order of magnitude faster than optimized implementations of other commonly used methods for difference imaging, across a wide range of realistic test cases. We will hopefully be submitting this paper in the next week or so and we are already planning on incorporating this method into the pipelines for several existing and upcoming difference imaging surveys. But, the real reason why I shared this with the group was that I believe that there are some (potentially) useful lessons learned:

  1. Astronomical imaging analysis shares a lot of structure with machine learning problems; let’s take advantage of that!
  2. Some of the linear problems that we work on in the group might be easier and faster to solve if we use iterative solvers rather than direct solvers. The scaling is different, but the existing implementations have been optimized for the forward calculation, and iterative methods open up the possibility of generalizing the loss function (see the sketch below).
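
Here is the toy illustration of point 2, using scipy’s matrix-free interface (nothing specific to difference imaging; the matrix is just random noise):

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

# Solve the normal equations (A^T A) x = A^T b without ever forming A^T A,
# using only matrix-vector products. A is dense here for illustration, but
# the same pattern works when applying A is a convolution or any other
# fast operator.
rng = np.random.default_rng(42)
A = rng.normal(size=(2000, 500))
b = rng.normal(size=2000)

normal_eq = LinearOperator((500, 500), matvec=lambda x: A.T @ (A @ x))
x, info = cg(normal_eq, A.T @ b)
assert info == 0  # converged
```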

Aside: I also want to mention the awesome website weirdgalaxi.es (yes that is the domain name, but don’t click on that if you’re using Safari) that Kate Storey-Fisher made (and shared at this week’s meeting) as a continuation of the project that she previously blogged about.

APOGEE + COSMIC =? <3

After arriving at CCA a few weeks ago, I’m starting on a project with Adrian Price-Whelan and David W. Hogg. We are trying to figure out how to answer the question: What does it mean that APOGEE doesn’t have a bunch of binary systems containing black holes? Barring the obvious reason that APOGEE’s targets are low-ish mass stars and thus unlikely to be companions to BHs and their high-mass progenitors, we think it will be interesting to compare a bunch of synthetic binary populations generated with COSMIC to Adrian’s close binary catalog.

Step 0 of this process is to work out how to generate mock APOGEE catalogs with COSMIC data. This means I need to do real astronomy to convert the effective temperatures and bolometric luminosities in my simulations to (gasp) magnitudes of both the absolute and apparent kind. Luckily for me, we found Tim Morton’s isochrones package, which provides interpolated bolometric corrections from the MIST bolometric correction tables. The figure above shows a CMD that I made from a little test sample of a few synthetic binaries that follow the age and metallicity distribution of the m12i galaxy in the Latte Suite of the FIRE cosmological simulations. I was pretty excited to see the suuper obvious double main sequence and that COSMIC produced something that looks somewhat similar to what you’d expect (see e.g. Fig 5 here).
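
For the record, the “real astronomy” in step 0 mostly boils down to a couple of standard conversions. Here is a schematic version, where the bolometric correction is a placeholder for the value that isochrones interpolates from the MIST tables:

```python
import numpy as np

M_BOL_SUN = 4.74  # IAU reference bolometric magnitude of the Sun

def apparent_mag(L_Lsun, bc_band, distance_pc, A_band=0.0):
    """Convert a bolometric luminosity to an apparent magnitude in one band.
    `bc_band` is the bolometric correction for that band (in practice
    interpolated from the MIST tables) and `A_band` is the extinction,
    which we are ignoring for now."""
    M_bol = M_BOL_SUN - 2.5 * np.log10(L_Lsun)
    M_band = M_bol - bc_band
    mu = 5.0 * np.log10(distance_pc) - 5.0  # distance modulus
    return M_band + mu + A_band

def combine_mags(m1, m2):
    """Total magnitude of an unresolved pair (fluxes add), which is what
    produces the double main sequence in the CMD."""
    return -2.5 * np.log10(10**(-0.4 * m1) + 10**(-0.4 * m2))
```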

Steps 1 onward will include picking which kinds of binary interactions most severely impact a BH + star population that APOGEE can see (hello, BH natal kicks!), setting up our binary evolution model set accordingly, and including dust extinction. I’m sure eventually I will also open real astro data files that at some point in their distant post-processing past came from telescopes!

Estimating the Bolometric Light Curves of Supernovae

I’ve been working (slowly but surely) on an open-source code, extrabol, which estimates the bolometric light curves of supernovae from an arbitrary set of filtered photometric observations. This code is based on a package by Matt Nicholl, SuperBoL.

Estimating the bolometric luminosity and black body temperature of a supernova is an extremely common analysis problem in time-domain astronomy as it allows astronomers to easily probe supernova explosion dynamics. Unfortunately, supernova light curves are often poorly sampled across multiple photometric filters, making it challenging to build a pseudo-bolometric light curve directly from the data.

In order to solve this problem, extrabol uses a 2D Gaussian process to extrapolate the supernova spectral energy distribution (SED) as a function of both wavelength and time. Extrabol is linked to the Spanish Virtual Observatory, a database of nearly every commonly used astronomical filter. Given the name of a telescope/filter, extrabol will link the observation to an effective wavelength. Using an arbitrary set of filters from the user, extrabol will estimate the underlying SED and then use this SED to estimate a black-body temperature and radius. Eventually, the goal is to have extrabol linked with the Open Supernova Catalog, so that a user can type in the name of a supernova and quickly build a bolometric light curve using public data.
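
To make the moving parts concrete, here is a stripped-down sketch of that kind of 2D GP fit (this is not extrabol’s actual implementation; the kernel and length scales are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def fit_sed_gp(times, eff_wavelengths, mags, mag_errs):
    """2D GP over (time, log effective wavelength). Inputs are 1D arrays
    with one entry per photometric observation."""
    X = np.column_stack([times, np.log10(eff_wavelengths)])
    kernel = RBF(length_scale=[10.0, 0.1]) + WhiteKernel(1e-3)
    gp = GaussianProcessRegressor(kernel=kernel, alpha=mag_errs**2,
                                  normalize_y=True)
    gp.fit(X, mags)
    return gp

# The fitted GP can then be evaluated on a grid of epochs and wavelengths,
# and a Planck-function fit at each epoch gives a black-body temperature
# and radius.
```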

For now, I am still working out some bugs! The figure above shows an example light curve for a core-collapse supernova. Each observation is shown as a point (with error bars smaller than the points). The GP interpolation is shown as lines, with uncertainty shown as shaded regions. Currently, I am assuming a mean function at a very dim magnitude, which works well before and after the supernova itself; however, if there is a large gap in the data, the GP prefers artificially low luminosities. I got some very useful feedback from the Astro Data Group this week: investigate a more physically meaningful mean function, with the residuals modeled by the GP. My next step is figuring out whether there is such a function that can be used for all flavors of supernovae!

The Great Emu Fight

At AstroHackWeek 2020 last week, a fight broke out - not between the supportive and friendly participants, but between emulators. I pitched a hack project on comparing emulation methods, mostly so I could show this emu vs. kangaroo fight video during my pitch, and a team of fighters assembled. All credit for this project is shared with Catarina Alves, Johannes Heyl, Yssa Camacho-Neves, and Johnny Esteves!

One of my research projects is on emulating galaxy clustering statistics for speedier, more accurate inference of cosmological and assembly bias parameters. I had been wanting to try out different emulation methods on this data, as well as to write up a pedagogical tutorial on emulation. In brief, emulators are regressors that imitate more complex models or simulations. They are trained on a limited set of simulation inputs and outputs and learn the layout of this (often high-dimensional) parameter space; then, for any new set of inputs, they can quickly spit out output values without performing a physical calculation or running a full simulation.

Our hack team first wrote up a detailed explanation of emulators, complete with an example about a computational emu biologist; check it out in our complete tutorial here. Then we wanted to actually build some emulators and test them on astronomical data. My full-scale research dataset proved unwieldy for a week-long project, so I generated a set of two-point correlation functions using nbodykit. The winning emulator should be able to take the input values of (Omega_m, sigma_8, Omega_b) and predict the correlation function in each of 10 separation bins.

Each team member took on a different emulation framework. So far, our lineup includes an Artificial Neural Network, a Random Forest, a Decision Tree, and a Support Vector Machine (with Gaussian Processes on the way). The fighters tuned their implementations and hyperparameters to the best of their abilities. The fight was heated, with the emulators racing to train and test out their clustering predictions. You can see the predictions of the SVM compared to the true correlation function in the figure above.
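
To give a flavor of what each fighter looks like, here is roughly what an SVM emulator for this problem boils down to (a simplified sketch, not the team’s tuned implementation; the hyperparameters are placeholders):

```python
import numpy as np
from sklearn.multioutput import MultiOutputRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

def train_svm_emulator(X, y):
    """X: (n_samples, 3) array of (Omega_m, sigma_8, Omega_b);
    y: (n_samples, 10) array of the correlation function in 10 r bins.
    Fits one SVR per output bin, with standardized inputs."""
    model = MultiOutputRegressor(
        make_pipeline(StandardScaler(), SVR(C=10.0, epsilon=1e-3))
    )
    model.fit(X, y)
    return model

# emulator = train_svm_emulator(X_train, y_train)
# xi_pred = emulator.predict(np.array([[0.31, 0.81, 0.049]]))
```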

To see the outcome, you’ll have to follow the full battle at The Great Emu Fight notebook. But the takeaways were:

  • The Support Vector Machine emerged as the clear winner in accuracy for every metric we tested, emulating the correlation function to within a tenth of a percent for most r-bins. The Artificial Neural Network came in a close second. (The second figure above shows the R^2 error; you can see the large difference in accuracy between the SVM and ANN compared to the DTree and RF.)
  • However, the SVM was by far the slowest to train. The Decision Tree handily won the training round.
  • The Decision Tree and the ANN were the winners when it came to prediction time (this is super important for inference!).
  • The Random Forest was an all-around loser, being very slow at prediction and sadly inaccurate; we love and appreciate it anyway.
  • Most importantly, all of these results are super dependent on emulator implementation, hyperparameter tuning, data normalization, and the dataset itself! So the reigning champion likely won’t reign for long.
  • We built this project to be incredibly modular, so you can throw your own emulator or dataset into the ring. Follow one of our standalone notebook tutorials and tune the hyperparameters to improve on our results, or add in your own complete implementation, and join in the fight.

Check out the project in this repo, and follow the complete fight at The Great Emu Fight notebook. Huge thank you to the emu-fight team and the AstroHackWeek organizers!

How do people run big group meetings?

Last week we had our first meeting of the new semester and we decided to try something “new”. In fact, we went back to how we ran things back in the olden days: we take the total meeting time and divide it equally between the people who are online, and everyone gives an update. In many ways, this is really how we want group meetings to go. The whole point is that, since group members work on such a broad range of topics, we want to have an opportunity to hear what everyone else has been up to. The goal is to identify shared interests and encourage collaborations across research areas.

This time (unlike the olden days), we had 18 group members (or collaborators working actively on a project with a group member) online (and that isn’t even the whole group these days - I’m not complaining!). This meant that even with 90 minutes, the updates had to be very short and the whole thing felt like quite a whirlwind (and I needed to take a nap after logging off). As a result, I’m not going to write up a summary of all the awesome work that folks are doing. Instead, I want to ask if anyone has advice or thoughts about running big group meetings like this.

Overall, it seems like the group appreciated this format, but agrees that it might be a bit overwhelming to do it every week. So the current plan is to proceed with the format from the summer (a shared slide deck where group members can choose to add a slide and then share a 10-minute update with some discussion) for most meetings, and then about once a month we would have a whirlwind update from everyone. I don’t love this option, because this meeting is often the only time that I get updates from many group members. But we haven’t been able to identify a format that helps make sure that everyone gets an opportunity to share. We’d love to hear if there are other groups out there doing interesting things with their meetings. We’re always keen to try experiments, especially since we’ll be staying fully remote for the foreseeable future!

If you have thoughts, please Tweet at me or send me an email (should be easy to find!)

Summary, August 28, 2020

Jiayin Dong (DJ) updated us on her project to study the eccentricity distribution of warm-Neptune exoplanets discovered using data from the TESS mission. She has discovered and carefully characterized about 100 of these transiting planets. She then uses the “photoeccentric effect” to infer the distribution of eccentricities based on the observed distribution of effective “circular stellar densities”. The inferred distribution describes the observed population of orbital eccentricities, but if the detectability of a planet is correlated with its eccentricity, this won’t necessarily be equivalent to measuring the true intrinsic eccentricity distribution. This will be an important effect (eccentricity will scale with orbital period, and the probability that a planet transits is also a function of its eccentricity), so DJ asked the group for recommendations about how to take this into account. This led to a discussion about “inverse detection efficiency” methods versus likelihood-based methods for population inference. (I have opinions about this… I’m writing this post so I’m allowed to editorialize!)
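
For context, the relation at the heart of the photoeccentric effect fits in a one-liner: the stellar density inferred by assuming a circular orbit differs from the true density by a factor that depends only on the eccentricity and the argument of periastron.

```python
import numpy as np

def rho_circ_over_rho_star(e, omega):
    """Ratio of the 'circular' stellar density inferred from the transit
    to the true stellar density, for eccentricity e and argument of
    periastron omega (in radians)."""
    return (1.0 + e * np.sin(omega))**3 / (1.0 - e**2)**1.5
```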

Next, John Forbes shared a data analysis challenge: he has a simulation of a forming galaxy with snapshots of several quantities that, to order of magnitude, should sum to satisfy a differential equation. In practice, because of numerics and approximations, this differential equation is not exactly satisfied, so he is interested in fitting for possibly time- and space-invariant offsets or scales for each of these quantities to best fit the constraints. Currently, he is using linear least squares to fit for multiplicative factors (that don’t depend on time or position) for each quantity, but finds that the results are not satisfactory. Several suggestions were made for relaxing these assumptions and fitting a more flexible model. One option would be to fit multiple galaxies together (in some sort of hierarchical model) to propagate information between them. Another would be to use a Gaussian Process, either to model residuals away from the least-squares model or as a prior on the coefficients that enforces smoothness in time and space.
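
A cartoon of the current setup (this is not John’s actual code; the array name and the normalization choice are made up): if the scaled quantities should sum to zero, fit one multiplicative factor per quantity by linear least squares, fixing one factor to avoid the trivial all-zero solution.

```python
import numpy as np

def fit_global_factors(Q):
    """Q has shape (n_samples, n_quantities): each column is one term of
    the budget, flattened over time and position, and each row should sum
    to roughly zero. Returns one factor per quantity, constant in time and
    space, with the first factor fixed to 1."""
    a_rest, *_ = np.linalg.lstsq(Q[:, 1:], -Q[:, 0], rcond=None)
    return np.concatenate([[1.0], a_rest])

# Relaxing this, e.g. by letting the factors vary smoothly in time and
# space with a GP prior, or sharing them hierarchically across galaxies,
# was the gist of the suggestions.
```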

Then, David W. Hogg shared a first draft of a Mission Statement and 5-Year Plan for the Astronomical Data Group. These highlight some of the things that we think we do well (encourage better data analysis practices, develop and support open source software, etc.), but also identify some of the places where we want to spend more energy. The goal is to use these documents to help focus research areas and other group efforts, as well as hiring decisions.

Finally, I gave an update on the online.tess.science event (hackathon/sprint/something) that I am co-organizing. It starts on Sept 8 (just over one week from now!) and there will be nearly 150 participants from around the world. The header figure for this post shows the times that each participant is planning on being online during the 48 hours of the event - this really is a global experiment! We are working to find the right balance of unstructured, asynchronous time and synchronous social/networking/check-in time with the goal of encouraging participants to meet a broad range of other community members. Much of this, as well as our recommendations for how to navigate an event like this, is being codified in a “Participant Handbook” that I’ve drafted with my co-organizers. I expect that this document will continue to be useful for future events, even if they are in-person.

Summary, August 14, 2020

I started this week’s meeting by asking for feedback about our plans for the online.tess.science event that I am co-organizing in September. This event is the 2020 version (hopefully this is just a year, and not an indication that it will go up in flames) of the tess.science series of workshops where astronomers get together to collaborate on new projects related to NASA’s TESS mission. The previous 3 events have all been organized in-person and, since this event will be fully remote and international, there are various challenges (and opportunities) that we haven’t experienced before. One difference between this event and many other online conferences is that there will be no formal presentations. Instead, participants are encouraged to start new collaborations and try to learn something new. The group had a lot of good questions and feedback about our current plans. In particular, there was a good discussion about how important it will be to manage expectations, especially when people have experience with in-person events where they can walk across the room and ask questions of any other participant in real time. There were also some suggestions about how to encourage interactions between participants even if they are not directly collaborating. We’ll see how well we can execute on these plans!

Next, John Forbes discussed a data management/coding challenge that he’s currently facing. He has a large simulation where the properties of a set of about a million “particles” are saved to disk at a series of “times”. The output of this simulation can be thought of as a large multidimensional array where the first dimension is time, the second dimension is particle number, and the third indexes the various properties that are being tracked. This full dataset is too large to be loaded into memory, and John wants to run an analysis on a subset of the most interesting particles at all times. However, since the data are stored as snapshots, loading the properties of a particular particle at all times is a computationally expensive task. Besides a few snarky comments about serialization formats, the group had several suggestions for how to improve the performance of this operation. The main observation was that the data are essentially stored in row-major order, while the operations would be more efficient on column-major ordered data. The group suggested several tools that could help here, including dask, Parquet, and Arrow.
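
One concrete version of that suggestion, sketched with h5py (the file layout and dataset names are hypothetical; dask, Parquet, or Arrow would follow the same logic): pay the transpose cost once by writing the data into a layout chunked along the particle axis, so that reading one particle’s full time series touches only a few chunks.

```python
import h5py

def rechunk_by_particle(snapshot_files, out_path, n_props):
    """Write a single (n_times, n_particles, n_props) dataset chunked along
    the particle axis, so dset[:, i, :] is cheap for any particle i."""
    n_times = len(snapshot_files)
    with h5py.File(snapshot_files[0], "r") as f:
        n_particles = f["particles"].shape[0]  # assumed per-snapshot layout
    with h5py.File(out_path, "w") as out:
        dset = out.create_dataset(
            "particles",
            shape=(n_times, n_particles, n_props),
            dtype="f4",
            chunks=(n_times, 256, n_props),  # all times, 256 particles per chunk
        )
        for t, path in enumerate(snapshot_files):
            with h5py.File(path, "r") as f:
                dset[t] = f["particles"][...]
```

Writing it this way touches every chunk on every snapshot, so in practice you would buffer blocks of particles before writing, but it shows the target layout.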

And that was it… it is August, after all!