Summary, July 31, 2020

We started this week’s meeting by discussing the possibility of organizing shared group meetings with other groups within CCA or within the broader astronomical community. While this group is focused on methodology, the scientific interests are broad so it could be useful to occasionally join forces with groups focused on specific research areas of interest to group members. It could also be useful as a method for encouraging random encounters between researchers who don’t normally overlap, especially while working remotely. Not everyone was completely enthusiastic about the idea, however, since there are already opportunities for group members to attend multiple group meetings within CCA and it might not be ideal to lose our one weekly meeting with the full group. It looks like we will try a test run with another group at CCA and, after that, consider options for future experiments like this.

Next, Ruth Angus asked for feedback on the project that she is proposing to the NSF CAREER program. She will focus on inferring stellar rotation periods from time-series data from the Zwicky Transient Facility and the Vera Rubin Observatory. There are many technical details on the time-domain-analysis side that she has been working hard on, but she was asking for feedback on the potential use cases for a huge catalog of rotation periods and gyrochronological ages for faint Milky Way stars. The group had many suggestions for applications and there was a good discussion of the project.

Rodrigo Luger talked about (gasp!) spherical harmonic representations of stellar surfaces and the resulting light curves. Building on the work he discussed last week, Rodrigo has worked out a better parameterization for the spherical harmonic expansion of spots on the stellar surface. He finds that if he parameterizes the spot using a Legendre polynomial expansion, all of the spherical harmonic math needed to compute light curves turns out to be simple, and the expansion of the spot is better behaved (no ringing and no negative excursions) than before. Furthermore, he has demonstrated that the time-series Gaussian Process that he presented last week is the Gaussian density with the minimum K-L divergence to the distribution over stars with discrete spots. This has some interesting connections to variational inference.
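To give a flavor of the mechanics (this is a toy illustration, not Rodrigo's actual parameterization), a smooth spot profile in x = cos(theta) can be projected onto Legendre polynomials with Gauss–Legendre quadrature, and a modest number of terms reconstructs it cleanly:

```python
import numpy as np
from numpy.polynomial import legendre

# Hypothetical smooth spot profile in x = cos(theta), where theta is the
# angular distance from the spot center (x -> 1 at the center)
def spot_profile(x, x0=0.8, w=0.1):
    return 1.0 / (1.0 + np.exp((x0 - x) / w))

# Project onto Legendre polynomials: c_l = (2l + 1)/2 * integral f(x) P_l(x) dx
lmax = 25
nodes, weights = legendre.leggauss(200)
f = spot_profile(nodes)
coeffs = np.array([
    (2 * l + 1) / 2 * np.sum(weights * f * legendre.legval(nodes, np.eye(lmax + 1)[l]))
    for l in range(lmax + 1)
])

# Reconstruct and check that the truncated expansion is well behaved
x = np.linspace(-1.0, 1.0, 500)
recon = legendre.legval(x, coeffs)
err = np.max(np.abs(recon - spot_profile(x)))
```

Because the profile here is smooth, the expansion converges quickly and stays essentially non-negative, which is the qualitative behavior Rodrigo was after.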

Then, John Forbes shared a project where he is studying metal enrichment in Upper Scorpius. The data analysis problem that he presented is that he needs to infer the distances between a handful of stars and a specific molecular cloud, the potential site of enrichment. The projected distance (in the plane of the sky) is well measured, but John really needs the physical 3-dimensional distance and these stars only have noisy distance (from us) measurements. Therefore, he needs to marginalize over the line-of-sight distance when estimating his quantity of interest and he’s finding that the results are somewhat sensitive to his choice of prior for the 3-dimensional distribution of stars. There was some discussion about ideas for understanding this effect and making the best choices, but this conversation will have to also continue offline.
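The marginalization John described is easy to sketch by Monte Carlo (all numbers below are made up, and a real analysis would fold in the 3D density prior that the discussion focused on):

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical numbers (all in pc): the cloud distance and projected
# separation are well measured; the star's distance is noisy
d_cloud = 140.0
s_perp = 2.0
d_star, d_star_err = 145.0, 5.0

# Marginalize over the line-of-sight distance by Monte Carlo; a different
# prior on the 3D distribution of stars would reweight these samples,
# which is exactly where the prior sensitivity enters
d_samples = rng.normal(d_star, d_star_err, size=100_000)
sep_3d = np.hypot(s_perp, d_samples - d_cloud)

lo, med, hi = np.percentile(sep_3d, [16, 50, 84])
```

Note that the 3D separation is bounded below by the projected separation, so even perfect priors cannot shrink the inferred distance past s_perp.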

Adrian Price-Whelan and Christina Hedges updated the group on the project that they shared two weeks ago, where they had discovered a new young, nearby moving group. Since that meeting they have expanded their sample to include fainter and brighter stars that don’t have radial velocity measurements in the Gaia catalog. To determine membership probabilities, Adrian and Christina have implemented a probabilistic model that marginalizes over radial velocity when it is not measured. The resulting catalog has some contamination that can be removed using the RUWE metric from the Gaia pipeline. There are several bright (apparent magnitude of ~3) stars included in the cleaned sample and many fainter stars. The age estimates from isochrones, lithium abundance, and high-resolution spectroscopy all consistently suggest an age of about 200 Myr, but the new sample includes a non-trivial number of white dwarfs, which is a bit puzzling at that age. The suggestion was made that rotation periods might provide another good age estimate.
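Under a Gaussian velocity model, marginalizing over an unmeasured radial velocity is analytic: the marginal of a multivariate normal is obtained by dropping the corresponding row and column. A toy sketch (made-up numbers, not Adrian and Christina's actual model):

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical group velocity distribution in (v_a, v_d, v_rad) [km/s]
mu = np.array([-5.0, 10.0, 3.0])
cov = np.diag([2.0, 2.0, 2.0]) ** 2

full = multivariate_normal(mu, cov)

# A star with all three velocity components measured
like_full = full.pdf([-4.0, 11.0, 2.5])

# A star with no RV: marginalize analytically by dropping the RV row/column
marg = multivariate_normal(mu[:2], cov[:2, :2])
like_no_rv = marg.pdf([-4.0, 11.0])
```

The same member likelihood can therefore be evaluated consistently for stars with and without Gaia radial velocities, which is what lets the fainter and brighter stars enter the sample.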

Finally, Sam Grunblatt asked for suggestions because he is running out of hard drive space for a project where he is currently generating millions of data files (including figures). There was some discussion of the relative merits of binary vs. ASCII formats for data products (use binary!), but Sam’s main issue is the huge number of plots that he’s saving. He generates these figures to make it faster and easier to page through and vet the results of a planet search using TESS. But, going forward he would like to automate this further so the recommendation was made that he avoid saving these figures for every target, restricting instead to the most “interesting” ones.

Summary, July 24, 2020


This week’s group meeting was a bit quieter than usual, but we still had some useful discussions.

Rodrigo Luger started us off with a discussion of a Gaussian Process kernel for light curves that he has derived using starry, parameterized by interpretable physical properties of the target star’s surface: rotation period, inclination, active latitude, spot size, and spot contrast ratio. This is exciting because, while Gaussian Processes are often used to infer the properties of stars from their light curves, the models currently in use are all phenomenological. Rodrigo has created an interactive widget (see the header image) to demonstrate, qualitatively, how the parameters of the GP model relate to the implied surface map. There was some discussion (and disagreement) about whether or not a physically motivated model like this could improve current planet detection methods (using light curves or radial velocities), either by increasing the sensitivity of these searches or by improving the reliability of the results.

Next, David W. Hogg updated us on an ongoing project with Megan Bedell and Lily Zhao (Yale) to improve the wavelength calibration of extreme precision radial velocity spectrographs (in particular, EXPRES). They have developed a data-driven method to identify a low-dimensional representation of the wavelength solution residuals that can be used to substantially improve the (already outrageously good) wavelength precision, but now they are looking to interpret this low-dimensional space (a risky endeavour, I would say). In particular, they have found that one factor that seems to correlate with their top principal component is the “vertical position” of the spectrum on the detector. However, there are some inconsistencies between how this effect manifests in the science spectrum and the laser frequency comb spectrum that they have not yet figured out. There was some discussion about whether or not this could be caused by different light paths, but Hogg is skeptical that that is the cause.

Finally, Christina Hedges led a discussion about tips and tricks that we have all learned from working remotely. Many members of the group lamented the lack of “random encounters” that happen when we’re all working in the same building, but there were some useful workarounds that people have found. For example, some smaller sub-groups have been doing regular “stand-up” meetings to help accountability and to help keep track of what other group members are working on. This solution, however, does not generally include the same diversity of voices that random encounters in the halls of Flatiron would. There was a discussion about the possibility of having an open video chat that stays open all day, giving people the opportunity to chat and ask questions of whatever group is there at some particular time.

Summary, July 17, 2020


Adrian Price-Whelan and Christina Hedges started off this week’s meeting with an update about the new nearby moving group that they found when looking into the pair of transiting planet systems that they discussed three weeks ago. The header figure of this post shows a color-magnitude diagram of the likely moving group members that they identified using Gaia radial velocities and proper motions. Christina and Adrian have compiled compelling evidence that these stars are members of a previously unknown, youngish (100 Myr), nearby (~40 pc) moving group. One feature that you might notice in the above figure is that the main sequence seems to be truncated at the bright end. Adrian and Christina hypothesize that this is caused by Gaia DR2’s bright limit for radial velocities. There was some discussion of the fact that this means that some bright (naked eye?) stars might be members of this group.

Next, David W. Hogg asked the group for feedback on a possible undergraduate project idea where the student would look at the TESS light curves for the binary systems discovered by Price-Whelan et al. There was some discussion about what features would be expected in these light curves (eclipses, ellipsoidal variations, rotation signatures, etc.). The group had various ideas of interesting things to study, such as limb-darkening profiles as a function of effective temperature and spun-up primary stars in close binaries.

Then, Ruth Angus showed some interactive figures that she made using Bokeh based on results from Lucy Lu (Columbia & AMNH) who has been developing a method to estimate stellar ages using their Galactic kinematics. The technique currently depends on binning stars in color, magnitude, and rotation period space. Ruth and Lucy have been investigating how their results depend on the choices made in this binning procedure. The group suggested that perhaps this could be recast as a huge hierarchical model where the velocity distribution (instead of the velocity dispersion) is represented as a non-parametric function of these same parameters, but this would still require choices about representation that might not be easier.

Finally, I shared a data analysis challenge that Trevor David brought to me. He is fitting astrometric observations of a binary star system with an orbital period that could be on the order of 100 years. Needless to say, the phase coverage of the observations is not great (although they do have a longer than 20 year baseline). Trevor has been using my exoplanet code and following the astrometry tutorial, but is finding some painful degeneracies between parameters (especially tricky is the phase of periastron passage). There was a discussion of possible reparameterization choices, including using a series expansion of the orbit as a function of time to choose reasonable parameters. It was also suggested that we could try a rejection sampling method like the one developed by Sarah Blunt for the orbitize code, but I think it will be possible to come up with a more efficient scheme in this particular case.

Summary, July 10, 2020


This week’s meeting started with an update from Rodrigo Luger about his work to map the surfaces of stars using phase curves and occultations. In particular, he has been working on techniques for increasing the efficiency of his methodology when the nullspace of the problem is large or, in other words, when very little information is actually available in the light curve. He has discovered a technique that is similar to previous work by Emily Rauscher where it is possible to re-cast the inference problem into the (much lower dimensional) subspace where the data are constraining, thus drastically reducing the computational cost. However, it seems like this improvement no longer applies if we want to actually generate the distribution over surface maps that are consistent with the data. It was suggested that it might be possible to generate these maps as a post-processing step, after fitting for all the other parameters of the system.
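The general trick can be sketched on a toy linear model (this is not Rodrigo's actual starry implementation): an SVD identifies the directions in coefficient space that the data constrain, and the fit happens only in that small subspace:

```python
import numpy as np

rng = np.random.default_rng(3)
# Toy linear problem y = A @ m + noise where A has a huge null space:
# 100 data points, 400 map coefficients, but only rank 20
A = rng.normal(size=(100, 20)) @ rng.normal(size=(20, 400))
m_true = rng.normal(size=400)
y = A @ m_true + 1e-6 * rng.normal(size=100)

# The SVD identifies the (low-dimensional) subspace the data constrain
U, s, Vt = np.linalg.svd(A, full_matrices=False)
keep = s > s.max() * 1e-8

# Solve the tiny reduced problem, then map back to coefficient space;
# the null-space component of m_true is simply unrecoverable
m_fit = Vt[keep].T @ ((U[:, keep].T @ y) / s[keep])
resid = np.max(np.abs(A @ m_fit - y))
```

The fit reproduces the data essentially perfectly even though the recovered coefficients differ from the truth, which is exactly the point: the data never constrained the null-space component in the first place.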

Next, Emily Cunningham updated the group on her project to disentangle the accretion history of the Milky Way stellar halo by modeling the abundance distribution of halo stars using templates built from observations of dwarf galaxies. She has preliminary results looking at a set of Milky Way-like simulations which provide an idealized test for the methodology. The header figure shows examples of some of the templates that she is using. However, since Emily’s current model represents the present-day abundance distribution as a simple linear combination of these templates, the interpretation of the results can be a little tricky. In particular, she is interested in determining the number and distribution (in, say, stellar mass and age) of the accretion events that assembled the halo. There was some discussion about the possibility of applying some sort of non-parametric mixture-of-templates model to make this measurement.
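Determining the relative contributions in a model like this is a non-negative unmixing problem. A minimal sketch with fabricated templates (scipy's nnls stands in for whatever Emily actually uses):

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(7)
# Hypothetical "templates": normalized abundance histograms for 4 classes
# of accreted dwarf galaxies (4 templates, 30 abundance bins)
templates = rng.random((4, 30))
templates /= templates.sum(axis=1, keepdims=True)

# The observed halo distribution is a non-negative combination of them
true_w = np.array([5.0, 0.0, 2.0, 1.0])
halo = true_w @ templates

# Recover the mixing weights with non-negative least squares
w, resid = nnls(templates.T, halo)
```

With noiseless data and linearly independent templates the weights come back exactly, including the zero for the class that contributed nothing; the interpretive difficulty Emily described arises when templates overlap and the data are noisy.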

Then Kate Storey-Fisher shared some results from her project to detect anomalous galaxies in survey images using deep learning. Kate previously blogged about this project so you can check that post out for more details. Since that post, Kate has been developing an auto-encoder, trained on the residuals between the data and the GAN predictions for each galaxy, that she can then use to determine how “anomalous” each galaxy really is. In this context, the auto-encoder is acting as a dimensionality reduction that identifies features that seem to do a good job of ordering the galaxy images in terms of how typical they are. There was some discussion about the possibility of skipping the GAN step in the process and directly applying the auto-encoder to the observations, but it’s possible that this wouldn’t perform as well since the GAN does a good job of building a distribution over the population of expected galaxy images.

Next, Trevor David shared three versions of a plot that he is making for a paper. He has taken high-resolution, time series spectroscopy of a transiting exoplanet (or, more specifically, its host star) and identified some excess variability in the red wing of the H-alpha line, in the out-of-transit data. This variance is close to a narrow telluric feature and the significance is low, but high enough to be interesting, so it is a bit tricky to demonstrate exactly what is being observed. The group had many opinions about the choices made in these plots and gave suggestions for improving the clarity.

Finally, we finished the meeting with a discussion, led by John Forbes, about the merits and costs of packaging and documenting code from a research project that is niche but potentially useful to other researchers. The argument was made that the most impactful scientific software will probably be either simple (very good at exactly one thing) or extremely flexible (with all the possible knobs). Author’s note: I would argue that perhaps there’s one other mode: a good “plugin” interface so that users can implement the knobs they need. But perhaps that falls into the second category anyways… In contrast, there was also a discussion of the fact that sometimes it’s not possible to predict all the potential use cases of your code, and there are some examples where research code has been successfully repurposed for related, unexpected applications. We did not come to any real conclusion about whether or not packaging and documenting all research code is worth it (I think it is!), but it was some good food for thought.

Summary, June 26, 2020


David W. Hogg started out this week’s group meeting by sharing a project that he has been working on with Anna-Christina Eilers (MIT) to understand the systematic effects in stellar abundance measurements made by the APOGEE survey. They start with the (not totally justifiable) assumption that, while abundance will be a function of birth location, it should not be a function of evolutionary state. While this is observed for some elements (iron, for example), there are clear trends of calcium abundance with surface gravity in the APOGEE sample. This could be the result of systematics, or of the fact that the observed abundance patterns for evolved stars will not be the same as their birth abundances. David W. and Anna-Christina are interested in deriving a non-parametric method for determining birth abundances from the current observed abundances and an estimate of the Galactocentric positions.

Next, Christina Hedges (NASA Ames) and Adrian Price-Whelan shared an interesting TESS discovery: a pair of nearby, young transiting planetary systems whose hosts were previously shown to be in a widely-separated co-moving pair. One of these targets hosts three transiting planets and is already interesting in its own right, but the fact that it is in a co-moving pair with another transiting planet host makes this a potentially very important benchmark system. Christina and Adrian also have some preliminary evidence that these two stars might be members of a larger (previously unidentified) association. We discussed the fact that, especially if their masses are accessible by radial velocity, these planets could be very useful for comparative studies of planet formation and evolution.

Rodrigo Luger shared some work that he has been doing (also in collaboration with Christina Hedges) to measure stellar differential rotation and star spot evolution timescales using precise photometric time series. Using an extension of his starry method, Rodrigo has developed a computationally tractable method for inferring the statistics of the time-variable surface map of a star based on its rotational phase curve in integrated light. This measurement is extremely degenerate, partly because of physics (spot evolution and differential rotation can produce very similar light curves) and partly because of math (there are some parts of the system, called the null space, that just can’t ever be measured). But, Rodrigo has some very impressive results that demonstrate that he can recover robust inferences for the differential rotation shear and spot evolution lifetimes using a non-parametric model.

Kate Storey-Fisher shared some figures from a paper that she is working on where she has developed a novel estimator for the correlation function of galaxies. Traditionally these measurements are made by binning the pair separations on different angular scales, but Kate has developed a class of methods that don’t require binning. Instead, the correlation function is described by a set of smooth basis functions and the estimation procedure involves measuring the weights of these functions based on the data. Kate has demonstrated (using simulations) that this method produces lower variance and less biased estimates of the correlation function than the standard method, even when the model is much less flexible. One of these figures is shown at the top of this post and there was some discussion about color choices (what do you think?) and how to best visually compare the different methods in a figure like this. In particular, the suggestion was made that perhaps this figure would be clearer as two panels rather than one.

Finally, I complained about coding in C++ and how hard I find some things that I would easily do in Python (like iterate over a tuple of arguments for an arbitrary function, don’t ask). The only useful thing that I shared was a link that I had discovered (only like 5 years late) when making my slide: a tool that makes cute screenshots of code like this.

Summary, June 5, 2020

In today’s meeting the group had a discussion about racism (in particular anti-Black racism) and our role as individuals and as a group in perpetuating it. This was an uncomfortable conversation for some group members and many of the more vocal members of the group (myself very much included) are not yet well-equipped to lead such a discussion. We spent some time sharing recommended reading lists, online educational resources, and brainstorming ideas for how to use this group to learn. In particular, we have committed to continuing these discussions in the future, recognizing that combating racism is an ongoing process. We discussed the possibility of a regularly-meeting reading group to learn about racism within astronomy and more broadly, but the specific structure of this group is still to be determined. There was some disagreement about the scope of this discussion and whether or not it is possible to focus on racism within academia.

As is often the case in such an environment, we were much more comfortable discussing institutional racism and brainstorming things we can do as a group to address this within our institution. While there was unsurprisingly nothing groundbreaking uncovered in this discussion, we identified several immediate action items (mostly in the context of hiring) and recognized that this is something that we also need to continue to learn about.

Finally, we shared links to the #strike4blacklives and #ShutDownSTEM initiatives that are organizing a virtual walkout next week where non-Black academics will take (at least) one day to educate themselves and take action towards eliminating racism. More information about these strikes, as well as recommended actions to take, can be found at the above links.

Summary, May 29, 2020


David W. Hogg started this week off by leading a discussion about how one might define a “discovery” in astronomy with a specific focus on the context of exoplanet detections in radial velocity surveys. This discussion was inspired by work he has been doing with Megan Bedell to make observing strategy recommendations for the Terra Hunting Experiment, an upcoming decade+ long survey to detect extrasolar Earth-analogs using extreme-precision radial velocities. One of the core questions is whether or not it is sufficient to define a detection as a precise, non-zero, measurement of the radial velocity amplitude. It might also be necessary to include a statement about the precision with which the other orbital parameters are measured. For example, a radial velocity “trend” might not qualify as a detection even if the lower limit on the amplitude is significantly non-zero. This led to discussions about whether or not marginalization is required for quantifying these detections at the limits of the survey detection threshold.

Next, Emily Cunningham shared a new project where she is using the Latte suite of Milky Way-like galaxy simulations to develop data-driven methods for understanding the assembly history of the Milky Way stellar halo. Following on earlier work by Duane Lee, Emily is working to construct a library of “template” distributions in the space of stellar abundances. These templates will be constructed by looking at accreted dwarf galaxies in the simulations and then Emily plans to model the mixed population using a linear combination of these templates to determine the relative contribution of different classes of accretion events to the assembly of the halo. After testing these methods on the simulations, Emily plans to determine which changes will be required to apply these methods to the observed stellar population even in the face of selection effects and realistic uncertainties. There was some discussion of how this methodology is similar to a low-rank matrix factorization problem and how this relationship could be used to develop a data-driven method for constructing the templates without sacrificing interpretability.

Then, Jiayin (DJ) Dong shared a project where she has measured the eccentricity distribution of warm giant planets detected using data from NASA’s TESS mission. After detecting and performing a detailed characterization of the transit light curves for a sample of about 80 warm giant planets, DJ is using “the photoeccentric effect” in a hierarchical Bayesian model to make robust inferences about the eccentricity distribution. This model is high-dimensional and the geometry of the problem can be pathological in some cases, making robust inferences tricky, but DJ has figured out these technical issues and has a working implementation in PyMC3. It is already clear that the eccentricity distribution of these exoplanets is multimodal (as seen in the header figure). The final inferences that DJ makes about this distribution will have a profound influence on our understanding of the evolution of planetary systems and she will be able to place constraints on the relative frequency of eccentric migration or in-situ formation channels.

Finally, I (Dan Foreman-Mackey) showed some plots that I have generated as part of a project in collaboration with Mariona Badenas Agusti (MIT) to revisit an interesting eclipsing binary system observed by Kepler and now TESS, using my exoplanet code. There are many technical aspects of this system that make it hard to model (there’s a lot of data and many parameters), but the physics is also non-trivial (the orbit is precessing, the stellar oscillations are coherent and measured at extremely high precision, etc.). I’m excited to get exoplanet working for this system, but I’m still struggling to get robust inferences in finite time.

Summary, May 22, 2020


This week we had a smaller group meeting than usual (long weekend, summer weather, apocalypse, etc.), but we still had some interesting conversations!

Rodrigo Luger discussed some work he’s been doing to efficiently fit the light curves of differentially rotating stars (see the header figure). This was inspired by his collaboration with Fran Bartolić to map the time variable surface of Io using occultation light curves as discussed last week. In the limit of long lived spots, Rodrigo has demonstrated that many of the degeneracies commonly found when mapping stellar surfaces using phase curves can be broken. But, he’s now looking for good test cases (including this sample) and he will look at the issues introduced by variable spot lifetimes.

Next, John Forbes updated the group on a project that he previously blogged about where he (in collaboration with Nia Imara, Harvard) is quantifying the density structure in turbulent box simulations. The key ingredient of their method is that the eigenvalues (or ratios thereof) of the local Hessian matrix of the density (or log density) seem to separate the structures in these simulations along axes corresponding to the qualitative descriptors that are generally used. Now they’re working to get a better understanding of how to interpret these quantities by visualizing the Hessian matrix at various points in this parameter space. These are hard to visualize since the data live in three dimensions so they have been making 3D matplotlib animations of the stencil used to estimate the Hessian, but they remain hard to interpret. There was some discussion about alternative visualization techniques and whether or not it even makes sense to try to qualitatively interpret these features.

David W. Hogg shared some work he has been doing with Soledad Villar (NYU) to understand the relative performance of discriminative and generative models for prediction tasks in several toy scenarios. They find that generative models marginally outperform discriminative models in many of the test cases they try. In particular, they find—in agreement with previous results in the literature—that discriminative models have catastrophically poor performance when the number of training samples is approximately equal to the dimension of the data, even when the model performs well for both fewer and more training points.

Finally, Sam Grunblatt led a discussion about the definition of the Galactic stellar halo. He asked about the relative importance of kinematics, chemistry, and current coordinates. There was broad agreement that chemical and dynamical arguments are both important, but the discussion predictably devolved into the assertion that the Milky Way is not multi-modal in any dimensions and that classifications should be continuous rather than discrete.

Summary, May 15, 2020


Given the current state of the world, we have been holding the Astro Data Group meeting virtually for the past couple of months and this blog has taken a hit over that time. This week, I (Dan) have decided to try to start posting a short summary of what we talk about each week so we’ll see how long I keep it up :)!

Fran Bartolić started us off with a discussion of how he is using non-negative matrix factorization to map the time variable surface of Io using occultation light curves. In particular, he’s modeling the time variable surface as a linear combination of starry maps where the coefficients vary as a function of time. The results of this inference are prior dependent so he presented some of the prior information that he is considering: non-negativity constraints, sparsity priors, and Gaussian processes to capture smooth time variability and volcano duty cycles. There was some discussion about including spatial priors that would favor contiguous features in each map (for example, individual volcanos).
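For intuition, here is non-negative matrix factorization on a toy non-negative dataset (scikit-learn's NMF stands in for Fran's actual starry-based setup, and the "basis curves" are fabricated):

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
# Synthetic stand-in: each epoch's light curve is a non-negative
# combination of 3 fixed basis curves (playing the role of starry maps)
basis = rng.random((3, 50))      # 3 basis curves, 50 phase points
weights = rng.random((40, 3))    # 40 epochs of time-varying coefficients
X = weights @ basis

model = NMF(n_components=3, init="nndsvda", max_iter=2000, random_state=0)
W = model.fit_transform(X)   # recovered time-varying coefficients, all >= 0
H = model.components_        # recovered basis curves, all >= 0
rel_err = np.linalg.norm(X - W @ H) / np.linalg.norm(X)
```

The non-negativity of both factors is what makes the decomposition physically plausible for surface maps; the priors Fran discussed (sparsity, GP smoothness in time) would be layered on top of this basic structure.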

Trevor David has been investigating long run times and divergences in his PyMC3 inferences. To handle difficult posterior inferences, he has been increasing the target acceptance statistic in the tuning phase of his PyMC3 runs, but still ends up with some small percentage of diverging samples and long run times. Some of these issues can be handled by re-parameterization, but Trevor has been looking at various diagnostics to try and understand the behavior of his sampler. One such diagnostic is the comparison of the empirical marginal energy distribution and the conditional energy distribution (see figure 23 of Betancourt 2017 for an example), and Trevor finds a small skew in these distributions (with heavier tails to larger energies).
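The scalar version of this diagnostic is the energy Bayesian fraction of missing information (E-BFMI; Betancourt 2017), which compares the energy changes between successive draws to the marginal energy variance. A rough numpy sketch on synthetic energy traces (not Trevor's actual chains):

```python
import numpy as np

def e_bfmi(energy):
    # E-BFMI: mean squared energy change between successive draws divided
    # by the marginal energy variance; values well below ~1 signal that the
    # momentum resampling explores the energy distribution too slowly
    de = np.diff(energy)
    return np.sum(de ** 2) / np.sum((energy - np.mean(energy)) ** 2)

rng = np.random.default_rng(1)
good = rng.normal(size=5000)               # uncorrelated energies: ~2
slow = np.cumsum(rng.normal(size=5000))    # strongly autocorrelated: ~0
```

A heavier tail in the marginal energy distribution than in the transition distribution, like Trevor is seeing, shows up as a depressed E-BFMI value.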

On a related topic, I presented some work I have been doing to improve the tuning performance for PyMC3 when there is a large dynamic range in posterior precision across dimensions. The idea is fairly simple—set the target acceptance statistic to something low (like 50%) at the beginning of the tuning phase and increase this target to the final value throughout tuning—but it seems to work well for at least some test cases. This is currently implemented in exoplanet, but more testing will be needed before this is ready for prime time.
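The schedule idea can be sketched as a simple ramp (a hypothetical helper for illustration, not the actual exoplanet implementation):

```python
import numpy as np

def target_accept_schedule(step, n_tune, start=0.5, final=0.9):
    # Loose acceptance target early in tuning (large steps, so the mass
    # matrix adapts to the posterior scales quickly), ramping linearly to
    # the strict final target by the end of the tuning phase
    frac = np.clip(step / max(n_tune - 1, 1), 0.0, 1.0)
    return start + (final - start) * frac
```

Each tuning window would then point the step-size adaptation at target_accept_schedule(i, n_tune) instead of a fixed value.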

Dreia Carrillo shared the figure at the top of this post and asked for feedback about its clarity and, in particular, the binning choices made in the bottom histogram. This figure shows stellar metallicity gradients as a function of birth radius in a Milky Way-like galaxy simulation. The main panel shows how these gradients change as a function of age, but doesn’t capture underlying density of stars so she included the histograms to show this distribution, and specifically to highlight the signatures of inside-out disk formation. The group discussed options for making the birth radius histogram plot clearer. These suggestions included using a kernel density estimate instead of a histogram and using fewer, wider age bins.
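For reference, the kernel density estimate suggestion is nearly a one-liner with scipy (toy birth-radius draws here, not Dreia's simulation data):

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(11)
birth_radius = rng.gamma(shape=3.0, scale=2.5, size=5000)  # toy values, kpc

kde = gaussian_kde(birth_radius)      # bandwidth chosen by Scott's rule
grid = np.linspace(0.0, 30.0, 300)
density = kde(grid)                   # smooth alternative to the histogram
```

The trade-off is the usual one: the KDE removes the arbitrary bin edges but introduces a bandwidth choice instead, which can oversmooth sharp features like the inside-out formation signature.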

To finish, Emily Cunningham led a discussion of how members of the group manage their time when working on multiple parallel projects and how to identify when we are over-committed to too many projects. The group agreed that this is something we all struggle with, especially since we are currently all working in isolation. Group members have devised some tactics for dealing with this, including setting goals on week-long time scales instead of daily or by running work by collaborators more frequently. We also discussed the fact that side projects and context switching can be useful (especially these days) to help getting ourselves out of a research rut or to identify our next projects, even if they don’t always lead to visible results.

Ambiguous periods


A common problem with algorithmic determination of rotation periods from light curves is that the “best” period may actually be the undertone/subharmonic or harmonic of the true period (in other words, twice or half the true period). Two real examples from the Kepler mission are shown here.

In the first row of each figure, the Kepler light curve of a star is phase-folded on a period either from the literature or determined using a Lomb-Scargle (L-S) periodogram. In the second and third rows, the light curve has been phase-folded on half and twice the period in the first row, respectively. The bottom two panels show the light curve from a single Kepler quarter and the periodogram itself.

With KOI 49, the various estimates for the rotation period are in agreement, and it seems clear that those estimates are correct. Why? We see two peaks (and troughs) in the phased light curve of uneven height. The simplest explanation for this behavior is a fairly stable arrangement of two dominant spot groupings of different size and/or contrast on the surface. Reality is definitely more complicated, but the light curves of many stars can be approximated with a two-spot model.

For KOI 377, the literature period estimates (first four columns) are in agreement, while the L-S method (last column) prefers approximately half of that period. It’s worth noting here that the L-S method assumes a sinusoidal model for the data and has indeed selected the period which best approximates the data as a sinusoid. What’s going on?

Imagine this pathological case: a star with two spots equal in size, contrast, and latitude but separated by 180 degrees in longitude, spinning on an axis perpendicular to our line-of-sight. As this star rotates, an observer would record a light curve showing two peaks and two troughs of exactly equal height/duration, and would not be blamed for reporting the rotation period as half the true period.

What is the correct period for KOI 377? From the plot it’s not immediately clear. However, other clues exist in the light curve. Compared with KOI 49, the light curve is less “stable” and shows variability with roughly half the amplitude. From ensemble studies of Kepler light curves it’s known that rotation period is anti-correlated with variability amplitude, and differential rotation (which traces “stability”) correlates with stellar mass, which itself correlates with rotation period. So, one path towards determining more reliable rotation periods may rely on all available information, such as the variability amplitude and stellar parameters. In fact, group members Lucy Lu and Ruth Angus are working on just this problem with Astraea.