Summary, August 14, 2020

I started this week’s meeting by asking for feedback about our plans for the event that I am co-organizing in September. This event is the 2020 version (hopefully this is just a year, and not an indication that it will go up in flames) of the series of workshops where astronomers get together to collaborate on new projects related to NASA’s TESS mission. The previous 3 events have all been organized in-person and, since this event will be fully remote and international, there are various challenges (and opportunities) that we haven’t experienced before. One difference between this event and many other online conferences is that there will be no formal presentations. Instead, participants are encouraged to start new collaborations and try to learn something new. The group had a lot of good questions and feedback about our current plans. In particular, there was a good discussion about how important it will be to manage expectations, especially when people have experience with in-person events where they can walk across the room and ask questions of any other participant in real time. There were also some suggestions about how to encourage interactions between participants even if they are not directly collaborating. We’ll see how well we can execute on these plans!

Next, John Forbes discussed a data management/coding challenge that he’s currently facing. He has a large simulation where the properties of a set of about a million “particles” are saved to disk at a series of “times”. The output of this simulation can be thought of as a large multidimensional array where the first dimension is time, the second dimension is particle number, and the third indexes the various properties that are being tracked. This full dataset is too large to be loaded into memory and John wants to run an analysis on a subset of the most interesting particles at all times. However, since the data are stored as snapshots, loading the properties of a particular particle at all times is a computationally expensive task. Besides a few snarky comments about serialization formats, the group had several suggestions for how to improve this performance of this operation. The main observation was that the data are essentially stored in row-major order, while the operations would be more efficient on column-major ordered data. The group suggested several tools that could help here, including dask, Parquet, and Arrow.

And that was it… it is August, after all!