In 2020, my overall goal was still to read 50 books, and I met that goal by reading 65. Once again, although I don’t spend much time purposely diversifying my reading, most of what I read was fiction (51 / 65), by Asians (29 / 65), and by women (57 / 65).

Here are the full comparisons between 2019 and 2020 overall:

And broken down by fiction vs non-fiction:

The 65 books were written by 54 unique authors. Of those authors, 41 were authors I had not read before.

First, let’s review my 2020 reading resolutions and see how I did.

Success! I increased my non-fiction reading in both absolute and relative terms.

Mixed! I doubled the number of books I read by Black authors, but the number of books by Hispanic and Indigenous authors stayed the same. Most of the increase came from books by white authors.

Fail! 29 / 65 books were written in 2020, and another 15 were written in 2019. Like last year, I only read 2 books written before 2000. This one is pretty difficult for me because there are so many good new books coming out.

Yes, I’m being lazy and just re-using the ones I didn’t do great on in 2020.

- *The Burning God*, by Rebecca F. Kuang: The highlight of my reading year.
- *The Paper Menagerie*, by Ken Liu: A masterful short-story collection, as anticipated.
- *The City We Became*, by N. K. Jemisin (released in March): Sharp and hilarious. A very steep departure in tone from Jemisin’s earlier books, but nevertheless very good.
- *The Empire of Gold*, by S. A. Chakraborty: A brilliant conclusion to the Daevabad trilogy.
- *A Desolation Called Peace*, by Arkady Martine: This one is now slated to come out in 2021.
- *Big Sister, Little Sister, Red Sister*, by Jung Chang: I managed to finish this crash-course in modern Chinese history, which helped a lot in understanding the historical background for the *Poppy War* series.
- *Tears We Cannot Stop: A Sermon to White America*, by Michael Eric Dyson: I don’t have much memory of this, but I did read it!

The searing conclusion to *The Poppy War* and *The Dragon Republic* was absolutely the highlight of my reading year. Nobody weaves brilliant storytelling, flawed yet compelling characters, and pain together as well as Kuang. I’ve written more about how much this series means to me here, and I helped write a post on the historical and cultural background behind the series.

Kobes Du Mez weaves many threads together to show modern white evangelicalism’s roots in preserving a certain vision of white manhood and patriarchy against 20th-century gains by women and racial minorities. Along with Katherine Stewart’s *The Power Worshippers*, it clarified for me how white evangelicalism arose as a reaction against the Civil Rights Movement and feminism, and continues now to react against the LGBTQ+ rights movement and Black Lives Matter. It also showed how evangelical social conservatives in the United States united with economic libertarians to create the American Right, the current iteration of the Republican Party, and Donald Trump. This one was also personal for me, as I grew up in enough of evangelicalism to recognize many of the figures in *Jesus and John Wayne*.

*The Third Son* tells the story of Saburo as he grows up in an abusive Taiwanese family underneath Japanese and then Nationalist rule and manages to break free by marrying his love, coming to the United States, and earning his PhD. The historical background is personal to me, as my parents also grew up in Taiwan and earned their PhDs in the United States (although they were firmly on the Nationalist side…).

The blurb describes *First Sister* as space opera x Handmaid’s Tale, but that doesn’t really do justice to the book’s originality in conception and execution. I’ve read enough fiction now to often anticipate what’s coming in a novel, but every turn in this one surprised me.

An epic account of a family’s experience of the Vietnam War, as seen through the eyes of a girl and her grandmother. Nguyễn manages to capture the sweep of history along with intimate familial relationships.

I read all four of the books in this series this year. Tahir weaves a dark yet hopeful adventure in a dystopia that loosely resembles the Roman Empire. Like *The Poppy War*, this is another series where not many good things happen to the characters, but it never felt quite as dark and real to me as Kuang’s series.

Hong perfectly encapsulates the experience of being a “model minority” in the United States in a series of honest, emotional, and raw essays about shame and depression, the English language, and poetry. The titular “minor feelings” are “the racialized range of emotions that are negative, dysphoric,” yet not so spectacularly horrible as to be telegenic.

*A Desolation Called Peace* should finally come out in March. Ken Liu’s Dandelion Dynasty trilogy has been expanded to a quartet, and I believe both remaining books are releasing in 2021. I’ve been seeing a lot of hype in Asian book circles for *She Who Became the Sun*, by Shelley Parker-Chan, and we can always use more modern, queer fantasy re-imaginings of ancient Chinese history. *Jade Legacy*, the conclusion to Fonda Lee’s *Jade City* series, is due out in September. *The Second Rebel*, which follows *The First Sister*, comes out in November.

The full list is available here. The code to generate the plots is here.

- Masked Language Modeling for Proteins via Linearly Scalable Long-Context Transformers introduces *Fast attention via orthogonal random features* (FAVOR).
- Linformer: Self-Attention with Linear Complexity introduces *linear self-attention*.

I’m going to summarize the main contribution of each paper, but you should definitely go read them for details such as theorems and theoretical insights. I also write everything using the notation from the FAVOR paper for consistency.

Let $L$ be the size of an input sequence of tokens. Then transformer dot-product attention is a mapping which accepts matrices $\mathbf{Q}$, $\mathbf{K}$, $\mathbf{V} \in \mathbb{R}^{L \times d}$ as input, where $d$ is the hidden dimension. Matrices $\mathbf{Q}$, $\mathbf{K}$, and $\mathbf{V}$ are intermediate representations of the input, and their rows can be interpreted as the queries, keys, and values of a continuous dictionary data structure, respectively. Transformer dot-product attention is defined as

\[\operatorname{Att}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathbf{D}^{-1}\mathbf{AV}\]where the attention matrix is $\mathbf{A} = \operatorname{exp}\left(\frac{\mathbf{QK^T}}{\sqrt{d}}\right)$ and $\mathbf{D} = \operatorname{diag}(\mathbf{A1_L})$ is the normalizing factor. For details, see the original paper or this post from Harvard NLP.
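As a concrete reference, here is a minimal (deliberately naive) PyTorch sketch of this definition; normalizing each row of $\mathbf{A}$ by its row sum is exactly the row-wise softmax, so it matches standard implementations:

```python
import torch

def dot_product_attention(Q, K, V):
    """Naive attention D^{-1} A V, materializing the full (L, L) matrix A."""
    d = Q.shape[-1]
    A = torch.exp(Q @ K.T / d ** 0.5)    # A = exp(QK^T / sqrt(d))
    D = A.sum(dim=1, keepdim=True)       # row sums: diag(A 1_L), kept as a column
    return (A / D) @ V                   # equals softmax(QK^T / sqrt(d)) @ V
```

(For numerical stability, real implementations subtract the row-wise max before exponentiating; this sketch skips that.)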

While dot-product attention has proven very useful, its runtime and memory scale as $O(L^2d + Ld)$ and $O(L^2)$, respectively, because $\mathbf{A}\in \mathbb{R}^{L\times L}$ must be computed and stored explicitly. In practice, applications in natural language processing commonly limit the sequence length to 512 or 1024. This is also a significant constraint when modeling proteins. For example, *Streptococcus pyogenes* CRISPR-Cas9 is 1368 amino acids long.

The attention matrix $\mathbf{A}$ can be decomposed as \(\mathbf{A} = \mathbf{D_Q B D_K}\) with

\[\mathbf{D_T} = \operatorname{diag}\left[ \operatorname{exp}\left(\frac{\|\mathbf{T}_1\|_2^2}{2\sqrt{d}}\right)\ldots \operatorname{exp}\left(\frac{\|\mathbf{T}_L\|_2^2}{2\sqrt{d}}\right)\right] \forall \: \mathbf{T} \in \{\mathbf{Q}, \mathbf{K}\}\]and

\[\mathbf{B} \in \mathbb{R}^{L \times L}, B_{i, j} = \operatorname{exp}\left(-\frac{\|\mathbf{Q}_i - \mathbf{K}_j\|_2^2}{2\sqrt{d}}\right)\]Naively, $\mathbf{D_T}$ requires $O(Ld)$ time to compute while $\mathbf{B}$ requires $O(L^2d)$, arriving at the overall time complexity of $O(L^2d + Ld)$ for dot-product attention.
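This identity follows from expanding $\|\mathbf{Q}_i - \mathbf{K}_j\|^2 = \|\mathbf{Q}_i\|^2 - 2\mathbf{Q}_i \cdot \mathbf{K}_j + \|\mathbf{K}_j\|^2$, and it is easy to check numerically (a sketch; `torch.cdist` computes the pairwise Euclidean distances that appear in $\mathbf{B}$):

```python
import torch

torch.manual_seed(0)
L, d = 6, 4
Q, K = torch.randn(L, d), torch.randn(L, d)
scale = 2 * d ** 0.5                                      # 2 sqrt(d)

A = torch.exp(Q @ K.T / d ** 0.5)                         # attention matrix
D_Q = torch.diag(torch.exp((Q ** 2).sum(dim=1) / scale))
D_K = torch.diag(torch.exp((K ** 2).sum(dim=1) / scale))
B = torch.exp(-torch.cdist(Q, K) ** 2 / scale)            # Gaussian kernel matrix

assert torch.allclose(A, D_Q @ B @ D_K, atol=1e-3)        # A = D_Q B D_K
```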

FAVOR is a fast method for approximating $\mathbf{B}$, which is the Gaussian (squared-exponential) kernel matrix between the rows of $\mathbf{Q}$ and $\mathbf{K}$ with $\sigma = d^{\frac{1}{4}}$. Like most kernels used in machine learning, the Gaussian kernel has a fast random feature approximation. Consider a random mapping $\phi: \mathbb{R}^d \to \mathbb{R}^M$ of the form

\[\phi(\mathbf{x}) = \sqrt{\frac{2}{M}}\operatorname{cos}(\mathbf{Wx} + \mathbf{b})^T\]where $W_{i, j} \sim \mathcal{N}(0, \sigma^2)$ and $b_i \sim \operatorname{Unif}(0, 2\pi)$. Then the kernel can be recovered as an expectation over the random features. (As described in the paper, choosing orthogonal random features instead of sampling them independently decreases the variance of the approximation.)

\[K(\mathbf{x}, \mathbf{y}) = \mathbb{E}\left[\phi(\mathbf{x})^T\phi(\mathbf{y})\right]\]Define randomly-featurized queries and keys as $\mathbf{\hat{Q}} = \sqrt{\frac{2}{M}}\operatorname{cos}(\mathbf{WQ}^T + \mathbf{b})^T$ and $\mathbf{\hat{K}} = \sqrt{\frac{2}{M}}\operatorname{cos}(\mathbf{WK}^T + \mathbf{b})^T$. Combine these with the diagonal matrices: $\mathbf{Q'} = \mathbf{D_Q}\mathbf{\hat{Q}}$ and $\mathbf{K'} = \mathbf{D_K}\mathbf{\hat{K}}$. Therefore, \(\mathbf{A} = \mathbb{E}\left[\mathbf{Q'}\mathbf{K'}^T\right]\)
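A quick Monte Carlo sanity check of the random-feature identity. Note the sampling convention here: to recover a Gaussian kernel with bandwidth $\sigma$, the standard random Fourier feature construction draws the rows of $\mathbf{W}$ with variance $1/\sigma^2$ (the kernel's spectral density); conventions for where $\sigma$ appears differ between papers, so this sketch is explicit about it:

```python
import torch

torch.manual_seed(0)
d, M = 4, 200_000
sigma2 = d ** 0.5                   # sigma^2 = sqrt(d): kernel exp(-||x-y||^2 / (2 sqrt(d)))
x, y = torch.randn(d), torch.randn(d)

# Random Fourier features: rows of W ~ N(0, I / sigma^2), b ~ Unif(0, 2*pi).
W = torch.randn(M, d) / sigma2 ** 0.5
b = 2 * torch.pi * torch.rand(M)
phi = lambda v: (2.0 / M) ** 0.5 * torch.cos(W @ v + b)

exact = torch.exp(-((x - y) ** 2).sum() / (2 * sigma2))
approx = phi(x) @ phi(y)            # Monte Carlo error shrinks as 1/sqrt(M)
```

With $M$ this large, the approximation lands within a few thousandths of the exact kernel value.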

Thus $\mathbf{\hat{A}} = \mathbf{Q'}\mathbf{K'}^T$ is an unbiased estimator of $\mathbf{A}$. However, we would still like to avoid computing and storing the full $L \times L$ attention matrix. We can do this when calculating approximate dot-product attention by being clever in how we group the matrix multiplications:

\[\operatorname{\hat{Att}}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathbf{\hat{D}}^{-1}\mathbf{\hat{A}V} = \mathbf{\hat{D}}^{-1}(\mathbf{Q'}(\mathbf{K'}^T\mathbf{V}))\]with

\[\mathbf{\hat{D}} = \operatorname{diag}(\mathbf{Q'}(\mathbf{K'}^T\mathbf{1}_L))\]This requires $O(LMd)$ time and $O(Md + Ld + ML)$ space.
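This bracketing trick is easy to verify numerically. A minimal sketch, with generic nonnegative random features standing in for the actual $\mathbf{Q'}$ and $\mathbf{K'}$:

```python
import torch

def favor_attention(Qp, Kp, V):
    """Approximate attention from featurized queries/keys Qp, Kp of shape (L, M).

    Grouping the product as Qp @ (Kp.T @ V) never materializes the (L, L) matrix.
    """
    KV = Kp.T @ V                         # K'^T V, shape (M, d): O(LMd)
    D_hat = Qp @ Kp.sum(dim=0)            # \hat{A} 1_L computed as Q'(K'^T 1_L)
    return (Qp @ KV) / D_hat[:, None]     # (L, d)

# Check against the naive route that builds \hat{A} explicitly.
torch.manual_seed(0)
L, M, d = 7, 10, 3
Qp, Kp = torch.rand(L, M) + 0.1, torch.rand(L, M) + 0.1   # nonnegative features
V = torch.randn(L, d)
A_hat = Qp @ Kp.T
naive = (A_hat / A_hat.sum(dim=1, keepdim=True)) @ V
assert torch.allclose(favor_attention(Qp, Kp, V), naive, atol=1e-5)
```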

Instead of approximating the attention matrix $\mathbf{A}$, the second paper directly approximates the result of dot-product attention using *linear attention*. First, they prove that dot-product attention is low-rank, and then propose to replace the $L \times L$ attention matrix with an $L \times k$ approximation $\mathbf{\hat{A}}$ by using a learned weight matrix $\mathbf{E} \in \mathbb{R}^{k \times L}$ to project $\mathbf{K}$ into $\mathbb{R}^{k \times d}$.

Likewise, the values are also projected to $\mathbb{R}^{k \times d}$, and

\[\operatorname{\hat{Att}}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathbf{\hat{D}}^{-1}\mathbf{\hat{A}FV}\]with $\mathbf{F} \in \mathbb{R}^{k \times L}$ and

\[\mathbf{\hat{D}} = \operatorname{diag}(\mathbf{\hat{A}}\mathbf{1}_k)\]This requires $O(Lk)$ time. In practice, the authors find that $k \geq 256$ seems to perform well.
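A minimal PyTorch sketch of linear self-attention in the same notation (here $\mathbf{E}$ and $\mathbf{F}$ are just passed in as fixed matrices; in the paper they are learned parameters):

```python
import torch

def linear_self_attention(Q, K, V, E, F):
    """Sketch of Linformer-style attention. E, F: (k, L) projections for keys and values."""
    d = Q.shape[-1]
    A_hat = torch.exp(Q @ (E @ K).T / d ** 0.5)   # (L, k) instead of (L, L)
    D = A_hat.sum(dim=1, keepdim=True)            # normalizer diag(\hat{A} 1_k)
    return (A_hat / D) @ (F @ V)                  # (L, d)

torch.manual_seed(0)
L, k, d = 8, 4, 3
Q, K, V = torch.randn(L, d), torch.randn(L, d), torch.randn(L, d)
E, F = torch.randn(k, L), torch.randn(k, L)
out = linear_self_attention(Q, K, V, E, F)
```

Because $\mathbf{E}$ and $\mathbf{F}$ have shape $(k, L)$, this sketch makes the sequence-length dependence mentioned below concrete: a different $L$ needs different projection matrices.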

Some random thoughts on these papers.

- While both of these approximations perform comparably well to dot-product attention in the experiments presented in the papers, those experiments seem pretty perfunctory to me. I’d be more impressed if they demonstrated for a real problem that using one of these approximations opens up new possibilities. Concatenating random proteins together to length 8096 isn’t a real problem!
- The FAVOR paper presents a general kernel framework for attention and investigates the performance of some simple kernels in their experiments. It’d be fun to try to design attention kernels for specific problem domains.
- FAVOR makes a big deal about how they have an unbiased estimator for $\mathbf{A}$, but I’m pretty sure their $\operatorname{\hat{Att}}$ is not an unbiased estimator of dot-product attention.
- The linear projections in linear self-attention ($\mathbf{E}$ and $\mathbf{F}$) are dependent on the sequence length, which would make dealing with variable-length inputs messy.

First, I really enjoyed my time at Generate. I got to work on interesting scientific problems with very smart, nice people. They paid half my way to AISTATS and all my way to NeurIPS. I especially want to credit Andrew Beam for essentially giving me a postdoc’s worth of learning how to be a machine learning scientist. However, as I thought more about the kind of work I want to do and how I want to be evaluated for career advancement, I realized that there was a mismatch between what I wanted and what Generate needed.

As I realized that, my friend Lester Mackey at MSR told me I should apply for their new Comp Bio posting, which stated in part:

Microsoft Research offers an exhilarating and supportive environment for cutting-edge, multidisciplinary research, both theoretical and applied, with access to an extraordinary diversity of big and small data sources, an open publications policy, and close links to top academic institutions. We seek applicants with the passion and ability to craft and pursue an independent research program, including a strong publication record at top research venues.

As it turns out, that description aligns very well with what I want to do!

Through my time teaching, in my PhD, and at Generate, I’ve distilled what I want from my work into two parts:

I’d like intellectual ownership of a research agenda:

- To work on problems that have the potential to improve society
- In fields I think are interesting (currently machine learning, biology, and to some extent social science)
- And to be able to take my work with me and talk about it if I change employers.

In my PhD and at Generate, I focused on applying machine learning to proteins. I’m still interested in that, but I’d also like to broaden my research scope.

Conversely, I am not interested in developing products, and I’ve been telling people for years that one of my career goals is to not think about business questions. Generate, and probably other Flagship startups, are great places to work if you want to build a biotech company as a scientist. They’re generally well-run and scientifically sound, and you get to be part of a team effort to solve hard problems. However, one limitation of working in biotech, especially at a startup, is that the need to make money in the medium term strongly constrains what you can work on, as does the underlying philosophy that `creates value for investors` ↔ `creates value for society`. The downstream effect of this is that scientists (and everybody else) are not rewarded directly for what they discover but for how they are helping the company create value. To progress in their career, even individual contributors have to map their scientific efforts to business questions.

Dissemination and being part of a global research community is an essential part of science. This includes:

- Publications
- Talks
- Teaching
- Conferences
- Other writing
- Mentoring other scientists

I want to be somewhere where these essential scientific activities are easy to do and rewarded. I don’t want to ask lawyers or a board for permission months in advance to publish or give a talk. However, biotech companies create investor value in part by obtaining proprietary data and inventing methods and protecting it all as trade secrets or with patents. Publishing or otherwise communicating your best models and data is obviously antithetical to this way of creating value. Therefore, even when/if scientists at biotech companies are allowed to publish, give talks, teach, or review papers, it’s not how they’re evaluated for career progression.

Given all this, it’s fair to ask why I worked for Generate in the first place. Coming out of grad school, I didn’t have all this sorted myself, nor did I anticipate exactly what working in biotech would be like. I do want to be clear that any mismatch between expectations and reality was due to my inexperience and naivety, not because anybody at Generate misled me.

I also want to note that I’m not advocating for researchers to “only care about the science” without regard for the societal impacts of their work. Instead, I’m arguing that startup-style “value-creation” and business questions in general are insufficient (and uninteresting to me) proxies for societal impact. The most obvious cases are therapeutics that are not pursued because they wouldn’t be profitable, such as coronavirus vaccines pre-2019 or treatments for diseases that primarily affect poor countries. In another example, the siloing and privatization of biomedical science impairs our ability to build on earlier research to speed up the development of life-saving drugs.

I considered applying for academic jobs. My dad’s a professor, and most tenured professors I know seem to have really nice lives. Unfortunately, trying to earn tenure with a toddler seems very challenging for me and unfair to my wife for how limited I would be as a parent.

Furthermore, the incentives in academia (to build one’s personal brand by publishing and getting grants) also do not align perfectly with societal impact. I’m not so naive as to think Microsoft or any other tech giant aligns perfectly with societal impact either. However, I’m hoping that being away from the directly profit-making part of the company will provide some insulation from business questions.

When I was looking for jobs at the end of my PhD (after not getting chosen for the Google AI Residency), Yisong Yue asked me what my ideal position would be, and I told him working as a researcher at a prominent industrial research lab. Here’s hoping MSR is everything I hope it will be!

Sometimes the fact that all proteins are not the same length is the bane of my existence.

One way we deal with variable-length sequences in machine learning is to pad them all to the same length. For example, if I have the following nucleotide sequences:

```
AGCTAG
AGCTAGA
AGCTA
```

I can right-pad them by introducing a special padding token, let’s call it `p`.

```
AGCTAGp
AGCTAGA
AGCTApp
```

And then I can use this as a batch input to the machine-learning model of my choice by assigning each nucleotide and the padding token an integer:

```
x = torch.tensor(
[
[0, 2, 1, 3, 0, 2, 4],
[0, 2, 1, 3, 0, 2, 0],
[0, 2, 1, 3, 0, 4, 4]
]
)
```
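That encoding step can be sketched as a small helper (the mapping `A=0, C=1, G=2, T=3` and padding id `4` are read off the tensor above):

```python
import torch

def encode_and_pad(seqs, pad_idx=4):
    """Integer-encode nucleotide strings and right-pad them to equal length."""
    vocab = {'A': 0, 'C': 1, 'G': 2, 'T': 3}
    max_len = max(len(s) for s in seqs)
    return torch.tensor(
        [[vocab[c] for c in s] + [pad_idx] * (max_len - len(s)) for s in seqs]
    )

x = encode_and_pad(["AGCTAG", "AGCTAGA", "AGCTA"])
```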

If I want to put sequences through a transformer or RNN or CNN and I don’t want the model to be affected by the padding, I can also pass in a mask (`mask = (x != 4)`) and use that mask to set things to zero where appropriate. For example, the PyTorch `Transformer` class uses this sort of mask (but with a `ByteTensor`) for its [`src`/`tgt`/`mask`]`_padding_mask` arguments.
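For instance, a minimal sketch of building the mask and zeroing padded positions in a hypothetical `(B, L, d)` embedding tensor:

```python
import torch

x = torch.tensor([[0, 2, 1, 3, 0, 2, 4],
                  [0, 2, 1, 3, 0, 2, 0],
                  [0, 2, 1, 3, 0, 4, 4]])
mask = (x != 4)                                  # True at real tokens, False at padding
emb = torch.randn(x.shape[0], x.shape[1], 8)     # stand-in embeddings, shape (B, L, d)
emb = emb * mask.unsqueeze(-1)                   # zero out padded positions
```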

Unfortunately, `nn.BatchNorm1d` doesn’t support this type of masking, so if I zero out padding locations, then my minibatch statistics get artificially lowered by the extra zeros. Given PyTorch’s object-oriented nature, the most elegant way to implement masked batchnorm would be to extend one of their classes and modify the way minibatch statistics are calculated.

Starting at `nn.BatchNorm1d`, we find that all this class implements is a method for checking input dimensions:

```
def _check_input_dim(self, input):
    if input.dim() != 2 and input.dim() != 3:
        raise ValueError('expected 2D or 3D input (got {}D input)'
                         .format(input.dim()))
```

Its superclass (`nn._BatchNorm`) has a `forward` method, which checks whether to use `train` or `eval` mode, retrieves the parameters needed to calculate the moving averages, and then calls `F.batch_norm`. `F.batch_norm` in turn calls `torch.batch_norm`. Clicking on that in GitHub leads back to `F.batch_norm`: I think it must actually be implemented in the lower-level cpp code.

In any case, it looks like there’s no straightforward way to extend PyTorch’s batchnorm implementation, so it’s time to write it from scratch.

Given a `(B, 1, L)` mask, we first mask the input and then compute the number of unmasked locations over which to calculate the minibatch statistics:

```
if input_mask is not None:
    masked = input * input_mask
    n = input_mask.sum()
```

Then calculate the minibatch mean:

```
masked_sum = masked.sum(dim=0, keepdim=True).sum(dim=2, keepdim=True)
current_mean = masked_sum / n
```

And the minibatch variance:

```
# Re-apply the mask so padded positions don't contribute (0 - mean)^2 terms.
current_var = (((masked - current_mean) * input_mask) ** 2)
current_var = current_var.sum(dim=0, keepdim=True).sum(dim=2, keepdim=True) / n
```
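Putting the pieces together, a minimal functional sketch (training mode only; this hypothetical `masked_batch_norm` ignores the running averages and learnable affine parameters that the full module would track):

```python
import torch

def masked_batch_norm(input, input_mask=None, eps=1e-5):
    """Normalize a (B, C, L) batch using statistics over unmasked locations only.

    input_mask: (B, 1, L) tensor, 1 at real tokens and 0 at padding.
    Sketch only: no running averages, no learnable affine parameters.
    """
    if input_mask is None:
        input_mask = torch.ones(input.shape[0], 1, input.shape[2])
    input_mask = input_mask.to(input.dtype)
    masked = input * input_mask
    n = input_mask.sum()                                    # unmasked locations
    mean = masked.sum(dim=(0, 2), keepdim=True) / n         # per-channel mean
    var = (((masked - mean) * input_mask) ** 2).sum(dim=(0, 2), keepdim=True) / n
    return (input - mean) / torch.sqrt(var + eps) * input_mask

torch.manual_seed(0)
x = torch.randn(2, 3, 4)
mask = torch.ones(2, 1, 4)
mask[1, 0, 2:] = 0.0                                        # pad the last two positions
out = masked_batch_norm(x, mask)
```

Over the unmasked positions, each channel of `out` has mean ≈ 0 and variance ≈ 1, and padded positions stay at zero.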

The full module is available as a gist.

Because I didn’t want to dig deeper into the PyTorch source, there are a few limitations here.

- It’s almost certainly not as fast as the native PyTorch implementation.
- If you’re doing multi-GPU training, minibatch statistics won’t be synced across devices as they would be with Apex’s SyncBatchNorm.
- If you’re doing mixed-precision training with Apex, you can’t use level `O2` because it won’t detect that this is a batchnorm layer and keep it in float precision.

I read mostly for pleasure, and still mostly fiction and mostly SFF (43 / 53 books). When I do read non-fiction, I tend to read history and sociology, especially pertaining to race in the United States. Although I don’t actively try to diversify my reading or seek out books by women, my taste does tend towards books by Asians (28 / 53) and books by women (34 / 53). Knowing the importance of reading diverse books by diverse authors, I wanted to see how I did in this aspect in 2019 and be more intentional about the books I seek out in 2020.

Note that the numbers for race and gender don’t add to 53 because I didn’t include author statistics for the anthology *A People’s Future of the United States.*

Things get a little bit more interesting if I break things down by fiction/non-fiction and consider race × gender. Some observations:

- The gap between male and female authors is entirely because I didn’t read any fiction by non-Asian male authors.
- I only read one book by a Hispanic author.
- I didn’t plot the publication years, but I only read two books published before 2000 (and for one of those, the English translation appeared in 2019), and 26 of the books were published in 2019. In large part, this is due to me getting a lot of book recommendations from Twitter. However, it’s also true that SFF has undergone a revolution in the last decade or so, and I have a definite preference for these new voices in the genre. There’s also been a surge in SFF translated from Chinese, spearheaded by the inimitable Ken Liu.
- I don’t have time right now, but it would be interesting to disaggregate the Asians.
- Another thing this plot doesn’t show is the number of distinct authors: for example, both the books by Indigenous authors are by Rebecca Roanhorse.

- I should read a little bit more non-fiction.
- I should read more books by Black, Hispanic, and Indigenous authors.
- I should read more older books.

My first SFF stories were excerpts from Chinese mythology read to me by my dad. I didn’t know I could read SFF in English that incorporated that world and history until I came across RF Kuang’s 2018 book *The Poppy War*. I read all 544 pages in one day. *The Dragon Republic* is the sequel, and if anything, even better written and more gut-wrenching. Seriously, Kuang really likes to torment her characters. Following her on Twitter is also what’s biased my reading towards SFF by Asian women, and I have no regrets.

This is probably the best-written novel I read in 2019. Ng masterfully weaves seemingly disparate threads into a cohesive story that speaks to intimate familial connections, the complex intersections of race and class, and the ramifications of choices and secrets. Interestingly, the story is set in suburban Ohio, where I grew up, and Ng now lives in Boston, where I now live.

In *Stamped from the Beginning*, Kendi lays out a thesis that there are two strains of anti-black thought. Segregationists contend that some races are inherently inferior, and so our current racial hierarchy is the natural order. Assimilationists contend that some races act in inferior ways, whether because of culture or past discrimination, and so our current racial hierarchy will go away when those races learn to not act in inferior ways. While almost everybody now recognizes segregationist ideas as racist, it’s important to realize that assimilationist ideas are also racist. Both posit that some racial groups are superior to others. Anti-racist ideas hold that racial groups are equal, and that the only thing inferior about black people is their opportunities. *How to Be An Antiracist* is primarily a memoir describing how Kendi came to these conclusions. However, it also introduces some new ideas, some of which went against what I knew about anti-racism. For example, Kendi pushes back against the idea that black people cannot be racist and the concept of “microaggressions.”

I think this book about a sassy Chinese girl in segregated Atlanta is technically Young Adult fiction. I really enjoyed the main character’s voice and Lee’s sense of humor.

- *The Burning God* is the final installment in the *Poppy War* trilogy. Kuang keeps putting out screenshots and teasers on Twitter, and I’m a little upset I have to wait until November.
- Ken Liu’s short story *The Paper Menagerie* is the first and only work of any sort to win a Hugo, a Nebula, and a Locus. He’s coming out with his second short story collection, *The Hidden Girl and Other Stories*, in February.
- After wrapping up the Broken Earth Trilogy, which won three Hugos in a row, N. K. Jemisin is releasing *The City We Became* in March. Kuang keeps mentioning on Twitter that she’s read the ARC and that it’s really good.
- The first two installments of S. A. Chakraborty’s Daevabad trilogy have an entrancing world and a captivating story. The third book, *The Empire of Gold*, comes out in May.
- *A Memory Called Empire* is one of those books that feels totally fresh and opens the genre to new directions. In May, Arkady Martine will release the sequel, *A Desolation Called Peace*.
- I started reading *Big Sister, Little Sister, Red Sister*, by Jung Chang, but then had to return it to the library before I finished. It’s ostensibly a biography of the three Soong sisters, but really it’s a gripping crash-course in 20th-century Chinese history. I’d really like to know more Chinese history! From what I’ve read, I’ve already been questioning a lot of the bits I picked up from going to the Sun Yat-Sen and Chiang Kai-Shek memorials in Taipei. In retrospect, I should probably have expected those to be slightly biased.
- I got *Tears We Cannot Stop: A Sermon to White America* as a Christmas present, and it looks really good.

Looking at these, I don’t think they’re going to help me that much towards the resolutions I listed above…

The full list is available here.

See previous posts (1, 2, 3) for thoughts on the first three days.

The Expo only had two biotech companies (BenevolentAI and Novartis), but there are some unexpected companies interested in machine learning on proteins. I talked to scientists or saw papers from Google, Salesforce, Amazon, and Facebook. It will be interesting to see whether this remains a fun side-project for scientists at big tech companies, or if they’ll be making bigger investments.

In general, the talks are not as full as you’d expect given the scale of the conference. Relatedly, multiple people have told me that their strategy for NeurIPS is to mostly ignore the talks and meet with friends and colleagues new and old. With that said, the keynotes do tend to fill up, and the poster sessions were basically science mosh pits.

I noticed a lot of momentum in expanding deep learning and machine learning beyond achieving high predictive accuracy. I have a particular soft spot for all the work on Bayesian deep learning and uncertainty quantification, and the first half of the BDL workshop was a definite highlight. Yoshua Bengio gave a keynote on how the field needs to move to models with stronger “consciousness” priors that generalize better to unseen situations. This out-of-distribution generalization and transfer would be the first step towards higher-level cognition and agents with internal world models that can reason about causality and seek out knowledge.

There’s also a lot more (and very necessary) interest in fairness, ethics, and safety. I missed the opening keynote, but moving from not wanting to change the conference name to having an opening keynote mention the #MeToo movement is encouraging. There were workshops on safety, climate change, and fairness in healthcare.

**Biological Sequence Design using Batched Bayesian Optimization**

David Belanger, Suhani Vora, Zelda Mariet, Ramya Deshpande, David Dohan, Christof Angermueller, Kevin Murphy, Olivier Chapelle, Lucy Colwell.

You know your subfield is getting popular when Kevin Murphy shows up on the author list. Well-done overview of some of the difficulties of applying Bayesian optimization to biological sequences, with comparisons between some methods.

**Wat heb je gezegd? Detecting Out-of-Distribution Translations with Variational Transformers**

Tim Z. Xiao, Aidan N. Gomez, Yarin Gal.

Comes up with a way to estimate epistemic uncertainty on sequence predictions and some good ideas on how to measure the quality of an uncertainty measure.

**Deep Mean Functions for Meta-Learning in Gaussian Processes**

Vincent Fortuin, Gunnar Rätsch.

Using a NN for the mean function in a GP seems like it should lead to better performance, but maximizing the marginal likelihood leads to overfitting because the marginal likelihood does not regularize the hyperparameters of the mean function. We can get around that by using meta-learning to learn a prior mean parameterized by a NN. This seems related to neural processes.

**Reconstructing continuous distributions of 3D protein structure from cryo-EM images**

Ellen D. Zhong, Tristan Bepler, Joseph H. Davis, Bonnie Berger.

Recent advances in hardware have enabled much higher resolution cryo-EM structures. However, most protein structure work ignores the fact that proteins exist as dynamic ensembles of structures. This work uses a VAE to reconstruct the entire ensemble from cryo-EM images.

- Walking up to people I don’t know is very hard and stressful, but if I set up an appointment with somebody, then talking to them isn’t stressful at all.
- However, having conversations in two emails, Twitter, Whova, FB Messenger, SMS, and WeChat all going at once *is* a little stressful.
- I wish I’d contacted more non-bio people doing work I’m interested in. All the ones I talked to were delightful!

There are a few papers here with clear ethical issues. For example:

Also see this

The worst part is that a lot of these are trying to extract information that we have no reason to believe exists in the data.

And no, rejecting papers for ethical reasons is no more censorship than is rejecting papers for poor methodology. In other fields, certain topics or experiments require ethical review before doing the work, and publications won’t even review papers that were done without this initial approval.

Not as many papers today because I was too tired to do as much pushing through crowds at the poster session.

**Thompson Sampling and Approximate Inference**

My Phan, Yasin Abbasi-Yadkori, Justin Domke.

Variational inference is much faster than exact methods for computing Bayesian posteriors, but even small errors in the approximate posteriors lead to very poor performance (linear regret) when they are used in Thompson sampling. If the posterior estimates are under-dispersed, forced exploration can restore strong performance.

**Can Unconditional Language Models Recover Arbitrary Sentences?**

Nishant Subramani, Sam Bowman, Kyunghyun Cho

Given a language model decoder and a sentence, we can find an encoding that reconstructs the sentence perfectly when passed through the decoder.

**Practical Two-Step Look-Ahead Bayesian Optimization**
Jian Wu, Peter I. Frazier.

Bayesian optimization algorithms typically only look one step ahead during the acquisition function because methods that look two or more steps ahead are either too expensive or too inaccurate to be useful. This paper combines a few techniques to make accurate two-step look-ahead feasible.

- Trenton Bricken
- Jonathan Niles-Weed
- Amy Lu
- Alex Lu
- Alan Moses
- Philip Paquette
- David Yang
- Sam Sinai
- Elīna Locāne
- Neil Thomas
- Roshan Rao
- Nicholas Bhattacharya

I also ran into Neil Lawrence in the elevator and Andreas Krause in front of a poster today. I recognized Neil’s voice from Talking Machines. I’ve worked with several people who had previously worked with Andreas Krause, and refer to a few of his papers frequently, so I was very excited to hear that he’d read one of my papers.

**The Cells Out of Sample (COOS) dataset and benchmarks for measuring out-of-sample generalization of image classifiers**

Alex X. Lu, Amy X. Lu, Wiebke Schormann, Marzyeh Ghassemi, David W. Andrews, Alan M. Moses.

Neural networks often learn confounders in biological data, such as the day of the experiment, the facility, or the position in a plate. The COOS dataset tests classifier performance when generalizing across these effects. Surprisingly, a vanilla supervised learner generalizes better than pre-trained models.

**Single-Model Uncertainties for Deep Learning**

Natasa Tagasovska, David Lopez-Paz.

Separately model aleatoric uncertainty using quantile regression and epistemic uncertainty by training a classifier to predict whether test examples are from the training distribution.
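The aleatoric half relies on the standard pinball (quantile) loss, which is minimized when the prediction equals the tau-quantile of the targets; training one head at tau = 0.05 and another at tau = 0.95 then yields a 90% prediction interval. A minimal version (the function name is mine):

```python
def pinball_loss(y_true, y_pred, tau):
    """Quantile (pinball) loss: minimized in expectation when y_pred
    is the tau-quantile of the target distribution."""
    diff = y_true - y_pred
    # Under-predictions are weighted by tau, over-predictions by 1 - tau.
    return max(tau * diff, (tau - 1) * diff)
```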

**Image Synthesis with a Single (Robust) Classifier**

Shibani Santurkar, Dimitris Tsipras, Brandon Tran, Andrew Ilyas, Logan Engstrom, Aleksander Madry.

If you make your image classifier robust to adversarial examples, you can use it to generate realistic samples.

**Adaptive Sequence Submodularity**

Marko Mitrovic, Ehsan Kazemi, Moran Feldman, Andreas Krause, Amin Karbasi.

Problems that involve finding the optimal ordering of edges in a graph, such as choosing the order in which to recommend movies or video games, turn out to be submodular. Therefore, once the graph is computed, the order can be greedily optimized with strong optimality guarantees.
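The greedy step is the standard marginal-gain rule; here is a toy sketch with a generic set function `score` (names are mine, not the paper's notation):

```python
def greedy_sequence(candidates, score, k):
    """Greedily build an ordered selection of up to k items, adding the
    item with the largest marginal gain at each step. For submodular
    objectives this enjoys constant-factor optimality guarantees."""
    chosen = []
    remaining = list(candidates)
    for _ in range(min(k, len(remaining))):
        # Marginal gain of adding c to the current selection.
        best = max(remaining, key=lambda c: score(chosen + [c]) - score(chosen))
        chosen.append(best)
        remaining.remove(best)
    return chosen
```

For example, with a coverage objective (size of the union of chosen sets), the greedy rule picks the item covering the most new elements at each step.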

**Computing Full Conformal Prediction Set with Approximate Homotopy**

Eugene Ndiaye, Ichiro Takeuchi.

Transductive conformal regression traditionally involves calculating conformal scores for an intractable number of possible predictions. We can get around that by making some assumptions about the smoothness of the underlying predictive function. This results in slightly looser prediction intervals but maintains validity. Unlike inductive conformal regression methods, there’s no need to split the training data into a proper training set and a conformal scoring set.
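For contrast, the inductive (split) variant mentioned in the last sentence fits in a few lines — a toy version assuming absolute residuals as the conformity scores (the function name is mine):

```python
import math


def split_conformal_interval(residuals, alpha, prediction):
    """Inductive (split) conformal interval from held-out absolute residuals.

    Take the ceil((n + 1) * (1 - alpha))-th smallest residual as the
    half-width q, then return [prediction - q, prediction + q]; this gives
    at least 1 - alpha coverage under exchangeability.
    """
    scores = sorted(residuals)
    k = math.ceil((len(scores) + 1) * (1 - alpha))
    q = scores[min(k, len(scores)) - 1]
    return prediction - q, prediction + q
```

The price of this simplicity is exactly what the paper avoids: the residuals must come from a held-out calibration split rather than the full training set.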

- NeurIPS is humongous. It’s spread over multiple rooms in two buildings, and the whole place is crawling with people desperately trying to charge their phones and laptops. At one point, the evening poster session was full and they were turning people away.

- With that said, this is the first conference I’ve been to where I know people. Between people from my company, people I met at AIStats, and people who have reached out to me asking to meet, I really feel much more comfortable here than I have at any other conference.
- I wish the oral and spotlight titles were in the printed program. The conference app is ok, but it’s a fair number of clicks to find the talk titles.
- The sponsor booths here are much more impressive than they were at ISMB.
- Novartis and Benevolent AI are the only biotech companies in the sponsor area. The rest of them really need to step it up and stop letting the financial companies have such an outsize presence.
- Sometimes it’s fun to go to a session about things I don’t really study (reinforcement learning, in this case).
- The dynamic range of talk quality is huge. I’ve started to just tune out talks after the second slide if I can’t follow.

**Combinatorial Bayesian Optimization using the Graph Cartesian Product**

Changyong Oh, Jakub M. Tomczak, Efstratios Gavves, Max Welling

If you want to optimize over combinatorial input spaces, then it makes a lot of sense to encode them as graphs and use graph kernels.

**Learning to Perform Local Rewriting for Combinatorial Optimization**

Xinyun Chen, Yuandong Tian

More combinatorial optimization, this time by learning one policy that chooses what to change, and another policy that chooses how to change it. The policies are rewarded based on the final result, so this should do better than greedy.

**A Simple Baseline for Bayesian Uncertainty in Deep Learning**

Wesley Maddox, Timur Garipov, Pavel Izmailov, Dmitry Vetrov, Andrew Gordon Wilson.

A simple method for approximating the true posterior over neural network weights. Somebody needs to see how this does on the Deep Bayesian Bandits Showdown.
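The method fits a Gaussian to snapshots of the weights collected along the SGD trajectory. A toy sketch of the diagonal variant, ignoring the low-rank covariance term (names are mine):

```python
import random


def fit_diag_gaussian(snapshots):
    """Fit a per-coordinate Gaussian to weight snapshots collected along
    an SGD trajectory (a rough sketch of the diagonal variant)."""
    n = len(snapshots)
    dim = len(snapshots[0])
    mean = [sum(w[i] for w in snapshots) / n for i in range(dim)]
    var = [sum((w[i] - mean[i]) ** 2 for w in snapshots) / n for i in range(dim)]
    return mean, var


def sample_weights(mean, var, rng=random):
    """Draw one weight sample from the fitted posterior approximation."""
    return [rng.gauss(m, v ** 0.5) for m, v in zip(mean, var)]
```

At test time you would draw several weight samples, run the network with each, and average the predictions.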

**Scalable Global Optimization via Local Bayesian Optimization**

David Eriksson, Michael Pearce, Jacob R Gardner, Ryan Turner, Matthias Poloczek

Bayesian optimization often breaks down in high-dimensional spaces because samples are too sparse or the uncertainty is too large near the edges of the search space, leading to over-exploration. Trust-region algorithms overcome this by limiting the search to an area around the current best solution and then fitting very simple models (linear, quadratic) in that region. The region grows or shrinks based on whether evaluations improve the optimum. Unsurprisingly, fitting a GP and doing BO within the trust region performs much better than those simple models.
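The grow-or-shrink bookkeeping can be sketched roughly like this (the thresholds and names here are hypothetical, not the paper's exact constants):

```python
def update_trust_region(length, success, success_count, fail_count,
                        succ_tol=3, fail_tol=3, l_min=1e-3, l_max=1.0):
    """Trust-region bookkeeping: expand the region after succ_tol
    consecutive improvements, shrink it after fail_tol consecutive
    failures. Returns the new (length, success_count, fail_count)."""
    if success:
        success_count, fail_count = success_count + 1, 0
    else:
        success_count, fail_count = 0, fail_count + 1
    if success_count >= succ_tol:
        length, success_count = min(2 * length, l_max), 0  # expand
    elif fail_count >= fail_tol:
        length, fail_count = length / 2, 0  # shrink
    if length < l_min:
        length = l_min  # in practice, restart the trust region instead
    return length, success_count, fail_count
```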

**Learning to control self-assembling morphologies via modular co-evolution of control and morphology**

Deepak Pathak, Chris Lu, Trevor Darrell, Phillip Isola, Alexei A. Efros

Just as biology began with simple structures (cells) and gained physical complexity in parallel with behavioral complexity, it can be easier to evolve a control policy in parallel with the thing being controlled. For example, we can learn a behavioral policy for simple magnetic arms that allows them to assemble in order to perform complex tasks.

**Guided meta-policy search**

Russell Mendonca, Abhishek Gupta, Rosen Kralev, Pieter Abbeel, Sergey Levine, Chelsea Finn

Naively training a meta reinforcement learning policy that does well with few examples on new tasks requires an intractable number of samples. Instead, we can train on each of the training tasks individually, and then train the meta-policy using the individual policies as supervision signals.

**Weight agnostic neural networks**

Adam Gaier, David Ha

Optimizing over architectures instead of over weights finds networks that are surprisingly good at RL and image classification, even when all their weights are set to the same value.

I taught high school math and physics for three years as a Teach for America (TFA) corps member between college and grad school. I have mixed opinions about TFA and its role in American education, but I recently did a small interview for them, and some of the questions were about how I use skills from my time in the classroom in my current career.

Jumping into first-year PhD classes after taking three years off was definitely a big challenge. Even though I had worked in Frances Arnold’s lab as a summer researcher while teaching, it turns out that first-year classes don’t have much to do with your PhD research, and things would definitely have been easier if my undergraduate thermodynamics and fluid mechanics had been fresher. As I got deeper into research, and now into my first scientist job, the transferable skills became clearer.

While there were some long nights finishing problem sets or debugging code, on average I worked much less as a graduate student than I did as a teacher. For context, I often got to my school before 7am, taught 8am - 3pm, had meetings or coached robotics or caught up on grading until 6pm, then prepped for 2+ hours at home. There’s not much pressure like knowing you have 25 9th-graders waiting for you at 8am, and that any time you don’t have planned they’ll happily fill in with unproductive shenanigans, and then later they’ll be resentful if you do plan well and they don’t have time for unproductive shenanigans.

Being required to perform at a high level in the classroom every day pushed me to use my planning time well, to prioritize, and to develop sustainable procedures. While I don’t need to deliver such high-stakes output each and every day as a scientist, planning and prioritizing are definitely essential when there are so many interesting problems to solve, and different managers may need different deliverables on different timescales. Giving continual, steady effort is also an important skill during a PhD, especially if you work in a large group with a busy professor who won’t check in on you much but will expect results when she does.

The first procedure I felt like I really mastered was designing assessments that I could grade quickly. This allowed me to spend more time on everything else while also giving students the quickest possible concrete feedback on assignments. It carried over directly to TAing in grad school, where the primary responsibility in most of my classes was writing and grading assignments. As a TA, I also led discussions, held office hours, and gave the occasional lecture; planning and executing those sessions drew directly on my experience doing the same for high schoolers.

In addition to formal teaching responsibilities, graduate students also mentor undergraduate researchers and newer students looking to join their group. As an industry scientist, I still spend time mentoring and teaching newer scientists and receive weekly feedback from one or more managers. As a new teacher, my administration and more experienced teachers invested a lot of time in observing my lessons and then giving me useful, actionable feedback. This feedback was pretty much the only reason I became a competent instructor after my first year.

As a graduate student, I mentored one student who then used the experience to win an NSF Graduate Research Fellowship, and another who has since earned internships with Microsoft. While both these students were brilliant when I met them, I’d like to think I helped them along.

It doesn’t matter how good your work is if nobody understands it. In Frances’s group, students gave a formal group meeting presentation about their ongoing work twice a year. At my current company, I give a variety of updates to the machine learning team or the whole company. My classroom experience taught me to communicate complicated topics to an audience with a range of backgrounds. I know how to check for understanding to make sure I’m not boring people or leaving anybody behind. I definitely don’t get nervous about speaking in front of people anymore after doing it every day for 3 years.

Furthermore, making slides is an art. I got really good at making animations for physics and for stepping through math (because my handwriting is bad and it’s hard to watch your classroom while writing on the board). I still make fancy slides, but now it’s to explain machine learning to biologists and biologists to machine learning people.

TFA’s mission is to end educational inequity. It turns out that, even outside the world of K-12 education, there are a lot of inequitable outcomes, and it’s very hard to unsee once you know to look. This is directly applicable when I’m helping to write recommendation letters for students, or involved in hiring at a machine-learning startup.
