Clara Meister

Postdoc in Computer Science

EPFL

About Me

I’m a postdoc at EPFL in Antoine Bosselut’s NLP Lab, working on tokenization and methods for multilingual language modeling, with a focus on low-resource settings. I also continue to lecture for the MAS in AI and Digital Technology continuing studies programme at ETH Zürich. Previously, I did my PhD at ETH, supported by a Google PhD Fellowship. A large portion of my research has been on natural language generation—specifically, on decoding methods for probabilistic models. Lately, I’ve been particularly interested in tokenization strategies for language models and cross-lingual fairness in tokenization. See my Google Scholar page for an up-to-date list of publications. Feel free to reach out if you’d like to discuss research!

I’m also a co-founder and organizer of Zürich AI, Switzerland’s largest machine learning meetup, which hosts regular events bridging industry and academia at the ETH AI Center.

In my free time, I like to trail run, cook and drink wine (only the last two are done simultaneously).

Education

PhD in Computer Science, 2024

ETH Zürich

MSc in Computational and Mathematical Engineering, 2018

Stanford University

BSc in Mathematical and Computational Science, 2017

Stanford University

Blog Posts

UnigramLM: An Attempt at Writing The Missing Manual

March 16, 2026
This post is my attempt to write down the UnigramLM tokenization algorithm cleanly and explicitly because no such derivation appears to exist and I think understanding the theory behind the method …
Publications

I do not keep this updated; see my Google Scholar page for an up-to-date list of publications.

(2023). A Formal Perspective on Byte-Pair Encoding. Findings of the Association for Computational Linguistics: ACL 2023.
(2023). A Measure-theoretic Characterization of Tight Language Models. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
(2023). On the Efficacy of Sampling Adapters. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
(2023). Tokenization and the Noiseless Channel. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
(2023). On the Usefulness of Embeddings, Clusters and Strings for Text Generation Evaluation. Proceedings of the 11th International Conference on Learning Representations.