Clara Meister

Postdoc in Computer Science

EPFL

About Me

I’m a postdoc at EPFL in Antoine Bosselut’s NLP Lab, working on tokenization and methods for multilingual language modeling, with a focus on low-resource settings. I also continue to lecture for the MAS in AI and Digital Technology continuing studies programme at ETH Zürich. Previously, I did my PhD at ETH, supported by a Google PhD Fellowship. A large portion of my research has been on natural language generation—specifically, on decoding methods for probabilistic models. Lately, I’ve been particularly interested in tokenization strategies for language models and cross-lingual fairness in tokenization. See my Google Scholar page for an up-to-date list of publications. Feel free to reach out if you’d like to discuss research!

I’m also a co-founder and organizer of Zürich AI, Switzerland’s largest machine learning meetup, which hosts regular events bridging industry and academia at the ETH AI Center.

In my free time, I like to trail run, cook and drink wine (only the last two are done simultaneously).

Education

PhD in Computer Science, 2024

ETH Zürich

MSc in Computational and Mathematical Engineering, 2018

Stanford University

BSc in Mathematical and Computational Science, 2017

Stanford University

Blog Posts

UnigramLM: An Attempt at Writing The Missing Manual

March 16, 2026
This post is my attempt to write down the UnigramLM tokenization algorithm cleanly and explicitly because no such derivation appears to exist and I think understanding the theory behind the method …
Publications

I do not keep this updated; see my Google Scholar page for an up-to-date list of publications.

(2023). A Formal Perspective on Byte-Pair Encoding. Findings of the Association for Computational Linguistics: ACL 2023.
(2023). A Measure-theoretic Characterization of Tight Language Models. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
(2023). On the Efficacy of Sampling Adapters. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
(2023). Tokenization and the Noiseless Channel. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
(2023). On the Usefulness of Embeddings, Clusters and Strings for Text Generation Evaluation. Proceedings of the 11th International Conference on Learning Representations.