Impressions LREC Marseille 2022


LREC 2022 was a very special conference for me in two ways. First, it was my first conference that was not aimed mainly at computer scientists. Second, I got to publish the Kassel State of Fluency dataset, which I consider one of the major contributions of my Ph.D. I was surprised by the number of in-person participants and by the wide range of fields they came from. It is a very interesting conference that truly enables interdisciplinary exchange, as you are confronted with different experiences and views on many concepts. I had assumed that my dataset mainly contributes to machine learning for health, but the more people I talked to, the more convinced I became that it is also useful to therapists and clinical linguists. The community seems very tight-knit: it was easy to see that the older participants in particular have known each other for a long time, yet they remain open to welcoming newcomers. It was great for networking, and I easily made a lot of new contacts.

The main focus of the conference is on language resources, and as such, it is a true treasure trove. The only downside is that some papers lack rigor in their ML evaluation: many people from outside ML dip their toes into ML with their data and sometimes lack the methods and experience needed for a sound evaluation.


The full proceedings are available online.

Personal paper highlights

The papers here showcase the variety of work presented at the conference and are listed in no particular order. I found these and a number of other papers very interesting.

VoxCommunis: A Corpus for Cross-linguistic Phonetic Analysis

Authors: Emily P. Ahn, Eleanor Chodroff

  • derived from the Mozilla Common Voice Corpus
  • contains acoustic models, pronunciation lexicons, and word- and phone-level alignments
  • data from 36 languages
  • The corpus also contains acoustic-phonetic measurements

Common Phone: A Multilingual Dataset for Robust Acoustic Modelling

Authors: Philipp Klumpp, Tomas Arias, Paula Andrea Pérez-Toro, Elmar Noeth and Juan Orozco-Arroyave

  • a curated version of Common Voice
  • only clean, good-quality audio, balanced w.r.t. gender, microphone/speaker


Authors: Fynn Petersen-Frey, Marcus Soll, Louis Kobras, Melf Johannsen, Peter Kling and Chris Biemann

  • solutions to student programming exercises
  • could be useful to detect certain code patterns/antipatterns
  • autograder
  • developer intent detection


Author: Tatu Ylonen

  • open-source Python / Lua package
  • data in JSON format
  • categories
  • lemmas
  • pronunciations

ALIGNMEET: A Comprehensive Tool for Meeting Annotation, Alignment, and Evaluation

Authors: Peter Polák, Muskaan Singh, Anna Nedoluzhko and Ondřej Bojar

  • tool for meeting annotation, alignment, and evaluation
  • summarization is a hard problem in multi-party meetings
  • interface for fast annotation while mitigating the risk of introducing errors
  • evaluation mode for assessing the quality of meeting minutes
  • open source, installable from PyPI
  • they also published a dataset created with the tool at LREC: ELITR Minuting Corpus: A Novel Dataset for Automatic Minuting from Multi-Party Meetings in English and Czech

Elderly Conversational Speech Corpus with Cognitive Impairment Test and Pilot Dementia Detection Experiment Using Acoustic Characteristics of Speech in Japanese Dialects

Authors: Meiko Fukuda, Ryota Nishimura, Maina Umezawa, Kazumasa Yamamoto, Yurie Iribe, and Norihide Kitaoka

  • recorded conversations of 128 elderly people with interviewers
  • interviewers also administered Hasegawa’s Dementia Scale-Revised (HDS-R), a cognitive impairment test

Samrómur Children: An Icelandic Speech Corpus

Authors: Carlos Daniel Hernandez Mena, David Erik Mollberg, Michal Borský and Jón Guðnason

  • 131 hours of read speech from Icelandic children aged 4 to 17 years
  • crowdsourced