Impressions LREC Marseille 2022

Impressions

LREC 2022 was a very special conference for me in two ways. First, it was my first conference that was not mainly aimed at computer scientists. Second, I got to publish the Kassel State of Fluency dataset, which I consider one of the major contributions of my Ph.D. I was surprised by how many participants attended in person and by how many different fields they came from. It is a very interesting conference that truly enables interdisciplinary exchange, as you are confronted with different experiences and views on many concepts. I had been convinced that my dataset mainly contributes to machine learning for health, but after talking to people, I became more and more convinced that it is also useful to therapists and clinical linguists. The community seems to be very tight-knit: it was easy to notice that the older participants in particular have known each other for a really long time, yet they are still open to welcoming newcomers. That made networking easy, and I made a lot of new contacts.

The main focus of the conference is on language resources, and as such, it is a true treasure trove. The only downside is that some papers lack rigor in their ML evaluation: many people from outside ML dip their toes into ML with their data and sometimes lack the methods and experience needed for a sound evaluation.

Proceedings:

The full proceedings are available online. You can check them out here.

Personal paper highlights

The papers here showcase the variety of work presented at the conference and are listed in no particular order. I found these and a number of other papers very interesting.

VoxCommunis: A Corpus for Cross-linguistic Phonetic Analysis

Authors: Emily P. Ahn, Eleanor Chodroff

  • derived from the Mozilla Common Voice Corpus
  • contains acoustic models, pronunciation lexicons, and word- and phone-level alignments
  • data from 36 languages
  • The corpus also contains acoustic-phonetic measurements

Common Phone: A Multilingual Dataset for Robust Acoustic Modelling

Authors: Philipp Klumpp, Tomas Arias, Paula Andrea Pérez-Toro, Elmar Noeth and Juan Orozco-Arroyave

  • a curated version of Common Voice
  • only clean, good-quality audio, balanced w.r.t. gender and microphone/speaker

Dataset of Student Solutions to Algorithm and Data Structure

Authors: Fynn Petersen-Frey, Marcus Soll, Louis Kobras, Melf Johannsen, Peter Kling and Chris Biemann

  • solutions to student programming exercises
  • could be useful to detect certain code patterns/antipatterns
  • autograder
  • developer intent detection

Wiktextract

Author: Tatu Ylonen

  • open-source Python/Lua package
  • data in JSON format
  • categories
  • lemmas
  • pronunciations

ALIGNMEET: A Comprehensive Tool for Meeting Annotation, Alignment, and Evaluation

Authors: Peter Polák, Muskaan Singh, Anna Nedoluzhko and Ondřej Bojar

  • tool for meeting annotation, alignment, and evaluation
  • summarization is a hard problem in multi-party meetings
  • interface for fast annotation while mitigating the risk of introducing errors
  • evaluation mode for assessing the quality of meeting minutes
  • open source, installable from PyPI
  • they also published a dataset created with the tool at LREC: ELITR Minuting Corpus: A Novel Dataset for Automatic Minuting from Multi-Party Meetings in English and Czech

Elderly Conversational Speech Corpus with Cognitive Impairment Test and Pilot Dementia Detection Experiment Using Acoustic Characteristics of Speech in Japanese Dialects

Authors: Meiko Fukuda, Ryota Nishimura, Maina Umezawa, Kazumasa Yamamoto, Yurie Iribe, and Norihide Kitaoka

  • recorded conversations of 128 elderly people with interviewers
  • interviewers also administered Hasegawa’s Dementia Scale-Revised (HDS-R), a cognitive impairment test

Samrómur Children: An Icelandic Speech Corpus

Authors: Carlos Daniel Hernandez Mena, David Erik Mollberg, Michal Borský and Jón Guðnason

  • 131 hours of read speech from Icelandic children aged 4 to 17 years
  • crowdsourced