Impressions LREC Marseille 2022
LREC 2022 was a very special conference for me in two ways. First, it was my first conference that was not aimed mainly at computer scientists. Second, I was able to publish the Kassel State of Fluency dataset, which I consider one of the major contributions of my Ph.D. I was surprised by how many participants attended in person and by how many different fields they came from. It is a very interesting conference that truly enables interdisciplinary exchange, as you are confronted with different experiences and views on many concepts. I had been convinced that my dataset mainly contributes to machine learning for health, but the more people I talked to, the more convinced I became that it is also useful to therapists and clinical linguists. The community seems to be very tight-knit: it was easy to notice that especially the older participants have known each other for a really long time, yet they remain open to welcoming newcomers. That made it a good place for networking, and it was easy to make a lot of new contacts.
The main focus of the conference is on language resources, and as such, it is a true treasure trove. The only downside is that some papers lack rigor in their ML evaluation: many researchers from outside ML dip their toes into ML with their data and sometimes lack the methods and experience needed for a sound evaluation.
The full proceedings are available online. You can check them out here.
Personal paper highlights
The papers listed here showcase the breadth of the conference and are in no particular order. I found these and a number of other papers very interesting.
VoxCommunis: A Corpus for Cross-linguistic Phonetic Analysis
Authors: Emily P. Ahn, Eleanor Chodroff
- derived from the Mozilla Common Voice Corpus
- contains acoustic models, pronunciation lexicons, and word- and phone-level alignments
- data from 36 languages
- also contains acoustic-phonetic measurements
Common Phone: A Multilingual Dataset for Robust Acoustic Modelling
Authors: Philipp Klumpp, Tomas Arias, Paula Andrea Pérez-Toro, Elmar Noeth and Juan Orozco-Arroyave
- a curated version of Common Voice
- only clean, good-quality audio, balanced w.r.t. gender, microphone/speaker
Dataset of Student Solutions to Algorithm and Data Structure
Authors: Fynn Petersen-Frey, Marcus Soll, Louis Kobras, Melf Johannsen, Peter Kling and Chris Biemann
- solutions to student programming exercises
- could be useful to detect certain code patterns/antipatterns
- developer intent detection
Wiktextract: Wiktionary as Machine-Readable Structured Data
Author: Tatu Ylonen
- open-source Python/Lua package
- data in JSON format
ALIGNMEET: A Comprehensive Tool for Meeting Annotation, Alignment, and Evaluation
Authors: Peter Polák, Muskaan Singh, Anna Nedoluzhko and Ondřej Bojar
- tool for meeting annotation, alignment, and evaluation
- summarization is a hard problem in multi-party meetings
- interface for fast annotation while mitigating the risk of introducing errors
- evaluation mode for assessing the quality of meeting minutes
- open source, installable from PyPI
- they also published a dataset created with the tool at LREC: ELITR Minuting Corpus: A Novel Dataset for Automatic Minuting from Multi-Party Meetings in English and Czech
Elderly Conversational Speech Corpus with Cognitive Impairment Test and Pilot Dementia Detection Experiment Using Acoustic Characteristics of Speech in Japanese Dialects
Authors: Meiko Fukuda, Ryota Nishimura, Maina Umezawa, Kazumasa Yamamoto, Yurie Iribe, and Norihide Kitaoka
- recorded conversations of 128 elderly people with interviewers
- interviewers also administered Hasegawa’s Dementia Scale-Revised (HDS-R), a cognitive impairment test
Samrómur Children: An Icelandic Speech Corpus
Authors: Carlos Daniel Hernandez Mena, David Erik Mollberg, Michal Borský and Jón Guðnason
- 131 hours of read speech from Icelandic children aged 4 to 17 years