Impressions LREC Marseille 2022
Impressions
LREC 2022 was a very special conference for me in two ways. First, it was my first conference that was not aimed mainly at computer scientists. Second, I was able to publish the Kassel State of Fluency dataset, which I consider one of the major contributions of my Ph.D. I was surprised by the number of in-person participants and by the wide range of fields they came from. It is a very interesting conference that truly enables interdisciplinary exchange, as you are confronted with different experiences and views on many concepts. I had been convinced that my dataset mainly contributes to machine learning for health, but after talking to people, I became more and more convinced that it is also useful to therapists and clinical linguists. The community seems very tight-knit: it was easy to see that the older participants in particular have known each other for a very long time, yet they remain open to welcoming newcomers. That made the conference great for networking, and it was easy to make a lot of new contacts.
The main focus of the conference is on language resources, and as such, it is a true treasure trove. The only downside is that some papers lack rigor in their ML evaluation: many people from outside ML dip their toes into it with their own data and sometimes lack the methods and experience needed for a sound evaluation.
Proceedings:
The full proceedings are available online.
Personal paper highlights
The papers listed here showcase the breadth of the conference and are in no particular order. I found these, along with a number of others, very interesting.
VoxCommunis: A Corpus for Cross-linguistic Phonetic Analysis
Authors: Emily P. Ahn, Eleanor Chodroff
- derived from the Mozilla Common Voice Corpus
- contains acoustic models, pronunciation lexicons, and word- and phone-level alignments
- data from 36 languages
- the corpus also contains acoustic-phonetic measurements
Common Phone: A Multilingual Dataset for Robust Acoustic Modelling
Authors: Philipp Klumpp, Tomas Arias, Paula Andrea Pérez-Toro, Elmar Noeth and Juan Orozco-Arroyave
- a curated version of Common Voice
- only clean, good-quality audio, balanced w.r.t. gender and microphone/speaker
Dataset of Student Solutions to Algorithm and Data Structure Programming Assignments
Authors: Fynn Petersen-Frey, Marcus Soll, Louis Kobras, Melf Johannsen, Peter Kling and Chris Biemann
- solutions to student programming exercises
- could be useful to detect certain code patterns/antipatterns
- autograder
- developer intent detection
Wiktextract
Author: Tatu Ylonen
- open-source Python/Lua package
- data in JSON format
- categories
- lemmas
- pronunciations
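As a rough idea of how such JSON data could be consumed, here is a minimal Python sketch. It assumes the extracted data is stored with one JSON object per line, and the field names (`word`, `pos`, `categories`, `sounds`, `ipa`) as well as the two sample entries are assumptions made for illustration, not a confirmed description of the tool's output. The helper `index_pronunciations` is hypothetical as well.

```python
import json

# Hypothetical sample data, one JSON object per line. The field names
# are assumptions for illustration, not a confirmed schema.
SAMPLE = "\n".join([
    json.dumps({
        "word": "house",
        "pos": "noun",
        "categories": ["English nouns"],
        "sounds": [{"ipa": "/haʊs/"}],
    }),
    json.dumps({
        "word": "run",
        "pos": "verb",
        "categories": ["English verbs"],
        "sounds": [{"ipa": "/ɹʌn/"}, {"audio": "en-us-run.ogg"}],
    }),
])


def index_pronunciations(lines):
    """Map each lemma to the IPA strings found in its entries."""
    index = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        entry = json.loads(line)
        # Keep only sound records that actually carry an IPA transcription.
        ipas = [s["ipa"] for s in entry.get("sounds", []) if "ipa" in s]
        index.setdefault(entry["word"], []).extend(ipas)
    return index


pron = index_pronunciations(SAMPLE.splitlines())
for lemma, ipas in pron.items():
    print(lemma, ipas)
```

With a real dump, the same function could be fed an open file handle instead of `SAMPLE.splitlines()`, since it only iterates over lines.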
ALIGNMEET: A Comprehensive Tool for Meeting Annotation, Alignment, and Evaluation
Authors: Peter Polák, Muskaan Singh, Anna Nedoluzhko and Ondřej Bojar
- tool for meeting annotation, alignment, and evaluation
- summarization is a hard problem in multi-party meetings
- interface for fast annotation while mitigating the risk of introducing errors
- evaluation mode for assessing the quality of meeting minutes
- open source, installable from PyPI
- they also published a dataset created with the tool at LREC: ELITR Minuting Corpus: A Novel Dataset for Automatic Minuting from Multi-Party Meetings in English and Czech
Elderly Conversational Speech Corpus with Cognitive Impairment Test and Pilot Dementia Detection Experiment Using Acoustic Characteristics of Speech in Japanese Dialects
Authors: Meiko Fukuda, Ryota Nishimura, Maina Umezawa, Kazumasa Yamamoto, Yurie Iribe, and Norihide Kitaoka
- recorded conversations of 128 elderly people with interviewers
- interviewers also administered Hasegawa’s Dementia Scale-Revised (HDS-R), a cognitive impairment test
Samrómur Children: An Icelandic Speech Corpus
Authors: Carlos Daniel Hernandez Mena, David Erik Mollberg, Michal Borský and Jón Guðnason
- 131 hours of read speech from Icelandic children aged 4 to 17
- crowdsourced