Cross-lingual Information Extraction for the Assessment and Prevention of Adverse Drug Reactions

Lisa Raithel

Résumé

Die in dieser Dissertation beschriebene Arbeit befasst sich mit der mehrsprachigen Erkennung und Extraktion von unerwünschten Arzneimittelwirkungen in biomedizinischen Texten, die von Laien verfasst wurden. Ich beschreibe die Erstellung eines neuen dreisprachigen Korpus (Deutsch, Französisch, Japanisch) mit Schwerpunkt auf Deutsch und Französisch, einschließlich der Entwicklung von Annotationsrichtlinien, die für alle Sprachen gelten und sich an nutzergenerierten Texten orientieren. Weiterhin dokumentiere ich den Annotationsprozess und gebe einen Überblick über den resultierenden Datensatz. Anschließend gehe ich auf den Schutz der Privatsphäre der Nutzer in Bezug auf Daten über Gesundheitsprobleme ein. Ich präsentiere einen Prototyp zu einer Studie darüber, wie Nutzer reagieren, wenn sie direkt nach ihren Erfahrungen mit Nebenwirkungen befragt werden. Die Studie zeigt, dass die meisten Menschen nichts dagegen haben, ihre Erfahrungen zu schildern, wenn sie um Erlaubnis gefragt werden. Allerdings kann die Datenerhebung darunter leiden, dass der Fragebogen zu viele Fragen enthält. Als nächstes analysiere ich die Ergebnisse einer zweiten potenziellen Methode zur Datenerhebung in sozialen Medien, der synthetischen Generierung von Pseudo-Tweets, die auf echten Twitter-Nachrichten basieren. In der Analyse konzentriere ich mich auf die Herausforderungen, die dieser Ansatz mit sich bringt, und zeige, dass trotz einer vorläufigen Bereinigung noch Probleme in den Übersetzungen zu finden sind, sowohl was die Bedeutung des Textes als auch die annotierten Tags betrifft. Ich gebe daher anekdotische Beispiele dafür, was bei einer maschinellen Übersetzung schiefgehen kann, fasse die gewonnenen Erkenntnisse zusammen und stelle potenzielle Verbesserungsmaßnahmen vor. Weiterhin präsentiere ich experimentelle Ergebnisse für die Klassifizierung mehrsprachiger Dokumente bezüglich medizinischer Nebenwirkungen im Englischen und Deutschen. 
Dazu wurden Klassifikationsmodelle an verschiedenen Datensatzkonfigurationen verfeinert (fine-tuning), zunächst an englischen und dann an deutschen Dokumenten. Dieser Ansatz wurde durch das starke Ungleichgewicht der Labels in den beiden Datensätzen verkompliziert. Die Ergebnisse zeigen, dass die Einarbeitung englischer Trainingsdaten bei der Klassifizierung relevanter deutscher Dokumente hilft, aber nicht ausreicht, um das natürliche Ungleichgewicht der Dokumentenklassen wirksam abzuschwächen. Dennoch scheinen die entwickelten Modelle vielversprechend zu sein und könnten besonders nützlich sein, um weitere Texte zu sammeln. Diese wiederum können das aktuelle Korpus erweitern und damit die Erkennung relevanter Dokumente für andere Sprachen verbessern. Nachfolgend beschreibe ich die Teilnahme am n2c2 2022 Shared Task zur Erkennung von Medikamenten. Die Ansätze des Shared Task werden anschließend vom Englischen auf deutsche, französische und spanische Korpora ausgeweitet, indem Datensätze aus verschiedenen Teilbereichen verwendet werden, die auf unterschiedlichen Annotationsrichtlinien basieren. Ich zeige, dass die mehrsprachige Übertragung gut funktioniert, aber auch stark von den Annotationstypen und Definitionen abhängt. Im Anschluss verwende ich die besprochenen Modelle erneut, um einige vorläufige Ergebnisse für das vorgestellte Korpus zu zeigen, zunächst nur für die Erkennung von Medikamenten und dann für alle Arten von annotierten Entitäten. Die experimentellen Ergebnisse zeigen, dass die Medikamentenerkennung vielversprechend ist, insbesondere wenn man bedenkt, dass die Modelle an Daten aus einem anderen Teilbereich verfeinert und mit einem Zero-Shot-Ansatz auf die neuen Daten angewendet wurden. In Bezug auf die Erkennung anderer medizinischer Ausdrücke stellt sich heraus, dass die Leistung der Modelle stark von der Art der Entität abhängt. Ich schlage deshalb Möglichkeiten vor, wie man dieses Problem in Zukunft angehen könnte.

The work described in this thesis deals with the cross- and multi-lingual detection and extraction of adverse drug reactions (ADRs) in biomedical texts written by laypeople. This includes designing and creating a multi-lingual corpus, exploring ways to collect data without harming users' privacy, and investigating whether cross-lingual data can mitigate class imbalance in document classification. It further addresses the question of whether zero-shot and cross-lingual learning can be successful in medical entity detection across languages. I describe the creation of a new tri-lingual corpus (German, French, Japanese) focusing on German and French, including the development of annotation guidelines that are applicable to any language and oriented towards user-generated texts. I further describe the annotation process and give an overview of the resulting dataset. The data is annotated on four levels: the document level, describing whether a text contains ADRs; the entity level, capturing relevant expressions; the attribute level, further specifying these expressions; and the relation level, capturing how the aforementioned entities interact. I then discuss user privacy in data about health-related issues and the question of how to collect such data for research purposes without harming a person's privacy. I provide a prototype study of how users react when they are asked directly about their experiences with ADRs. The study reveals that most people do not mind describing their experiences if asked, but that data collection might suffer if the questionnaire contains too many questions. Next, I analyze the results of a potential second way of collecting social media data: the synthetic generation of pseudo-tweets based on real Twitter messages.
In the analysis, I focus on the challenges this approach entails and find that, despite some preliminary cleaning, problems remain in the translations, both with respect to the meaning of the text and the annotated labels. I therefore give anecdotal examples of what can go wrong during automatic translation, summarize the lessons learned, and present potential steps for improvement. Subsequently, I present experimental results for cross-lingual document classification with respect to ADRs in English and German. For this, I fine-tuned classification models on different dataset configurations, first on English and then on German documents. This approach was complicated by the strong label imbalance in both languages' datasets. I find that incorporating English training data helps in the classification of relevant documents in German, but that it is not enough to mitigate the natural imbalance of document labels effectively. Nevertheless, the developed models seem promising and might be particularly useful for collecting more texts describing experiences with side effects, which could extend the current corpus and improve the detection of relevant documents for other languages. Next, I describe my participation in the n2c2 2022 shared task on medication detection. I then extend the shared task approaches from English to German, French and Spanish, using datasets from different sub-domains based on different annotation guidelines. I show that multi- and cross-lingual transfer works well but also strongly depends on the annotation types and definitions. After that, I re-use the discussed models to show some preliminary results on the presented corpus, first only on medication detection and then across all annotated entity types. I find that medication detection shows promising results, especially considering that the models were fine-tuned on data from another sub-domain and applied in a zero-shot fashion to the new data.
Regarding the detection of other medical expressions, I find that the performance of the models strongly depends on the entity type, and I propose ways to address this in future work. Lastly, I summarize the presented work and discuss future steps.

Les travaux décrits dans cette thèse portent sur la détection et l'extraction trans- et multilingue des effets indésirables des médicaments dans des textes biomédicaux rédigés par des non-spécialistes. Dans un premier temps, je décris la création d'un nouveau corpus trilingue (allemand, français, japonais), centré sur l'allemand et le français, ainsi que le développement de directives, applicables à toutes les langues, pour l'annotation de contenus textuels produits par des utilisateurs de médias sociaux. Je décris ensuite le processus d'annotation et fournis un aperçu du jeu de données obtenu. Dans un second temps, j'aborde la question de la confidentialité en matière d'utilisation de données de santé à caractère personnel. Je présente ensuite un prototype d'étude sur la façon dont les utilisateurs réagissent lorsqu'ils sont directement interrogés sur leurs expériences en matière d'effets indésirables liés à la prise de médicaments. L'étude révèle que la plupart des utilisateurs ne voient pas d'inconvénient à décrire leurs expériences lorsqu'on le leur demande, mais que la collecte de données pourrait souffrir de la présence d'un trop grand nombre de questions. Dans un troisième temps, j'analyse les résultats d'une potentielle seconde méthode de collecte de données sur les médias sociaux, à savoir la génération automatique de pseudo-tweets basés sur des messages Twitter réels. Dans cette analyse, je me concentre sur les défis que cette approche induit. Je conclus que de nombreuses erreurs de traduction subsistent, à la fois au niveau du sens du texte et des annotations. Je résume les leçons apprises et je présente des mesures potentielles pour améliorer les résultats. Dans un quatrième temps, je présente des résultats expérimentaux de classification translingue de documents, en anglais et en allemand, en ce qui concerne les effets indésirables des médicaments.
Pour ce faire, j'ajuste les modèles de classification sur différentes configurations de jeux de données, d'abord sur des documents anglais, puis sur des documents allemands. Je constate que l'incorporation de données d'entraînement anglaises aide à la classification de documents pertinents en allemand, mais qu'elle n'est pas suffisante pour atténuer efficacement le déséquilibre naturel des classes des documents. Néanmoins, les modèles développés semblent prometteurs et pourraient être particulièrement utiles pour collecter davantage de textes, afin d'étendre le corpus actuel et d'améliorer la détection de documents pertinents pour d'autres langues. Dans un cinquième temps, je décris ma participation à la campagne d'évaluation n2c2 2022 de détection des médicaments, qui est ensuite étendue de l'anglais à l'allemand, au français et à l'espagnol, en utilisant des ensembles de données de différents sous-domaines. Je montre que le transfert trans- et multilingue fonctionne bien, mais qu'il dépend aussi fortement des types d'annotation et des définitions. Ensuite, je réutilise les modèles mentionnés précédemment pour mettre en évidence quelques résultats préliminaires sur le corpus présenté. J'observe que la détection des médicaments donne des résultats prometteurs, surtout si l'on considère que les modèles ont été ajustés sur des données d'un autre sous-domaine et appliqués sans réentraînement aux nouvelles données. En ce qui concerne la détection d'autres expressions médicales, je constate que la performance des modèles dépend fortement du type d'entité et je propose des moyens de gérer ce problème. Enfin, les travaux présentés sont résumés, et des perspectives sont discutées.

Sprachübergreifende Informationsextraktion zur Erkennung und Prävention medizinischer Nebenwirkungen

Extraction d'information translingue pour l'évaluation et la prévention d'effets indésirables de médicaments
