The number of users of social networking environments is increasing day by day. In parallel, new social networking platforms keep emerging on the internet in response to users' wishes and needs. Social networking environments, which have become indispensable because of the human instinct to socialize, also provide a setting for the unwitting disclosure of personal data. This study focuses on the health data that users disclose on social networks due to a lack of awareness. Using data collected from Twitter, it aims to identify tweets that disclose health data. To this end, tweets were collected from Twitter using search keywords related to personal health experiences and annotated by a group of computer engineers. The resulting corpus was preprocessed with Zemberek, a natural language processing tool for Turkic languages, and classified with the fastText library. With the resulting language model, tweets containing personal health data disclosures were detected with 88% accuracy. The main contributions of this paper are: (1) it is the first study to detect personal health data disclosures in Turkish; (2) it creates a set of Turkish search keywords that can serve as a reference for obtaining data covering the health data domain; (3) instead of the disease-specific approach frequently seen in the literature, it takes a holistic perspective by collecting tweets that contain many distinct keywords about health experiences; and (4) it creates a Turkish corpus of around 4,500 manually annotated tweets in the personal health data domain.
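The preprocessing and classification steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the `normalize` function is a simple standard-library stand-in for the Zemberek step, the label names `disclosure` and `no_disclosure` are hypothetical, and the sample tweets are invented for illustration. The only detail taken from fastText itself is its supervised input format, in which each line starts with a `__label__` prefix.

```python
import re

# Hypothetical label names; the abstract does not state the labels used.
DISCLOSURE = "disclosure"
NO_DISCLOSURE = "no_disclosure"


def turkish_lower(text: str) -> str:
    """Lowercase with Turkish casing rules (I -> ı, İ -> i)."""
    return text.replace("İ", "i").replace("I", "ı").lower()


def normalize(tweet: str) -> str:
    """Light cleanup standing in for the Zemberek preprocessing step."""
    text = re.sub(r"https?://\S+", " ", tweet)  # drop URLs
    text = re.sub(r"[@#]\w+", " ", text)        # drop mentions and hashtags
    text = re.sub(r"[^\w\s]", " ", text)        # drop punctuation
    return " ".join(turkish_lower(text).split())


def to_fasttext_line(label: str, tweet: str) -> str:
    """Format one annotated tweet as a fastText supervised training line."""
    return f"__label__{label} {normalize(tweet)}"


# Invented example tweets, not taken from the paper's corpus.
annotated = [
    (DISCLOSURE, "Bugün doktora gittim, MİGREN teşhisi kondu! https://t.co/abc @ali"),
    (NO_DISCLOSURE, "Migren hakkında yeni bir makale yayınlandı."),
]

for label, tweet in annotated:
    print(to_fasttext_line(label, tweet))
```

Lines in this format could be written to a text file (for example a hypothetical `train.txt`) and passed to `fasttext.train_supervised(input="train.txt")`. The explicit handling of `I` and `İ` is included because Python's default `str.lower()` does not follow Turkish casing rules.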
Primary Language | English |
---|---|
Subjects | Software Engineering (Other) |
Journal Section | Research Article |
Authors | |
Publication Date | June 30, 2022 |
Submission Date | May 9, 2022 |
Published in Issue | Year 2022 Volume: 11 Issue: 2 |