Discovering the same job ads expressed with different sentences by using hybrid clustering algorithms

Text mining studies on job ads have become widespread in recent years as a way to determine the qualifications required for each position. While a large resource pool exists for the English language, research on Turkish remains limited. Kariyer.Net is the largest job-ad company in Turkey, and 99% of its ads are in Turkish, so there is a clear need for novel Turkish Natural Language Processing (NLP) models to analyze this big database. In this study, the job ads of Kariyer.Net have been analyzed, and hidden associations in this big dataset have been discovered with a hybrid clustering algorithm. First, all ads stored as HTML code have been transformed into regular sentences by extracting the inner texts from the HTML. Then, these inner texts containing the core ads have been converted into sub-ads by traditional methods. After these NLP steps, hybrid clustering algorithms have been applied, and the same ads expressed with different sentences could be detected. The analysis focuses on 57 Information Technology positions with 6,897 ad texts. As a result, it can be claimed that the clusters obtained contain useful outcomes and that the proposed model can be used to discover the common and unique ads for each position.


Introduction
Today, all sectors are undergoing continuous transformation and, like many other fields, the job market could not escape digitization. Although the digitization of the job market has improved the interaction between recruiters and candidates, the number of job ads produced every day has become so large that examining them manually is no longer feasible. With the increasing number of electronic documents and the rapid growth of the internet, automatic document categorization has become a critical method for detecting and preparing information for use. Machine learning algorithms offer a sustainable solution; their most important benefits are speed and efficiency. Turning to the studies on English ads, a study conducted in 2010 began extracting qualifications with a linguistic approach, namely natural language processing (NLP) methods [1]. Machine learning algorithms followed: for example, a doctoral dissertation in 2018 performed skill mining on job ads using regression and artificial neural networks alongside NLP, and matched suitable candidates [2]. Some studies are domain-dependent; for example, a 2019 study carried out "skill recognition" on German-language ads, restricted to Computer Science, using supervised machine learning algorithms such as Random Trees [3].
While a large resource pool exists for the English language, research on Turkish remains limited. In this study, a large collection of up-to-date Turkish job ads has been analyzed with machine learning algorithms, and the same ads expressed with different sentences could be detected.
The job ads of Kariyer.Net have been analyzed, and hidden associations in this big dataset have been discovered with a hybrid clustering algorithm. First, all ads stored as HTML code have been transformed into regular sentences by extracting the inner texts from the HTML. Then, these inner texts containing the core ads have been converted into sub-ads by traditional Natural Language Processing (NLP) methods. After these NLP steps, hybrid clustering algorithms have been applied, and the same ads expressed with different sentences could be detected.
The following sections cover related works, methodologies, and experimental studies with their results.

Related Works
Recently, rapid changes in business life have forced employees to keep up with new conditions. Adapting to these changes requires not only technical skills but also many other skills and competencies. Analysis studies on job postings have generally been used to determine the qualifications needed for many positions. For example, Kennan et al. focused on the information systems jobs of Australian employers and IT graduates; the results showed that, beyond IT knowledge, skills, and competencies, communication skills and personal characteristics were also very important [4]. Choi and Rasmussen determined the priority qualifications for digital library positions by examining job postings, focusing on academic libraries; management, communication skills, and digital technology competencies emerged as the required qualifications [5]. In another study, Pember concluded that record-keeping professionals need experience beyond the knowledge and skills stated in the ads, and that record-keeping positions require proficiency in various areas of information management [6]. For example, record-keeping professionals must have skills such as good computer use, well-developed communication and leadership skills, personnel management skills and experience, good teamwork, and a strong customer focus. Abstract qualities such as motivation and enthusiasm for work, and personal characteristics such as analytical problem-solving, have also become prominent. Similar findings have been reported for many other positions, such as civil engineering [7]. On the other hand, some studies have focused on the competence criteria in the training of employees in a sector, rather than on which qualifications are needed.
For example, Kwon Lee and Han observed that, for the United States labor market, most universities place great emphasis on courses with operating systems and hardware content; however, the study concluded that employers do not care much about these qualities [8]. Yongbeom et al. also address the gap between employers' needs and universities' IT curricula [9].
In addition, there are many studies that analyze job ads with text mining to obtain useful information for decision making. For instance, data mining can be used to identify competencies emerging in companies and to promote employment opportunities or career development [10]. Classification systems for business areas have been moved from traditional to flexible structures that follow, e.g., market changes [11], and other studies focus on quick decision making based on the changes observed in the job market [12]. Text mining, or document-based analysis, requires algorithms such as latent semantic analysis or average linkage to obtain meaningful information from large amounts of text data. Compared to manual content analysis, document-based analysis takes less time and is cheaper [13].
Document-based analysis involves several pre-processing steps that clean the text, using techniques such as removing stop words and reducing the corpus to word roots. After pre-processing, document-based analysis algorithms extract words from the cleaned text of the job ads. The extracted words are processed with cluster analysis [14], latent semantic analysis [15][16], classification algorithms such as support vector machines and random trees [17][18], association rules, and latent Dirichlet allocation [17]. Document-based analysis has also been used for various other purposes, such as researching consumer perceptions of hotels based on online customer reviews [19] and using social media texts from Facebook and Twitter for competitive analysis [20]. Examples of using social media posts for product planning [21] and big data analysis in the financial sector [22] are also found.
Document-based analysis research on job ads can be grouped into three clusters. The first cluster uses document-based analysis of job ads to build a novel clustering scheme of job ads and compare it with the traditional occupational classification [15][16][17]. For instance, Mezzanzanica [12] used document-based analysis to investigate job ads in marketing; the corpus contained job ads from Italy, with texts in Italian. The researchers evaluated the ESCO job-classification taxonomy and a data mining algorithm on over 1.9 million job ads to capture the trends and dynamics observed in the evolution of the labor market, and identified several emerging professions. The second cluster of authors used document-based analysis to improve the quality of matching jobs to potential candidates based on commute time, type of work, hourly wages, and the candidate's skill set [14]. These studies note that inadequately matching candidates to job positions can cost organizations significantly, which is why document-based analysis is needed. The researchers in the third cluster applied document-based algorithms to obtain work profiles for certain vertical fields [16], such as information management [23] and big data [24], or for horizontal fields [25][26].
There are two methods for analyzing job ads: manual content analysis [27] and automatic text analysis, often called text mining or document-based analysis [28]. Document-based analysis has significant advantages over manual content analysis, such as requiring less time and fewer human resources [29]. In addition, there are examples in the literature of using document-based analysis of social media to gain a competitive advantage in various organizations [30][31], as well as studies that increase marketing efficiency with document-based analysis [32][33]. Document-based analysis methods are widely used to analyze information stored on social media websites, such as Twitter tweets [34]. Therefore, this study focuses on document-based analysis techniques for analyzing job ads on the Kariyer.Net platform, which can be interpreted as a social media platform.
Job postings contain unstructured text that makes analysis difficult. To cluster these postings successfully, grammatical and syntax errors must also be addressed, and this is where machine learning algorithms can make the difference. Machine learning algorithms need a large amount of appropriate text data to learn from before clustering models can be created. One of the first phases of using the corpus is to pre-analyze the data; in other words, getting the data ready for analysis is a very important step. Most of the available corpus is highly unstructured and contains incomplete and noisy content, and the data must be cleaned to obtain sound inferences and build better algorithms. Job posting data, however carefully written, is not yet standardized and can be regarded as informal; spelling errors, bad grammar, URLs, stop words, special expressions, and other undesirable content are the usual suspects.
In addition, since the data is obtained from the web, HTML commands are inevitably present in the text. The ad data may also come in various character encodings, such as Latin or UTF-8, and therefore needs to be decoded: by converting these complex symbols into simple, easy-to-understand characters, all data is kept in a standard encoding for better analysis. UTF-8 encoding is widely accepted and recommended. Finally, ineffective words and stop words must be deleted: commonly occurring, uninformative words should be ignored when the analysis is directed at the word level. Punctuation marks also need to be removed, handled according to priority; for example, ".", ",", and "?" are important punctuation marks, while the others should be removed [35].
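As an illustrative sketch of these cleaning steps (not the authors' actual pipeline), the snippet below decodes HTML entities, lowercases the text, strips URLs, removes punctuation, and drops stop words. The stop-word list here is a tiny hypothetical sample; a real Turkish list would be used in practice, and for simplicity all punctuation is stripped rather than prioritized.

```python
import html
import re
import string

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"ve", "ile", "bir", "için", "veya"}

def clean_ad_text(raw: str) -> str:
    """Normalize a raw job-ad snippet: decode entities, lowercase,
    strip URLs and punctuation, and drop stop words."""
    text = html.unescape(raw)                 # decode &amp; and similar entities
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text) # remove URLs
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return " ".join(tokens)

print(clean_ad_text("Java &amp; Kotlin bilgisi ve https://example.com tecrübe!"))
# -> java kotlin bilgisi tecrübe
```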

Methodologies
First, all ads in the form of HTML code have been transformed into regular sentences by extracting the inner texts from the HTML. This implementation has been coded with the following algorithm.
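The authors' exact listing is not reproduced here; a minimal Python sketch of the HTML-to-inner-text step, using only the standard-library parser, could look like the following (an assumption of how such extraction might be done, not the paper's implementation):

```python
from html.parser import HTMLParser

class InnerTextExtractor(HTMLParser):
    """Collects the text nodes of an HTML ad, skipping script/style blocks."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0
    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1
    def handle_data(self, data):
        # Keep visible text only; ignore content inside script/style.
        if not self._skip and data.strip():
            self.parts.append(data.strip())

def html_to_inner_text(html_source: str) -> str:
    parser = InnerTextExtractor()
    parser.feed(html_source)
    return " ".join(parser.parts)

ad = "<div><b>Aranan nitelikler:</b><ul><li>Java bilgisi</li></ul></div>"
print(html_to_inner_text(ad))  # -> Aranan nitelikler: Java bilgisi
```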
The K-means++ algorithm is based on the idea that a centre point represents a cluster; it tends to find globular clusters of roughly equal size. In the K-means++ procedure, k objects are first selected to represent the centre, or mean, of each cluster.
The remaining objects are assigned to the clusters they are most similar to, based on their distance from the cluster means. Then, by recalculating the average value of each cluster, new cluster centres are determined, and the object-centre distances are examined again. The Sum of Squared Errors (SSE) criterion is the one most used to evaluate K-means++ clustering; the clustering result with the lowest SSE is the best. The sum of squared distances between the objects and the centre points of their clusters is calculated by Eq. 1:

SSE = Σ_{i=1..k} Σ_{x ∈ C_i} d(x, m_i)²   (Eq. 1)

Here, d(x, m_i) is the standard Euclidean Distance (ED) between an object x in the set C_i and m_i, the centre point of the set C_i.
The K-means++ algorithm described above works with the ED criterion on two-dimensional data and iterates until no object changes cluster. However, this form of the algorithm is not suitable for web applications: checking at every iteration whether any object has left its cluster causes a time overhead on large datasets. Therefore, a K-means++ version based on an objective function has been preferred, and the algorithm has been adapted to work on multidimensional data in order to cluster web pages. First, since it is not possible to keep all the data in memory and process it at once, the vector representing each document is loaded in order. The Cosine Similarity criterion was added so that the distances of these vectors to the cluster centres can be calculated by different methods.
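A compact sketch of K-means++ with cosine similarity on document vectors is shown below. This is an illustrative implementation, not the paper's code: it L2-normalizes the vectors so that assigning by highest cosine similarity is equivalent to assigning by smallest Euclidean distance on the unit sphere (spherical k-means), and it uses the standard k-means++ seeding.

```python
import numpy as np

def kmeans_pp_init(X, k, rng):
    """k-means++ seeding: first centre uniform at random, later centres
    drawn with probability proportional to squared distance from the
    nearest already-chosen centre."""
    n = X.shape[0]
    centres = [X[rng.integers(n)]]
    for _ in range(k - 1):
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centres], axis=0)
        centres.append(X[rng.choice(n, p=d2 / d2.sum())])
    return np.array(centres)

def spherical_kmeans(X, k, iters=50, seed=0):
    """K-means++ on L2-normalized vectors, i.e. clustering by cosine."""
    rng = np.random.default_rng(seed)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    C = kmeans_pp_init(X, k, rng)
    for _ in range(iters):
        labels = np.argmax(X @ C.T, axis=1)   # nearest centre by cosine
        newC = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                         else C[j] for j in range(k)])
        newC = newC / np.linalg.norm(newC, axis=1, keepdims=True)
        if np.allclose(newC, C):
            break
        C = newC
    labels = np.argmax(X @ C.T, axis=1)
    return labels, C
```

In practice X would hold TF-IDF vectors of the ad documents; here any row matrix of feature vectors works.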
The DBSCAN algorithm is based on revealing the neighbours of data points in two- or multi-dimensional space. It is mostly used in the analysis of spatial data, since it treats the data from a spatial perspective. The basic concepts of DBSCAN are the core object, Eps, MinPts, directly density-reachable points, density-reachable points, and density-connected points. The algorithm takes the Eps and MinPts values as input parameters. Starting from any object in the database, it checks all objects. If the checked object has already been included in a cluster, it moves to the next object without any action. If the object has not been clustered before, it performs a region query and finds its neighbours within the Eps neighbourhood. If the number of neighbours is at least MinPts, the object and its neighbours form a new cluster. It then finds new neighbours by issuing a new region query for each neighbour that is not already clustered; if the number of neighbours of a queried point is at least MinPts, they are included in the cluster. Neighbourhood discovery is the most demanding part of the DBSCAN algorithm.
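The mechanics just described can be sketched in a few lines of Python. This is a minimal illustrative DBSCAN, not the paper's implementation: it computes all pairwise distances up front (the O(n²) step that a spatial index would replace), labels points with at least MinPts neighbours as core objects, and grows clusters by expanding core neighbourhoods; everything unreached stays noise.

```python
import numpy as np

NOISE = -1

def dbscan(X, eps, min_pts):
    """Minimal DBSCAN: a point with at least `min_pts` neighbours within
    `eps` (itself included) is a core object; clusters grow by expanding
    the neighbourhoods of core objects, and unreached points are noise."""
    n = len(X)
    # Pairwise Euclidean distances; an R*-tree or other spatial index
    # would replace this brute-force region query on large datasets.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbours = [np.flatnonzero(d[i] <= eps) for i in range(n)]
    labels = np.full(n, NOISE)
    cluster = 0
    for i in range(n):
        if labels[i] != NOISE or len(neighbours[i]) < min_pts:
            continue                    # already clustered, or not a core object
        labels[i] = cluster             # start a new cluster from core object i
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster
                if len(neighbours[j]) >= min_pts:
                    queue.extend(neighbours[j])   # j is core: keep expanding
            # border points are labelled but not expanded further
        cluster += 1
    return labels
```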
Performance improvements in this part significantly speed up the algorithm. In the neighbourhood analysis, instead of examining every point, indexing structures such as the R*-tree or spatial queries are used. With these structures, the cost of a single neighbourhood query drops from O(n) to O(log n), which provides a significant performance increase for DBSCAN. Since the DBSCAN algorithm takes the two parameters Eps and MinPts, it has been applied 7 times with different parameters to see the effect of both parameters on the clustering result. Unlike K-means++, DBSCAN does not assign every element of the database to a cluster; it can filter out exceptional data. Values marked by the algorithm as noise (exceptions) are not shown in the result graphs. When a very small Eps neighbourhood distance is given, only the very dense cluster areas, in other words the cluster cores, are found. When the Eps value is set to 0.2, the result is very close to the ideal clustering, although an unwanted small cluster appears near the 3rd cluster.
SOM consists of two layers of artificial neurons: an input layer and an output layer. The input layer is fed the feature vectors, so its size equals the number of dimensions of the input feature vector. The output layer, also called the output map, is usually arranged in a regular two-dimensional structure so that there are neighbourhood relations among the neurons. Every neuron in the input layer is fully connected to every output neuron, and each connection has a weight value attached to it. The algorithm proceeds as follows:
1. Randomly initialize the weight values of the neurons in the network.
2. Take an input vector (a target vector of the system).
3. Visit all values on the map and:
4. Calculate the distance between the input vector and the current map value as the Euclidean distance.
5. Take the node with the shortest distance (this node is called the best matching unit, BMU).
6. Update all nodes adjacent to the selected BMU, pulling them closer to the input vector with the following formula:

W_v(t+1) = W_v(t) + Θ(t) · α(t) · (D(t) − W_v(t))

where t is the current step, λ is the time limit on the steps, W_v is the current weight vector, D is the targeted input value, Θ(t) is the neighbourhood function (how far the update extends from the best-matching neuron), and α(t) is the time-dependent learning rate. The random selection of the data in the training phase of SOM is abandoned, and the data are instead chosen in the same order each time. This avoids the accuracy problem when connecting the sub-parts of the maps.
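The update rule above can be sketched as a single training step in NumPy. This is an illustrative example rather than the paper's code; the exponential decay schedules for the neighbourhood radius and learning rate are assumed, since the paper does not specify them.

```python
import numpy as np

def som_train_step(weights, x, t, n_steps, sigma0=1.0, alpha0=0.5):
    """One SOM update: find the best matching unit (BMU) by Euclidean
    distance, then pull every neuron towards input x, scaled by a
    Gaussian neighbourhood Theta and a decaying learning rate alpha(t).
    weights has shape (rows, cols, dim); x has shape (dim,)."""
    rows, cols, _ = weights.shape
    # BMU: the neuron whose weight vector is closest to x
    d = np.linalg.norm(weights - x, axis=2)
    bmu = np.unravel_index(np.argmin(d), (rows, cols))
    # squared grid distances from the BMU on the 2-D output map
    ii, jj = np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij")
    grid_d2 = (ii - bmu[0]) ** 2 + (jj - bmu[1]) ** 2
    # time-decayed neighbourhood radius and learning rate (assumed schedules)
    frac = t / n_steps
    sigma = sigma0 * np.exp(-frac)
    alpha = alpha0 * np.exp(-frac)
    theta = np.exp(-grid_d2 / (2 * sigma ** 2))
    # W(t+1) = W(t) + Theta(t) * alpha(t) * (x - W(t))
    return weights + (theta * alpha)[:, :, None] * (x - weights)
```

Feeding the same input repeatedly pulls the BMU (and, more weakly, its neighbours) towards that input, which is the behaviour the formula describes.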
This new approach first divides the map area into four, and the standard SOM processes all the small areas in parallel. Thus, the dataset is split across the processes and the complexity is reduced.
iPSOM is built from the standard SOM (SSOM); the algorithm always uses 2x2 neurons in each phase. iPSOM starts training with an SSOM of 2x2 neurons over the whole dataset. After that, the recursive structure of iPSOM is activated and, recursively, SSOMs of 2x2 neurons are processed on the divided datasets. Figure 1 shows the process flow of iPSOM for 4x4 neurons: there are four parallel maps, each trained as a 2x2 map, and at the end of iPSOM a 4x4 map is obtained by combining them in the same order used before splitting. The complexity of SOM is O(N²). However, this formula assumes that the product of the map size and the number of weights equals the product of the number of tuples and the number of attributes in the dataset. Therefore, if N² is split into N·C, where N is the total size of the map and C is the total size of the dataset, the changes in SOM speed across different datasets, and their effects, appear in more detail. When this formula is split into sub-components, the formulas in Eq. 2 and Eq. 3 are obtained.
where F is the number of tuples in the dataset, A is the number of attributes per tuple, M is the total number of neurons in the map, and W is the total number of weight variables per neuron.
where α is the total time of the SSOM algorithm.
When iPSOM is processed for 4x4 neurons (M = 16), the SSOM algorithm is first processed for 2x2 neurons (M = 4) over the whole dataset. Secondly, the dataset is divided into four pieces, one per neuron, according to proximity to the neurons' weight values. All proximity calculations in the algorithm use the ED formula in Eq. 5:

ED(X, Y) = sqrt( Σ_{i=1..d} (x_i − y_i)² )   (Eq. 5)

where X and Y are tuples in the dataset or neurons in the map, and d is the number of attributes if X and Y are tuples, or the total number of weight variables if X and Y are neurons.
The four pieces of the dataset are used for training by the SSOM algorithm in parallel; SSOM is processed for 2x2 neurons (M = 4) for each piece. After these pieces are trained, they are joined and the map with 4x4 neurons is obtained. The process time is calculated as follows. If SSOM were processed on the whole dataset with 4x4 neurons, its process time would be given by Eq. 4 (T_SSOM ∝ F · A · 16 · W). For 4x4 neurons, iPSOM is first processed with 2x2 neurons on the whole dataset, and the process time of this phase is given by Eq. 6 (T_1 ∝ F · A · 4 · W = T_SSOM / 4); this phase equals a quarter of SSOM. However, iPSOM continues to be trained towards 4x4 neurons: in parallel, on the four pieces of the dataset, SSOM with 2x2 neurons is processed again, and the process time of this phase is given by Eq. 7 (T_2 ∝ (F/4) · A · 4 · W = T_SSOM / 16). In total, the process time is given by Eq. 8 (T_iPSOM = T_1 + T_2 = (5/16) · T_SSOM), which shows that iPSOM takes 5/16, or nearly 1/3, of the SSOM time. This difference is kept even as the dataset grows larger. Moreover, if the map size grows, new terms are added to Eq. 8, and because the dataset is divided again and again, these additional process times approach zero, so the difference is preserved. Theoretically this holds; in practice, however, the machine running iPSOM must have enough cores to supply the parallel processing.
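The 5/16 ratio can be checked with a tiny cost model. The snippet below is an illustrative sketch under the stated assumption that training time is proportional to F · A · M · W and that the four parallel pieces contribute only one piece's worth of wall-clock time:

```python
def ssom_cost(F, A, M, W):
    """Operation count proportional to SSOM training time:
    dataset size (F tuples x A attributes) times map size (M neurons x W weights)."""
    return F * A * M * W

def ipsom_cost(F, A, W):
    """iPSOM cost for a 4x4 target map: one 2x2 pass (M = 4) over the full
    dataset, then four parallel 2x2 passes over quarter datasets, of which
    only one piece counts towards wall-clock time."""
    phase1 = ssom_cost(F, A, 4, W)        # full data, 2x2 map
    phase2 = ssom_cost(F // 4, A, 4, W)   # one of four parallel pieces
    return phase1 + phase2

F, A, W = 1600, 10, 3
print(ipsom_cost(F, A, W) / ssom_cost(F, A, 16, W))  # -> 0.3125, i.e. 5/16
```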

Experimental Studies and Results
For the analysis, 57 positions in the Information Technology sector, with 6,897 ad texts, have been focused on. First, by means of the DBSCAN algorithm, 94,246 instances have been grouped to detect the noise. Principal component analysis (PCA) has been used to reduce the 2,682 attributes to 2; the resulting figure is given in Figure 2. Then, the X-means algorithm has been used for clustering to obtain the initial centroids for iPSOM over the 73,372 instances remaining after reduction; a sample of the PCA output is given in Figure 3. In this step, 5,996 centroids have been obtained by the iPSOM algorithm. The iPSOM result has 24x24 neurons and is given in Figure 4. A sample of the grouped ad sentences:
- has developed an application for android developer with at least 4 years of experience in kotlin language
- at least 3 years of experience has developed an application with swift language for ios developer
- at least 4 years of experience in application development with java
- at least 5 years of experience in developing applications using microsoft .net and c #
- at least 5 years of experience in developing applications using microsoft .net and c #
- large-scale web application development experience using net technologies
- at least 3 years of experience in web-based application development with phyton 2 and phyton 3
- 3 years or more of experience in developing ios mobile applications
- at least 3 years of experience in ios application development
- …
In the first cluster example, the technical skills needed in various positions could be grouped. In cluster 2, the same requirements related to the military service obligation could be grouped across different advertisements. In the third cluster example, the advertisements indicating the university departments required for the positions were grouped. In cluster 4, the postings containing the years-of-experience requests for the positions were grouped.
According to Table 1, the positions requiring 5 or more years of experience are Business Development Manager, Hardware Design Team Leader, Senior Software Engineer, and Hardware Specialist. The positions with the lowest experience expectations, 2.5 years or less, were Hardware Engineer, Hardware Support Specialist, and Hardware Specialist.

Conclusions
Job advertisement analysis studies have become widespread in recent years as a way to determine the necessary qualifications for various positions. While a large resource pool exists for the English language, research on Turkish remains limited. Kariyer.Net is the largest job-ad company in Turkey, and 99% of its ads are in Turkish, so there is a clear need for novel Turkish Natural Language Processing (NLP) models to analyze this big database. In this study, the job ads of Kariyer.Net have been analyzed, and hidden associations in this big dataset have been discovered with a hybrid clustering algorithm. First, all ads stored as HTML code have been transformed into regular sentences by extracting the inner texts from the HTML. Then, these inner texts containing the core ads have been converted into sub-ads by traditional methods. After these NLP steps, hybrid clustering algorithms have been applied, and the same ads expressed with different sentences could be detected. The analysis focuses on 57 Information Technology positions with 6,897 ad texts. The results obtained in Section IV show that Cluster 1 contains the technical skills, Cluster 2 the military service requirements, Cluster 3 the graduation department ads, and Cluster 4 the postings containing the years-of-experience requests for the positions. As a result, it can be claimed that the clusters obtained contain useful outcomes and that the proposed model can be used to discover the common and unique ads for each position.

Author's Note
An abstract version of this paper was presented at the 9th International Conference on Advanced Technologies (ICAT'20), 10-12 August 2020, Istanbul, Turkey, under the title "Discovering The Same Job Ads Expressed with The Different Sentences by Using Hybrid Clustering Algorithms".