Research Article

Towards SMS Spam Filtering: Results under a New Dataset

Volume: 2 Number: 1 March 31, 2013
  • Tiago Almeida
  • José María Hidalgo
  • Tiago Silva

Abstract

The growth in the number of mobile phone users has led to a dramatic increase in SMS spam messages. Recent reports clearly indicate that the volume of mobile phone spam is rising year by year. In practice, fighting this plague is made difficult by several factors, including the lower rate of SMS that has allowed many users and service providers to ignore the issue, and the limited availability of mobile phone spam-filtering software. Probably, one of the major concerns in academic settings is the scarcity of public SMS spam datasets, which are sorely needed for the validation and comparison of different classifiers. Moreover, traditional content-based filters may have their performance seriously degraded, since SMS messages are fairly short and their text is generally rife with idioms and abbreviations. In this paper, we present details about a new real, public, and non-encoded SMS spam collection that is, as far as we know, the largest one available. We also offer a comprehensive analysis of this dataset in order to ensure that there are no duplicated messages coming from previously existing datasets, since such duplicates may ease the task of learning SMS spam classifiers and could compromise the evaluation of methods. Additionally, we compare the performance achieved by several established machine learning techniques. In summary, the results indicate that the procedure followed to build the collection does not lead to near-duplicates and, regarding the classifiers, that Support Vector Machines outperform the other evaluated techniques and, hence, can be used as a good baseline for further comparison.

Details

Primary Language

English

Subjects

Applied Mathematics

Journal Section

Research Article

Authors

Tiago Almeida

José María Hidalgo

Tiago Silva

Publication Date

March 31, 2013

Submission Date

January 30, 2016

Acceptance Date

-

Published in Issue

Year 2013 Volume: 2 Number: 1

APA
Almeida, T., Hidalgo, J. M., & Silva, T. (2013). Towards SMS Spam Filtering: Results under a New Dataset. International Journal of Information Security Science, 2(1), 1-18. https://izlik.org/JA53MP94ZP
AMA
1. Almeida T, Hidalgo JM, Silva T. Towards SMS Spam Filtering: Results under a New Dataset. IJISS. 2013;2(1):1-18. https://izlik.org/JA53MP94ZP
Chicago
Almeida, Tiago, José María Hidalgo, and Tiago Silva. 2013. “Towards SMS Spam Filtering: Results under a New Dataset”. International Journal of Information Security Science 2 (1): 1-18. https://izlik.org/JA53MP94ZP.
EndNote
Almeida T, Hidalgo JM, Silva T (March 1, 2013) Towards SMS Spam Filtering: Results under a New Dataset. International Journal of Information Security Science 2 1 1–18.
IEEE
[1] T. Almeida, J. M. Hidalgo, and T. Silva, “Towards SMS Spam Filtering: Results under a New Dataset”, IJISS, vol. 2, no. 1, pp. 1–18, Mar. 2013, [Online]. Available: https://izlik.org/JA53MP94ZP
ISNAD
Almeida, Tiago - Hidalgo, José María - Silva, Tiago. “Towards SMS Spam Filtering: Results under a New Dataset”. International Journal of Information Security Science 2/1 (March 1, 2013): 1-18. https://izlik.org/JA53MP94ZP.
JAMA
1. Almeida T, Hidalgo JM, Silva T. Towards SMS Spam Filtering: Results under a New Dataset. IJISS. 2013;2:1–18.
MLA
Almeida, Tiago, et al. “Towards SMS Spam Filtering: Results under a New Dataset”. International Journal of Information Security Science, vol. 2, no. 1, Mar. 2013, pp. 1-18, https://izlik.org/JA53MP94ZP.
Vancouver
1. Tiago Almeida, José María Hidalgo, Tiago Silva. Towards SMS Spam Filtering: Results under a New Dataset. IJISS [Internet]. 2013 Mar. 1;2(1):1-18. Available from: https://izlik.org/JA53MP94ZP