Abstract

When patterns to be recognised are described by features of continuous type, discretisation becomes either an optional or a necessary step in the initial data pre-processing stage. Characteristics of the data, such as the distribution of data points in the input space, can significantly influence the transformation of real-valued attributes into nominal ones, and the resulting performance of the classification systems that employ them. If the data comprise several separate sets, their discretisation becomes more complex, as varying numbers of intervals and different ranges can be constructed for the same variables. The paper presents research on irregularities in data distribution, observed in the context of discretisation processes. Selected discretisation methods were used, and their effect on the performance of decision algorithms, induced in the classical rough set approach, was investigated. The studied input space was defined by measurable style-markers, which, exploited as characteristic features, allow the task of stylometric authorship attribution to be treated as classification.
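The remark about separate sets can be illustrated with a minimal sketch. It is not the procedure used in the paper: equal-width binning is assumed here only because the abstract does not name the selected discretisation methods, and the helper names and the sample values are hypothetical. The sketch shows that a test set discretised on its own receives different cut points, and therefore different nominal values, than the same test set discretised with the cut points learned from the learning set.

```python
from bisect import bisect_right

def equal_width_cuts(values, bins):
    """Return the bins - 1 inner cut points of equal-width binning (assumed method)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    return [lo + i * width for i in range(1, bins)]

def discretise(values, cuts):
    """Map each real value to the index of the interval it falls into."""
    return [bisect_right(cuts, v) for v in values]

# Hypothetical values of one style-marker in a learning set and a test set.
train = [0.0, 1.0, 2.0, 3.0, 4.0]
test = [2.0, 3.0, 6.0]

train_cuts = equal_width_cuts(train, 2)  # range 0..4, one cut point
test_cuts = equal_width_cuts(test, 2)    # range 2..6, a different cut point

# Test set discretised independently versus with the learning-set cut points.
independent = discretise(test, test_cuts)
shared = discretise(test, train_cuts)

print("train cuts:", train_cuts)
print("test cuts:", test_cuts)
print("independent:", independent)
print("shared:", shared)
```

With these values the two sets yield different cut points for the same variable, so the same test sample can land in a different interval depending on which set defined the ranges.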

Authors and Affiliations

Urszula Stańczyk (1)
Beata Zielosko (2)

  1. Silesian University of Technology, ul. Akademicka 2A, 44-100 Gliwice, Poland
  2. University of Silesia in Katowice, ul. Będzińska 39, 41-200 Sosnowiec, Poland
