Bibliographic record no. 1271149

A Comprehensive Dataset of Spelling Errors and Users’ Corrections in Croatian Language


Gledec, Gordan; Horvat, Marko; Mikuc, Miljenko; Blašković, Bruno
A Comprehensive Dataset of Spelling Errors and Users’ Corrections in Croatian Language // Data, 8 (2023), 5; 89, 11 doi:10.3390/data8050089 (international peer review, article, scientific)


CROSBI ID: 1271149

Title
A Comprehensive Dataset of Spelling Errors and Users’ Corrections in Croatian Language

Authors
Gledec, Gordan ; Horvat, Marko ; Mikuc, Miljenko ; Blašković, Bruno

Source
Data (2306-5729) 8 (2023), 5; 89, 11

Type, subtype, and category of work
Journal papers, article, scientific

Keywords
spellchecker ; n-grams ; natural language processing ; Croatian language ; user corrections dataset ; common error analysis

Abstract
This paper presents a unique and extensive dataset containing over 33 million entries with pairs in the form “spelling error → correction” from ispravi.me, the most popular Croatian online spellchecking service, collected since 2008. The dataset, compiled from the contributions of nearly 900,000 users, is a valuable resource for researchers and developers in the field of natural language processing (NLP), improving spellcheck accuracy, and language learning applications. The dataset may be used to accomplish several goals: (1) improving spellchecking accuracy by incorporating common user corrections and reducing false positives and negatives; (2) helping language learners identify common errors and learn correct spelling through targeted feedback; (3) analyzing data trends and patterns to uncover the most common spelling errors and their underlying causes; (4) identifying and evaluating factors that influence typing input; (5) improving NLP applications such as text recognition and machine translation. Tasks specific to the Croatian language include the creation of a letter-level confusion matrix and the refinement of word suggestions based on historical usage of the service. This comprehensive dataset provides researchers and practitioners with a wealth of information, opening the path for advancements in spellchecking, language learning, and NLP applications in the Croatian language.
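
The letter-level confusion matrix mentioned in the abstract could be approximated along the following lines. This is only a minimal sketch under stated assumptions: the record does not describe the dataset's file format or the authors' method, so the function name `letter_confusions` and the four toy pairs below are hypothetical illustrations and are not taken from the dataset. The sketch aligns each error with its correction using Python's standard `difflib.SequenceMatcher` and counts one-to-one letter substitutions only.

```python
from collections import Counter
from difflib import SequenceMatcher


def letter_confusions(pairs):
    """Count (typed_letter, intended_letter) substitutions in error/correction pairs."""
    counts = Counter()
    for error, correction in pairs:
        matcher = SequenceMatcher(None, error, correction, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            # Only equal-length 'replace' spans map letters one-to-one;
            # insertions, deletions, and uneven replacements are skipped.
            if tag == "replace" and (i2 - i1) == (j2 - j1):
                for typed, intended in zip(error[i1:i2], correction[j1:j2]):
                    counts[(typed, intended)] += 1
    return counts


# Hypothetical toy pairs (missing diacritics), NOT taken from the dataset.
toy_pairs = [
    ("zivot", "život"),
    ("kuca", "kuća"),
    ("covjek", "čovjek"),
    ("cesce", "češće"),
]

confusions = letter_confusions(toy_pairs)
for (typed, intended), n in confusions.most_common():
    print(f"{typed} -> {intended}: {n}")
```

The resulting counter is a sparse form of the confusion matrix: each key is a (typed, intended) letter pair and each value is how often that substitution was observed. A fuller treatment would also need to handle insertions, deletions, and transpositions, which this sketch deliberately ignores.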

Original language
English

Scientific fields
Computing



RELATED TO THIS WORK


Institutions:
Fakultet elektrotehnike i računarstva (Faculty of Electrical Engineering and Computing), Zagreb

Profiles:

Bruno Blašković (author)

Marko Horvat (author)

Miljenko Mikuc (author)

Gordan Gledec (author)

Links to the full text of the work:

Full-text access: doi, www.mdpi.com

Cite this publication:

Gledec, Gordan; Horvat, Marko; Mikuc, Miljenko; Blašković, Bruno
A Comprehensive Dataset of Spelling Errors and Users’ Corrections in Croatian Language // Data, 8 (2023), 5; 89, 11 doi:10.3390/data8050089 (international peer review, article, scientific)
Gledec, G., Horvat, M., Mikuc, M. & Blašković, B. (2023) A Comprehensive Dataset of Spelling Errors and Users’ Corrections in Croatian Language. Data, 8 (5), 89, 11 doi:10.3390/data8050089.
@article{article,
  author = {Gledec, Gordan and Horvat, Marko and Mikuc, Miljenko and Bla\v{s}kovi\'{c}, Bruno},
  title = {A Comprehensive Dataset of Spelling Errors and Users’ Corrections in Croatian Language},
  journal = {Data},
  year = {2023},
  volume = {8},
  number = {5},
  pages = {11},
  chapter = {89},
  chapternumber = {89},
  issn = {2306-5729},
  doi = {10.3390/data8050089},
  keywords = {spellchecker, n-grams, natural language processing, Croatian language, user corrections dataset, common error analysis}
}

Journal indexed in:


  • Web of Science Core Collection (WoSCC)
    • Emerging Sources Citation Index (ESCI)
  • Scopus


Included in other bibliographic databases:


  • INSPEC
  • CNKI
  • dblp Computer Science Bibliography
  • Digital Science
  • DOAJ
  • EBSCO
  • Gale
  • OpenAIRE
  • OSTI (U.S. Department of Energy)
  • ProQuest
  • RePEc
  • SafetyLit

