Manual vs Automatic Bitext Extraction
Aibek Makazhanov, Bagdat Myrzakhmetov, Zhenisbek Assylbekov



Date: 27.08.2018


Nazarbayev University, National Laboratory Astana
Nazarbayev University, School of Science and Technology
Kabanbay Batyr ave, Astana, Kazakhstan
aibek.makazhanov@nu.edu.kz, bagdat.myrzakhmetov@nu.edu.kz, zhassylbekov@nu.edu.kz
Abstract
We compare manual and automatic approaches to the problem of extracting bitexts from the Web in the framework of a case study on building a Russian-Kazakh parallel corpus. Our findings suggest that targeted, site-specific crawling results in cleaner bitexts with a higher ratio of parallel sentences. We also find that general crawlers combined with boilerplate removal tools tend to retrieve shorter texts, as some content gets cleaned out with the markup. When it comes to sentence splitting and alignment, we show that investing some effort in data pre- and post-processing, as well as fiddling with off-the-shelf solutions, pays a noticeable dividend. Overall we observe that, depending on the source, automatic bitext extraction methods may lack severely in coverage (retrieve fewer sentence pairs) and on average are less precise (retrieve fewer parallel sentence pairs). We conclude that if one aims at extracting high-quality bitexts for a small number of language pairs, automatic methods are best avoided, or at least used with caution.
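The two quantities contrasted in the abstract can be illustrated with a minimal sketch. It assumes that coverage is read as the number of automatically retrieved sentence pairs relative to the manually extracted ones, and precision as the share of retrieved pairs that are actually parallel. The function names, the `gold_parallel` set, and the toy data are hypothetical and are not taken from the paper, whose exact definitions may differ.

```python
def coverage(auto_pairs, manual_pairs):
    """Ratio of automatically retrieved pairs to manually extracted pairs.

    Assumption: coverage is a simple count ratio; 0.0 if there is no manual data.
    """
    if not manual_pairs:
        return 0.0
    return len(auto_pairs) / len(manual_pairs)


def precision(pairs, gold_parallel):
    """Share of retrieved pairs judged parallel.

    Assumption: `gold_parallel` is a set of (source, target) pairs that a
    human annotator has confirmed to be translations of each other.
    """
    if not pairs:
        return 0.0
    return sum(1 for pair in pairs if pair in gold_parallel) / len(pairs)


# Toy placeholder data (not from the paper): four manually aligned pairs,
# three automatically retrieved pairs, one of which is misaligned.
manual = [("ru 1", "kk 1"), ("ru 2", "kk 2"), ("ru 3", "kk 3"), ("ru 4", "kk 4")]
auto = [("ru 1", "kk 1"), ("ru 2", "kk 3"), ("ru 4", "kk 4")]

auto_coverage = coverage(auto, manual)
auto_precision = precision(auto, set(manual))
```

Under this reading, a method can score poorly on either axis independently: it may retrieve few pairs that are all parallel, or many pairs of which a large share is misaligned.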


Keywords: bitext extraction, crawling, sentence alignment
1. Introduction
General setting
General vs targeted crawling
Targeted crawling
Impact of crawling
Acknowledgements
