Error sentences have the original word followed by the error word. MAL originated from Budanitsky and Hirst. It has all 1402 of its error sentences at the end. lcmal is MAL with all but the "" token converted to lowercase.

T20lc and T62lc were generated by running the script insertErrs.py on the noErrs test corpus. This script uses src/c/tmPipe, and its output depends on the language models 20klc.arpa and 62k.arpa. The test sets contain randomized spelling variations and thus cannot be duplicated exactly using this script. Errors are distributed evenly throughout the file at a rate of 1 error per 200 words.

    bzcat noErrs.bz2 | $SPELL/scripts/insertErrs.py $SPELL/languageModels/lc20k.arpa > T20lc

This is done using the makefile.
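
As a rough illustration of the error rate described above (1 error per 200 words, spread evenly through the file), the Python sketch below replaces one randomly chosen word in each full block of 200 words. It is not insertErrs.py. The function name insert_errors, the make_error callable, and the seed parameter are all assumptions made for this example, and reading "evenly" as "one error per block" is also an assumption. The real script chooses its error words using tmPipe and the language models, which this sketch does not attempt.

```python
import random


def insert_errors(words, make_error, rate=200, seed=0):
    """Replace one word in every full block of `rate` words.

    `make_error` is a hypothetical callable that maps a word to its
    erroneous form; it stands in for the language-model-based choice
    made by the real script.

    Returns the modified word list and a log of
    (index, original word, error word) tuples.
    """
    rng = random.Random(seed)  # seeded so this sketch is repeatable
    out = list(words)
    log = []
    # Only full blocks receive an error; a trailing partial block is left alone.
    for start in range(0, len(out) - rate + 1, rate):
        i = start + rng.randrange(rate)  # one random position inside the block
        original = out[i]
        out[i] = make_error(original)
        log.append((i, original, out[i]))
    return out, log


if __name__ == "__main__":
    corpus = [f"w{i}" for i in range(1000)]
    # str.upper is a placeholder error generator, not a realistic one.
    corrupted, log = insert_errors(corpus, str.upper)
    for index, original, error in log:
        print(index, original, error)
```

The log keeps each original word next to its error word, so the inserted errors can be recovered later when scoring a spelling corrector.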