
Use smaller .traineddata files by default #750

Closed
Balearica opened this issue May 1, 2023 · 1 comment

@Balearica (Collaborator)

For certain applications, by far the largest performance bottleneck is downloading the .traineddata file. The default files are very large because (1) they contain both a Legacy and an LSTM model and (2) they are integerized versions of the "tessdata_best" models, which are larger than their "tessdata_fast" counterparts. Some comparisons showing the potential savings of using different .traineddata files are below.

  1. English (eng)
    1. Current default: 10.4 MB
    2. LSTM-only, "fast" version: 1.9 MB
  2. Simplified Chinese (chi_sim)
    1. Current default: 19.2 MB
    2. LSTM-only, "fast" version: 1.6 MB

I have not experimented with the "fast" vs. "best" models, so more research would be needed before switching the default to a potentially less accurate model. However, simply omitting the Legacy model when it is not specifically requested may yield significantly smaller files with minimal downsides.
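For reference, users can already opt into the smaller LSTM-only files today by overriding `langPath`. A minimal sketch, assuming the v4 `createWorker` API and a host that serves the "tessdata_fast" models at the URL shown (the URL is illustrative, not the current default):

```js
const { createWorker } = require('tesseract.js');

(async () => {
  // langPath overrides where .traineddata files are fetched from.
  // The URL below is an assumed host for the LSTM-only "tessdata_fast"
  // files; substitute any server that serves eng.traineddata(.gz).
  const worker = await createWorker({
    langPath: 'https://tessdata.projectnaptha.com/4.0.0_fast',
  });

  await worker.loadLanguage('eng'); // fetches the ~2 MB fast model instead of ~10 MB
  await worker.initialize('eng');
  const { data: { text } } = await worker.recognize('image.png');
  console.log(text);
  await worker.terminate();
})();
```

Note that because the "fast" files contain no Legacy model, this only works with the LSTM engine mode; anyone specifically requesting the Legacy engine would still need the larger files.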

Beyond the delay of downloading a >10 MB file before performing a (potentially) small recognition task, large files also appear to increase the risk of errors due to network issues. Although English is (likely) the most popular language, searching the GitHub issues shows that most problems with language data involve Simplified Chinese, whose .traineddata file is roughly twice the size of the English one.

@Balearica (Collaborator, Author)

Closing as this is covered by #806.
