Delete invalid .traineddata files in cache #753

Closed
Balearica opened this issue May 7, 2023 · 1 comment · Fixed by #757

Comments

@Balearica
Collaborator

One of the most common error messages reported is Error opening data file ./eng.traineddata (or the equivalent for other languages). This is due to our current caching behavior.

When a .traineddata file is downloaded, any fetch response reported as ok (which corresponds to a status of 200-299) is cached.

if (!resp.ok) {
  throw Error(`Network error while fetching ${fetchUrl}. Response code: ${resp.status}`);
}
data = await resp.arrayBuffer();

The cached file is then used until the user manually deletes it, even if the file is invalid. The assumption this code makes is that an ok response indicates that some .traineddata file was successfully downloaded, and if that file is somehow corrupted, that is because the developer uploaded a corrupted .traineddata file.

This does not appear to be the case. Some server configurations appear to return 200 responses, even if the langPath value is invalid (see #714). Furthermore, given user reports, this may even happen when the default langPath value is used (see #521), although the mechanism for this is unclear.
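To illustrate why resp.ok alone is not enough, here is a minimal sketch of a check that flags a response body that looks like an HTML page (for example, a "not found" page served with status 200) rather than binary data. The helper name looksLikeHtml is hypothetical and is not part of Tesseract.js; the heuristic is only an assumption about what such a misconfigured server might return.

```javascript
// Hypothetical helper, not part of Tesseract.js: a rough heuristic that
// inspects the first bytes of a downloaded body and reports whether they
// look like an HTML document instead of binary .traineddata content.
function looksLikeHtml(arrayBuffer) {
  // Only the beginning of the body is needed to spot an HTML page.
  const bytes = new Uint8Array(arrayBuffer).subarray(0, 256);
  const head = new TextDecoder('utf-8')
    .decode(bytes)
    .trimStart()
    .toLowerCase();
  return head.startsWith('<!doctype html') || head.startsWith('<html');
}
```

A check like this only catches one failure mode. Whether the file can actually be loaded still has to be determined when it is opened.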

We should change tesseract.js so that it deletes the saved .traineddata file when it detects that the file is invalid. With this change, the next time the code is run it will try to download the .traineddata file from langPath again, rather than reusing cached data that has already been determined to be invalid.
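The proposed behavior could be sketched roughly as follows. This is not the actual Tesseract.js implementation: loadLangData, cache, download, and validate are hypothetical names, and the cache is modeled as a plain Map so the example is self-contained.

```javascript
// Hypothetical sketch of the proposed behavior; the names here are
// illustrative and do not correspond to Tesseract.js internals.
async function loadLangData(lang, { cache, download, validate }) {
  let data = cache.get(lang);
  if (data === undefined) {
    // Nothing cached yet: download and store the result.
    data = await download(lang);
    cache.set(lang, data);
  }
  if (!validate(data)) {
    // Remove the bad entry so the next run downloads the file again
    // instead of reusing data already known to be invalid.
    cache.delete(lang);
    throw new Error(`Invalid .traineddata file for "${lang}"; cache entry removed`);
  }
  return data;
}
```

The first call that encounters invalid data still fails, but the failure is no longer permanent: a later call finds no cache entry and fetches the file again.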

@Balearica Balearica changed the title Delete invalid .traineddata files in cache Rework cache options, delete invalid .traineddata files in cache May 11, 2023
@Balearica Balearica changed the title Rework cache options, delete invalid .traineddata files in cache Delete invalid .traineddata files in cache May 11, 2023
@Balearica
Collaborator Author

Summary of this change

TL;DR Setting cacheMethod: 'none' or cacheMethod: 'refresh' to avoid invalid files being cached should no longer be necessary.

Explanation

By default, Tesseract.js caches .traineddata files to ensure they are only downloaded once. This is because .traineddata files are very large (most common languages are 10-25MB) and are virtually never updated. In certain uses of Tesseract.js, the majority of runtime is attributable to downloading the .traineddata file.

Prior to v4.0.6 there was a bug where cached .traineddata files were never cleared even if they were invalid. Therefore, if a user somehow received an invalid .traineddata file, Tesseract.js would stop working until it was manually cleared (throwing the error "Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.").

Due to this bug, many developers using Tesseract.js started bypassing the caching feature entirely by setting cacheMethod: 'none' or cacheMethod: 'refresh'. This is widely cited in other issues as the solution for the caching bug (e.g. #334, #351, #398, #414, #439, #481, #618, #676).

Starting in v4.0.6 invalid .traineddata files should be automatically cleared from the cache. Therefore, setting cacheMethod: 'none' or cacheMethod: 'refresh' as a workaround for this bug should no longer be necessary.
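As a rough illustration, the difference in worker options might look like the fragment below. The variable names are made up for this example, and it assumes the options object is the one passed to createWorker in v4.

```javascript
// Workaround commonly used before v4.0.6: bypass the cache entirely,
// at the cost of downloading the .traineddata file on every run.
const workaroundOptions = { cacheMethod: 'none' };

// From v4.0.6 onward: leave cacheMethod unset so the default caching
// behavior applies and invalid cached files are cleared automatically.
const defaultOptions = {};
```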
