Node.js: Loading corrupted language trained data does not throw an error #602

Closed
Razzmatazzz opened this issue Feb 18, 2022 · 1 comment · Fixed by #667

Razzmatazzz commented Feb 18, 2022

If the cached traineddata file becomes corrupted, tesseract.js still loads it without throwing an error. A later call to the recognize function then fails with an uncatchable fatal error.

Steps to reproduce the behavior:

  1. Place a copy of eng.traineddata.gz in the local project folder
  2. Create an empty file named eng.traineddata in the project folder to simulate a corrupted cache
  3. Run the following:
const { createWorker, OEM } = require('tesseract.js');
const Jimp = require('jimp');

(async () => {
    const worker = createWorker({
        langPath: __dirname,
        logger: message => {
            //console.log(message);
        },
        /*errorHandler: error => {
            console.log('error from worker:', error);
        }*/
    });
    try {
        const img = await Jimp.read('https://tesseract.projectnaptha.com/img/eng_bw.png');
        await worker.load();
        await worker.loadLanguage('eng');
        await worker.initialize('eng', OEM.LSTM_ONLY);
        console.log('Recognizing text...');
        const {data: { text } } = await worker.recognize(await img.getBufferAsync(Jimp.AUTO));
        console.log(text);
    } catch (error){
        console.log('caught error:', error);
    }
    process.exit();
})();

This results in the following output:

> tess-test@1.0 start
> node index.js

Error opening data file ./eng.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
Failed loading language 'eng'
Tesseract couldn't load any languages!
Recognizing text...
AdaptedTemplates != nullptr:Error:Assert failed:in file /workspace/tesseract/src/classify/adaptmatch.cpp, line 196
undefined
undefined
C:\Users\Razz\Documents\Visual Studio Code Projects\Razzmatazzz\tesstest\node_modules\tesseract.js\src\createWorker.js:173
        throw Error(data);
        ^

Error: RuntimeError: abort(undefined). Build with -s ASSERTIONS=1 for more info.
    at ChildProcess.<anonymous> (C:\Users\Razz\Documents\Visual Studio Code Projects\Razzmatazzz\tesstest\node_modules\tesseract.js\src\createWorker.js:173:15)
    at ChildProcess.emit (node:events:390:28)
    at emit (node:internal/child_process:917:12)
    at processTicksAndRejections (node:internal/process/task_queues:84:21)

Note the absence of "caught error", indicating that the error is not being caught. The "Error opening data file" output occurs on the worker.initialize() call, but it does not result in an exception being thrown at that point.

If, however, the errorHandler function is enabled, this is what happens:

> tess-test@1.0 start
> node index.js

Error opening data file ./eng.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
Failed loading language 'eng'
Tesseract couldn't load any languages!
Recognizing text...
AdaptedTemplates != nullptr:Error:Assert failed:in file /workspace/tesseract/src/classify/adaptmatch.cpp, line 196
undefined
undefined
error from worker: RuntimeError: abort(undefined). Build with -s ASSERTIONS=1 for more info.
caught error: RuntimeError: abort(undefined). Build with -s ASSERTIONS=1 for more info.

The worker's errorHandler function doesn't receive an error when the initialize function is called, but it does when recognize is called. Also, interestingly, the error triggered by calling the recognize function now becomes catchable.

I would expect the worker.recognize function to throw a catchable error, regardless of whether the user has specified an errorHandler for the worker. I would also expect the worker.initialize function to either throw an error when it can't load the specified traineddata or at least send an error to the errorHandler. Neither is currently done.
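
Until the library reports the failure itself, one possible guard on the caller's side is to inspect the cached file before loading the language. The sketch below covers only the empty-file case from the reproduction steps. `hasEmptyTraineddata` is a hypothetical helper, not part of tesseract.js, and a truncated or otherwise damaged file would still pass this check.

```javascript
const fs = require('fs');
const path = require('path');

// Hypothetical helper (not a tesseract.js API): returns true when the cached
// file <dir>/<lang>.traineddata exists but contains zero bytes.
function hasEmptyTraineddata(dir, lang) {
  const file = path.join(dir, `${lang}.traineddata`);
  try {
    return fs.statSync(file).size === 0;
  } catch (err) {
    // A missing cache file is not corruption; anything else is unexpected.
    if (err.code === 'ENOENT') return false;
    throw err;
  }
}
```

In the reproduction script, calling `hasEmptyTraineddata(__dirname, 'eng')` before `worker.loadLanguage('eng')` would flag the blank file, so the caller could remove it or abort before `recognize` is reached.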

Razzmatazzz changed the title from "Node.js: Loading corrupted language trainded data does not throw an error" to "Node.js: Loading corrupted language trained data does not throw an error" on Feb 18, 2022
@Balearica
Collaborator

Thanks for reporting. I agree that loading corrupted language data should throw an error at the initialize step. Luckily, this looks fairly easy to fix.

Rather than throwing an exception, the Tesseract API returns "0 on success and -1 on initialization failure". We do not check for this at present:

api.Init(null, langs, oem);

The initialize function should be edited to reject the promise when initialization fails. In addition to making sense conceptually, this should resolve the issue where invalid language data does not produce an error until the recognize step (see #602).
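
A minimal sketch of such a check, assuming only the return convention quoted above (0 on success, -1 on failure). `initOrThrow` and the stub object are hypothetical and stand in for the real worker code, which is not reproduced here:

```javascript
// Hypothetical wrapper: calls an Init-style method and converts a non-zero
// status into a thrown Error instead of silently continuing.
function initOrThrow(api, langs, oem) {
  const status = api.Init(null, langs, oem);
  if (status !== 0) {
    throw new Error(`initialization failed for language(s) '${langs}' (status ${status})`);
  }
  return status;
}

// Stub standing in for the Tesseract API object (assumption for illustration).
const failingApi = { Init: () => -1 };

try {
  initOrThrow(failingApi, 'eng', 1);
} catch (error) {
  console.log('caught error:', error.message);
}
```

When a function like this is called from inside an `async` function or a promise callback, the thrown error rejects the promise, which is the behavior the issue asks for at the initialize step.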
