Improve Progress Logs #598

jwedel · 2022-01-29T17:48:01Z

Is your feature request related to a problem? Please describe.
I am running multiple recognize jobs on multiple workers. It is very hard to implement a simple progress bar for the process.

There are inconsistencies when it comes to initialisation. For example, we get the status initializing api and then the status initialized api when it's done. Why not have one status and make use of the progress property? I needed to implement a mapping table to unify the messages: const statusMap = { 'initializing api': 'initialized api', 'initializing tesseract': 'initialized tesseract', 'loading language traineddata': 'loaded language traineddata', }
When working with multiple workers, I need to keep track of the worker ids and multiple initialisation phases (api, training data), and when the job is running, I need to keep track of job ids that run on multiple workers.
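To make the bookkeeping described above concrete, here is a minimal sketch of the mapping-table approach. The statusMap entries are taken from this issue; the message shape ({ workerId, status, progress }) and the helper names normalizeStatus and createInitTracker are assumptions for illustration, not part of the Tesseract.js API.

```javascript
// Mapping table from the issue: both wordings of a phase collapse to one key.
const statusMap = {
  'initializing api': 'initialized api',
  'initializing tesseract': 'initialized tesseract',
  'loading language traineddata': 'loaded language traineddata',
};

// Return the unified name for a status; unknown statuses pass through unchanged.
function normalizeStatus(status) {
  return statusMap[status] ?? status;
}

// Hypothetical helper: tracks initialisation progress per worker id.
// Assumes each log message looks like { workerId, status, progress }.
function createInitTracker() {
  const expectedPhases = Object.values(statusMap);
  const byWorker = new Map(); // workerId -> Map(unified status -> progress)

  return {
    update({ workerId, status, progress }) {
      if (!byWorker.has(workerId)) byWorker.set(workerId, new Map());
      byWorker.get(workerId).set(normalizeStatus(status), progress);
    },
    // Average over all expected phases; phases not seen yet count as 0.
    progressFor(workerId) {
      const phases = byWorker.get(workerId) ?? new Map();
      let sum = 0;
      for (const phase of expectedPhases) sum += phases.get(phase) ?? 0;
      return sum / expectedPhases.length;
    },
  };
}
```

Weighting every phase equally is a simplification; in practice the language download probably dominates, so a real progress bar may want per-phase weights.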

So just implementing a user-friendly 0-100% progress bar is far more complicated than implementing the OCR process itself.

Describe the solution you'd like
It would be great to unify the different initialisation phases. Moreover, it would be nice to have a pool-level progress value that reports the overall progress, without needing to collect the per-job values manually.
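As a rough illustration of what a pool-level progress value could look like, the sketch below aggregates per-job progress into a single 0 to 1 number. Everything here is hypothetical: createPoolProgress is not a Tesseract.js function, and the message shape ({ jobId, status, progress }) and the 'recognizing text' status string are assumptions about what the logger emits.

```javascript
// Hypothetical helper: combines the progress of many jobs into one value.
// totalJobs must be known up front so unfinished or unseen jobs count as 0.
function createPoolProgress(totalJobs) {
  const jobs = new Map(); // jobId -> highest progress seen so far (0..1)

  return {
    update({ jobId, status, progress }) {
      // Only the recognition phase is counted; other messages are ignored.
      if (status !== 'recognizing text' || jobId === undefined) return;
      // Keep the maximum so an out-of-order message never moves a job backwards.
      jobs.set(jobId, Math.max(jobs.get(jobId) ?? 0, progress));
    },
    overall() {
      let sum = 0;
      for (const p of jobs.values()) sum += p;
      return totalJobs > 0 ? sum / totalJobs : 0;
    },
  };
}
```

The update method is meant to be called from whatever logging callback the workers expose. The wiring is left out on purpose, since it depends on the library version in use.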

Describe alternatives you've considered
An alternative would be to ignore the initialisation at first and focus on the recognizing phase, but that in itself is very complicated.


Mobbbb · 2022-05-07T02:41:49Z

When loading a language, if the traineddata doesn't exist in the cache, Tesseract downloads it first. But the log doesn't report any intermediate progress, only 0 at the beginning and 1 at the end. I can only show a fake progress bar to imitate the real progress, which is obviously very inaccurate.

Balearica · 2023-08-24T05:06:33Z

There are several distinct issues brought up here, so I'll try to respond to each below.

Verbiage Changing with Progress ("Initializing" vs. "Initialized", "Loading" vs. "Loaded")

I agree that the verbiage should be consistent: using "initializing" when progress is 0 and then "initialized" when progress is 1 unnecessarily complicates things. This is evidenced by the fact that switching from consistent verbiage to inconsistent verbiage broke the loading bars in this repo's own demo site, which remain broken to this day.

Unfortunately, this issue was not introduced recently; it first appears in this commit from 2018 (in the alpha version of Tesseract.js v2). Therefore, "fixing" it will be a breaking change for existing code. I still think we should make this change, but it will need to happen in a major release (the next release will be v5).

Simplified Progress Reporting

I agree that a simplified progress reporting feature (whether at the worker or scheduler level) could be useful for new users trying to implement basic progress bars. I do not anticipate having the time to develop this; however, if somebody else were to implement an option for reporting simplified progress as you describe, and it works well, I would merge it in.

Language Data Loading Bar (@Mobbbb)

It is true that, at present, Tesseract.js loads a large amount of language data, and this can take a while and appear to stall any loading bar during that time. Unfortunately, I do not believe the Fetch API reports progress when downloading files, nor am I aware of any other way to implement this easily. However, I think this will largely become a moot point once we reduce the amount of language data downloaded by default. Once the changes described in #806 are implemented, the default English .traineddata will decrease from 10.4 MB to 2.95 MB (72% decrease) and the Chinese (simplified) .traineddata will decrease from 20.2 MB to 1.7 MB (about 92% decrease). Even without incremental progress reported for file downloads, files of this size should not produce a significant stall except on the slowest internet connections. This is another breaking change, so it will be implemented in Tesseract.js v5.
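On the Fetch API point: fetch has no built-in progress event, but a response body can be read as a stream, which allows byte-level progress when the server sends a Content-Length header. The sketch below is illustrative only and is not how Tesseract.js loads language data; readWithProgress and onProgress are hypothetical names. It assumes an environment with the standard Response and ReadableStream APIs (modern browsers, Node 18 or later). If the server applies Content-Encoding compression, the header and the decoded byte count can disagree, so the fraction is clamped to 1.

```javascript
// Reads a fetch Response body chunk by chunk and reports progress.
// onProgress(fraction, receivedBytes): fraction is null when the total
// size is unknown (no Content-Length header).
async function readWithProgress(response, onProgress) {
  const header = response.headers.get('Content-Length');
  const total = header ? Number(header) : null;
  const reader = response.body.getReader();
  const chunks = [];
  let received = 0;

  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    chunks.push(value);
    received += value.length;
    onProgress(total ? Math.min(1, received / total) : null, received);
  }

  // Join the chunks into a single Uint8Array holding the whole file.
  const out = new Uint8Array(received);
  let offset = 0;
  for (const chunk of chunks) {
    out.set(chunk, offset);
    offset += chunk.length;
  }
  return out;
}
```

A caller would pass the result of fetch(url) as the first argument. Whether this is practical inside a worker that also caches the file is a separate question that the sketch does not address.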

Balearica · 2023-09-28T07:50:55Z

The language in progress logs has been standardized in v5. Now, the same verbiage is used for the entirety of each step (no "initializing" vs "initialized", "loading" vs. "loaded", etc.). Additionally, waiting for progress should be much less of an issue as v5 significantly reduced file sizes (50-75%).

Given the above changes, I am closing this issue. If anybody here upgrades to Tesseract.js v5 and still finds reporting progress problematic, they should open a new issue.

Balearica added this to the v5.0 milestone Aug 30, 2023

Balearica closed this as completed Sep 28, 2023
