AnalyseLayout() for tesseract.js #656

Closed
ghost opened this issue Sep 1, 2022 · 5 comments · Fixed by #770

Comments

@ghost

ghost commented Sep 1, 2022

Is your feature request related to a problem? Please describe.
Currently it is not possible to perform a fast document layout analysis.

Describe the solution you'd like
The function AnalyseLayout() is present in Tesseract's C++ API, and it appears to be exposed already in tesseract.js-core, inside the glue.js file:
https://github.com/naptha/tesseract.js-core/blob/82c349860e5d0cd81449761077d0d113fdf04c1b/javascript/glue.js#L1481

The AnalyseLayout() function performs a very fast analysis of the document and returns the document segmented into boxes.

Describe alternatives you've considered
Using the regular worker.recognize() function, it is possible to perform layout analysis by working with the TSV output, but this requires a full analysis, whereas AnalyseLayout() uses another method that is much faster and can define the zones on which to later perform a worker.recognize().

Additional context
Using gImageReader with tesseract

@Balearica
Collaborator

I have no opposition to adding this, although probably won't have time personally (in the near future). Will likely require an interested user to develop an interface. As you note, the necessary API functions do appear to be exposed already (in the glue file), so it would presumably just require building an interface around that using JavaScript.

@ghost
Author

ghost commented Sep 14, 2022

@Balearica Thank you for the reply. Could you briefly describe where and what should be modified/added ?

I will see if I can do this. I would need some directions.

@Balearica
Collaborator

@mattiaCanevascini I have never used this particular function, however can speak to development more broadly. The first step of exposing a new feature is cloning Tesseract.js-core and familiarizing yourself with the examples.

For example, this is a basic recognition example in Tesseract.js-core. In contrast to the recognition example in Tesseract.js, you'll note that it uses lower-level functions (calls to methods of api and TessModule). Those are the building blocks for everything in this repo. Once you understand the examples, you can work to implement a proof-of-concept using additional functions from that repo.

Virtually every Tesseract API function is already included in Tesseract.js-core (including, as you note, api.AnalyseLayout ). What those functions lack is (1) documentation and (2) a user-friendly interface. Therefore, it's a matter of figuring out how the functions work, creating a user-friendly interface, and documenting it.

@ghost
Author

ghost commented Sep 14, 2022

@Balearica thank you for the description. It was what I was looking for. I will try :)

@Balearica
Collaborator

Balearica commented May 29, 2023

I added the ability to run layout analysis but not recognition to the master branch. It is included in releases starting at v4.1.0.

Running only layout analysis requires setting the output option for the recognize method. You need to (1) disable any outputs that require running recognition [notably the formats that are true by default] and (2) set the new layoutBlocks output format to true. An example is below.

await worker.recognize(files[0], undefined, {text: false, blocks: false, hocr: false, tsv: false, layoutBlocks: true});

The layoutBlocks output format is identical to the blocks output format in structure, and allows for retrieving bounding boxes for text blocks/paragraphs/lines/etc. Only blocks can be created if recognition has been run, and only layoutBlocks can be created if recognition has been skipped. With regard to content, the only difference should be that layoutBlocks has null values for all text and confidence fields.
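
As a rough sketch of the workflow asked for in the original request (layout analysis first, then recognition on each zone), something like the following might work. It assumes the result exposes the array as data.layoutBlocks, that each node carries a bbox of the form {x0, y0, x1, y1} nested as blocks > paragraphs > lines (as in the blocks output), and that recognize accepts a rectangle option of the form {left, top, width, height}. The helper names collectBoxes, bboxToRectangle, and recognizeZones are hypothetical and not part of Tesseract.js.

```javascript
// Collect bounding boxes at a chosen level ('block', 'paragraph', or 'line').
// Assumed shape: blocks[].paragraphs[].lines[], each node with bbox {x0, y0, x1, y1}.
function collectBoxes(layoutBlocks, level = 'block') {
  const boxes = [];
  for (const block of layoutBlocks || []) {
    if (level === 'block') { boxes.push(block.bbox); continue; }
    for (const paragraph of block.paragraphs || []) {
      if (level === 'paragraph') { boxes.push(paragraph.bbox); continue; }
      for (const line of paragraph.lines || []) {
        boxes.push(line.bbox);
      }
    }
  }
  return boxes;
}

// Convert a corner-based bbox into a {left, top, width, height} rectangle.
function bboxToRectangle(bbox) {
  return {
    left: bbox.x0,
    top: bbox.y0,
    width: bbox.x1 - bbox.x0,
    height: bbox.y1 - bbox.y0,
  };
}

// Run layout analysis only, then recognize each detected block separately.
// `worker` is expected to be an already-initialized Tesseract.js worker (v4.1.0+).
async function recognizeZones(worker, image) {
  const layout = await worker.recognize(image, undefined, {
    text: false, blocks: false, hocr: false, tsv: false, layoutBlocks: true,
  });
  const results = [];
  // Assumption: the layout-only result is available as data.layoutBlocks.
  for (const bbox of collectBoxes(layout.data.layoutBlocks, 'block')) {
    const { data } = await worker.recognize(image, { rectangle: bboxToRectangle(bbox) });
    results.push({ bbox, text: data.text });
  }
  return results;
}
```

Whether recognizing each zone separately is faster than a single full recognize call will depend on the document, so it is worth measuring before relying on this pattern.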
