
BIT Alpha - ICoC


Bureau Ingénieur Tomasi S.A.R.L.

Adaptive OCR

Solution

______________________________________________________________________________________ B.I.T. Bureau Ingénieur Tomasi Sarl – 2649 route de Pujaudran - 31530 Mérenvielle -Tel : 05 62 13 95 32 - e-mail : [email protected] N° SIRET 503 902 983 00017 R. C. TOULOUSE

Summary

I. General presentation
II. Binarisation
III. Segmentation
IV. OCR Recognition
V. Sequencer
VI. Post-OCR correction with Spellchecking
VII. Pictures Treatment/Export
VIII. Export of content
IX. Contact


I. General presentation

B.I.T. has developed an adaptive OCR solution called BIT-Alpha.

This semi-automatic OCR adapts itself to all types of text, independently of their language, typeface or age.

Developed specifically for the processing of historical and heritage documents, BIT-Alpha supports scientific research and gives access to document content.

BIT-Alpha is a tool containing the whole workflow:

Binarisation

Segmentation

OCR recognition

Post-OCR correction with spellchecking

Picture processing/Export

Export of content

II. Binarisation

BIT-Alpha offers three binarisation modes:

Binarisation through thresholding, ideal for newspapers

BIT-Alpha analyses the document region by region, so the threshold is not the same at the top, at the bottom or in the corners of the page. Because each region is analysed separately, rather than the document as a whole, the binarisation adapts to the different contrast levels within the document.

Binarisation through the “Niblack” algorithm

BIT-Alpha analyses the contrast variance around each letter. It can therefore tell a letter apart from a color spot close to it, and eliminate background noise without eliminating parts of the letter.


BIT-Alpha performs the variance analysis over neighborhoods and thereby determines whether a pixel belongs to a text area, a non-text area, an interline space or a picture.

Binarisation based on an algorithm developed by B.I.T.

Thanks to this very advanced spectral-decomposition algorithm, BIT-Alpha is able to redraw and reconstruct damaged letters, as if it were choosing an optimal paint brush (fine or broad). It also preserves very fine character strokes that other algorithms may delete.

These binarisation modes prepare the document as well as possible, in order to obtain the best possible OCR results for historical and heritage documents.
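The first two modes can be illustrated with a minimal Python sketch. It is not BIT-Alpha's implementation: the tile size, the mean-based threshold rule, the window radius and the factor `k` are assumptions chosen for illustration. The Niblack function uses the textbook threshold T = mean + k × standard deviation over a local window. Images are lists of rows of grey values (0 = black, 255 = white), and the output marks ink pixels with 1.

```python
from statistics import mean, pstdev


def tile_threshold(img, tile=2):
    """Binarise with one threshold per tile (assumed rule: the tile mean)."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for ty in range(0, h, tile):
        for tx in range(0, w, tile):
            ys = range(ty, min(ty + tile, h))
            xs = range(tx, min(tx + tile, w))
            # The threshold is computed from this tile only, so a dark
            # region of the page does not influence a bright one.
            t = mean(img[y][x] for y in ys for x in xs)
            for y in ys:
                for x in xs:
                    out[y][x] = 1 if img[y][x] < t else 0
    return out


def niblack(img, radius=1, k=-0.2):
    """Binarise with the Niblack rule T = mean + k * std per pixel window."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Grey values in the square neighbourhood, clipped at the border.
            window = [
                img[j][i]
                for j in range(max(0, y - radius), min(h, y + radius + 1))
                for i in range(max(0, x - radius), min(w, x + radius + 1))
            ]
            t = mean(window) + k * pstdev(window)
            out[y][x] = 1 if img[y][x] < t else 0
    return out
```

In both functions a perfectly uniform area produces no ink, because no pixel is darker than its own local threshold. Real pages need much larger tiles and windows than the defaults used here.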

III. Segmentation

BIT-Alpha segments titles, sub-titles, pictures, picture captions, chapters and articles, for example in newspapers:

Fraktur, dated 1805 to 1944: segmentation of title, sub-titles and chapters


During segmentation, BIT-Alpha detects each line, then each word within a line, and then each character within a word individually.

Note that BIT-Alpha can output the position of each character (for example into an ALTO file).
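As an illustration of per-character positions, the following Python sketch builds an ALTO-style fragment for one word. The element and attribute names (`String`, `Glyph`, `CONTENT`, `HPOS`, `VPOS`, `WIDTH`, `HEIGHT`) follow the ALTO schema as commonly used; the function name and the input tuple layout are hypothetical, and the fragment is not claimed to match BIT-Alpha's actual output.

```python
import xml.etree.ElementTree as ET


def word_to_alto(chars):
    """Build an ALTO-style <String> with one <Glyph> child per character.

    chars: list of (character, x, y, width, height) boxes for one word
    (hypothetical input layout, coordinates in pixels).
    """
    # The word box is the bounding box of all its character boxes.
    x0 = min(c[1] for c in chars)
    y0 = min(c[2] for c in chars)
    x1 = max(c[1] + c[3] for c in chars)
    y1 = max(c[2] + c[4] for c in chars)
    string = ET.Element("String", {
        "CONTENT": "".join(c[0] for c in chars),
        "HPOS": str(x0),
        "VPOS": str(y0),
        "WIDTH": str(x1 - x0),
        "HEIGHT": str(y1 - y0),
    })
    for ch, x, y, w, h in chars:
        ET.SubElement(string, "Glyph", {
            "CONTENT": ch,
            "HPOS": str(x),
            "VPOS": str(y),
            "WIDTH": str(w),
            "HEIGHT": str(h),
        })
    return string


if __name__ == "__main__":
    element = word_to_alto([("q", 10, 20, 8, 12), ("u", 19, 22, 8, 10)])
    print(ET.tostring(element, encoding="unicode"))
```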

IV. OCR Recognition

Developed for the processing of historical and heritage documents, BIT-Alpha is an adaptive OCR capable of adapting itself to all types of text, independently of their language, typeface or age.

Character learning can be done manually or automatically:

Manual

Training with human action:

Characters are stored in memory as digital signatures. As the “image” of a character is much heavier than its digital signature, BIT-Alpha can build bigger databases than tools that save “images” of characters.

Automatic

Training without human action:

BIT-Alpha can learn characters automatically from the text being processed. During a batch process, BIT-Alpha reads and recognizes the characters it already knows; those recognized with high reliability are then used to train the OCR engine. BIT-Alpha’s reliability rate thereby increases with each processed page.

A spellchecking database adapted to the type of documents to be processed (for example a Latin database) can be loaded into BIT-Alpha. If BIT-Alpha recognizes a word from the database, it automatically learns all the characters that make up this word. BIT-Alpha can handle databases of more than 500 000 words.

BIT-Alpha is able to identify the fonts used in a text even when they are mixed: Gothic (before 1845), Fraktur (after 1845), Antiqua, Cursive, Greek, Hebrew...


BIT-Alpha is able to recognize and read embellished letters, miniatures and abbreviations, and can deal with unusual characters.
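The automatic training described above, where characters recognized with high reliability are fed back into the engine, can be sketched as a self-training loop. Everything below is an assumption made for illustration: signatures are modelled as plain tuples of numbers, recognition as a nearest-neighbour search, and the confidence formula and the 0.8 cut-off are arbitrary. BIT-Alpha's real signature format and classifier are not documented here.

```python
import math


def classify(signature, known):
    """Return (label, confidence) for the nearest stored signature."""
    best_label, best_dist = None, math.inf
    for label, references in known.items():
        for ref in references:
            d = math.dist(signature, ref)
            if d < best_dist:
                best_label, best_dist = label, d
    # Assumed confidence measure: 1.0 for an exact match, falling with distance.
    return best_label, 1.0 / (1.0 + best_dist)


def process_page(page, known, min_confidence=0.8):
    """Recognize one page and learn from its high-confidence characters."""
    text = []
    for signature in page:
        label, confidence = classify(signature, known)
        text.append(label)
        if confidence >= min_confidence:
            # Reliable result: store the signature as a new training sample,
            # so later characters (and pages) are matched against it too.
            known[label].append(signature)
    return "".join(text)


if __name__ == "__main__":
    known = {"a": [(0.0, 0.0)], "b": [(1.0, 1.0)]}
    print(process_page([(0.1, 0.0), (0.9, 1.0), (0.5, 0.4)], known))
    print({label: len(refs) for label, refs in known.items()})
```

Only the first two characters of the example page are close enough to a known signature to be learned; the third is recognized but not added to the database.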

V. Sequencer

The Sequencer makes it possible to:

Reconstruct fragmented characters: sometimes a letter is fragmented into two or more parts. BIT-Alpha recognises the fragments of the letter and reassembles it.

Recognition of the right hand side of a lower-case “n” (RKN)

Recognition of the left hand side of a lower-case “n” (LKN)


Assembling of the two fragments by the sequencer and reconstruction of

the “n”

Expand abbreviations

In Roman writing, a “q” followed by “;” means “que”.

Correct wrong sequences of letters

Where another OCR reads “nnn”, the Sequencer corrects it to “mm”. BIT-Alpha takes into account the typical letter sequences of the language of the processed document and is therefore able to correct incorrect sequences. For example, in Latin the wrong sequence “dcn” is changed into the typical one, “den”. Another example is the incorrect sequence “qn”, which is changed into the typical Latin sequence “qu”.

The Sequencer comprises more than 900 sequences preprogrammed in BIT-Alpha’s database. With each use, the Sequencer’s database can be extended and, conversely, preprogrammed sequences that cause problems can be removed.
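The abbreviation and sequence rules listed above can be pictured as an editable substitution table. The Python sketch below uses only the four examples given in this section; the table layout, the function name and the longest-first ordering are assumptions, and the plain string substitution ignores the language and context checks that a real sequencer would need.

```python
# Example rules taken from the text above; the real database holds 900+.
SEQUENCES = {
    "nnn": "mm",   # misread minims
    "dcn": "den",  # untypical Latin sequence
    "qn": "qu",    # untypical Latin sequence
    "q;": "que",   # abbreviation expansion
}


def apply_sequences(text, sequences=SEQUENCES):
    """Replace every known wrong sequence in text by its correction."""
    # Longer patterns are applied first so a short rule cannot break up
    # a longer one that overlaps it.
    for wrong in sorted(sequences, key=len, reverse=True):
        text = text.replace(wrong, sequences[wrong])
    return text


if __name__ == "__main__":
    print(apply_sequences("dcniq; qnod sunnna"))

    # The table can be adapted: copy it, then add or remove rules.
    custom = dict(SEQUENCES)
    del custom["nnn"]
    print(apply_sequences("sunnna", custom))
```

Removing a rule from the copied table mirrors the option of dropping preprogrammed sequences that cause problems for a given corpus.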

VI. Post-OCR correction with Spellchecking

BIT-Alpha’s post-OCR correction is based on the “Levenshtein” distance algorithm. BIT-Alpha analyses the edit distance between two words: the word in the text and the reference word from the database. Different editing operations correspond to different OCR mistakes and may have different weights. Thanks to this technology, BIT-Alpha is able to reconstitute words or to separate them with blanks if needed. For example, German compound words (very common in German) can be checked by checking their components individually against known words from the database, whereas for Latin texts (where compound words rarely occur) BIT-Alpha separates words that are stuck together by inserting blanks.

BIT-Alpha makes it possible to switch off the post-OCR correction and to adjust how aggressively it corrects the raw OCR results.
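A weighted edit distance of this kind can be sketched in Python as follows. The cost table, the default costs and the `max_distance` parameter (standing in for the "aggressiveness" setting) are illustrative assumptions, not BIT-Alpha's actual weights or options.

```python
def weighted_levenshtein(a, b, sub_cost=None, indel=1.0):
    """Edit distance where selected substitutions are cheaper than others.

    sub_cost maps (read_char, expected_char) to a cost below 1.0 for
    confusions that an OCR engine makes often (assumed values).
    """
    sub_cost = sub_cost or {}
    prev = [j * indel for j in range(len(b) + 1)]
    for i, ca in enumerate(a, 1):
        cur = [i * indel]
        for j, cb in enumerate(b, 1):
            s = 0.0 if ca == cb else sub_cost.get((ca, cb), 1.0)
            cur.append(min(
                prev[j] + indel,     # delete ca
                cur[j - 1] + indel,  # insert cb
                prev[j - 1] + s,     # substitute (or match)
            ))
        prev = cur
    return prev[-1]


def correct(word, lexicon, sub_cost=None, max_distance=1.0):
    """Return the closest lexicon word, or the word itself if none is close.

    A smaller max_distance makes the correction less aggressive.
    """
    best = min(lexicon, key=lambda w: weighted_levenshtein(word, w, sub_cost))
    if weighted_levenshtein(word, best, sub_cost) <= max_distance:
        return best
    return word


if __name__ == "__main__":
    costs = {("c", "e"): 0.3}  # assumed: "e" is often misread as "c"
    print(weighted_levenshtein("dcnique", "denique", costs))
    print(correct("dcnique", ["denique", "dominus"], costs, max_distance=0.5))
```

With all costs left at 1.0 the function reduces to the ordinary Levenshtein distance.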

VII. Pictures Treatment/Export

BIT-Alpha has very advanced technology for the processing of pictures (for

example in newspapers).

BIT-Alpha is able to detect pictures, to interpolate dithered images so that the dithering is removed, and to deliver a high-quality true-color digital image.

[Figure: dithered image (binary) beside the interpolated image without dithering (greyscale)]
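The principle of turning a dithered binary picture back into grey levels can be shown with a very simple local average. This Python sketch is only an assumed stand-in for the idea; the source does not describe BIT-Alpha's interpolation method, which is certainly more elaborate. The input uses 1 for a black dot and 0 for white paper, and the window radius is arbitrary.

```python
def undither(binary, radius=1):
    """Estimate grey levels (0-255) from the local density of black dots."""
    h, w = len(binary), len(binary[0])
    grey = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Pixels in the square neighbourhood, clipped at the border.
            window = [
                binary[j][i]
                for j in range(max(0, y - radius), min(h, y + radius + 1))
                for i in range(max(0, x - radius), min(w, x + radius + 1))
            ]
            ink = sum(window) / len(window)  # fraction of black dots
            grey[y][x] = round(255 * (1 - ink))
    return grey


if __name__ == "__main__":
    # A checkerboard dither pattern becomes a mid-grey area.
    checker = [[(x + y) % 2 for x in range(4)] for y in range(4)]
    for row in undither(checker):
        print(row)
```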

VIII. Export of content:

The results can be rendered in different formats, for example:

Txt

Pdf with highlighting (text as a transparent overlay over the original image, allowing search, selection and copying)

BIT-Alpha creates a lightweight pdf by reducing the resolution (dpi) of the document, in order to facilitate exchange of the document or online publication.

Alto (coordinates in pixels or in tenths of a millimetre)

Tei


Html

The Html export from BIT-Alpha keeps mathematical formulae, pictures, etc. and positions them at the same place as in the original document.

IX. Contact

Head of sales department

Anne Tomasi,

+33 786 844 845

[email protected]