OCR Using Python and Its Application

Optical Character Recognition (OCR) of papers has tremendous practical value given the prevalence of handwritten documents in human exchanges. A discipline known as optical character recognition makes it possible to convert many kinds of texts or photos into editable, searchable


Introduction
OCR stands for Optical Character Recognition, a technology used to convert scanned images, PDFs, and other types of documents into machine-readable text. OCR software works by analysing the patterns of light and dark pixels in an image and trying to recognize the characters represented by those patterns.
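The idea of recognising characters from light and dark pixel patterns can be illustrated with a toy sketch. The 3×3 grids, the threshold of 128, and the two templates below are made-up assumptions for illustration only; real OCR engines rely on far more elaborate models.

```python
# Toy illustration: turn grayscale pixels into a dark/light pattern and
# compare it with stored character templates. All data here is made up.

def binarize(gray, threshold=128):
    """Return 1 for dark (ink) pixels and 0 for light (paper) pixels."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]

def match_score(pattern, template):
    """Fraction of pixels on which two equally sized binary grids agree."""
    total = sum(len(row) for row in pattern)
    same = sum(p == t for pr, tr in zip(pattern, template) for p, t in zip(pr, tr))
    return same / total

def recognize(gray, templates, threshold=128):
    """Pick the template label whose pattern agrees best with the image."""
    pattern = binarize(gray, threshold)
    return max(templates, key=lambda label: match_score(pattern, templates[label]))

# Hypothetical 3x3 templates for the characters "I" and "L".
TEMPLATES = {
    "I": [[0, 1, 0], [0, 1, 0], [0, 1, 0]],
    "L": [[1, 0, 0], [1, 0, 0], [1, 1, 1]],
}

# A noisy scan of "L": low values are dark ink, high values are paper.
scan = [[30, 220, 240], [45, 210, 235], [20, 60, 90]]
print(recognize(scan, TEMPLATES))
```

Template matching of this kind is only the simplest possible classifier; it is shown here because it makes the role of the pixel pattern explicit.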
OCR is commonly used in document management systems to convert paper documents into digital form so that they can be easily searched, indexed, and archived. It is also used in assistive technologies for people with visual impairments, where it converts printed text into speech or braille.
Overall, OCR technology is a useful tool for organizations and individuals looking to digitize and analyse information from paper documents. As OCR technology continues to evolve, it is likely that we will see even more advanced capabilities in the future, making it easier and more efficient to work with large amounts of information in digital form.
OCR technology has many advantages that make it a useful tool for many applications, including:
1. Time saving: OCR technology can automate the process of manually transcribing text, which can save a lot of time and effort, particularly for organizations with large volumes of documents.
2. Improved accuracy: While OCR technology is not perfect, it can significantly reduce errors compared to manual data entry. It can improve the accuracy and reliability of data and reduce the risk of costly mistakes.
3. Better data analysis: By digitizing text, OCR technology makes it possible to analyse and process data more efficiently. This can provide insights and information that may not have been possible to obtain from paper documents.
Overall, OCR technology has numerous applications across many different industries and can provide significant benefits in terms of efficiency, accuracy, and accessibility. As OCR technology continues to evolve, it is likely that we will see even more applications in the future.
OCR technology can be implemented using various methods, including:
1. Software-based OCR: This method involves using OCR software to scan and convert paper documents into digital formats. The software typically uses optical character recognition algorithms to detect and convert text characters.
2. Cloud-based OCR: This method involves using OCR technology that is hosted on cloud servers. Users can upload documents to the cloud service, where they are automatically processed and converted into digital formats.
3. Mobile OCR: This method involves using OCR technology that is integrated into mobile devices, such as smartphones and tablets. This allows users to scan and convert documents on the go, without the need for additional equipment.
OCR technology is not without its limitations and challenges. These include accuracy issues, language barriers, layout issues, cost, security, and privacy.

Data Collection and Storage:
We stored all the intermediate images in a folder. At the end, we stored all the collected and processed data in a CSV file. The extracted data can be saved in either a text file or a CSV file.
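A minimal sketch of the CSV storage step is given below, using Python's standard csv module. The column names, file names, and sample rows are hypothetical, since the exact layout of the stored data is not specified here.

```python
import csv
import io

# Hypothetical column names; the actual CSV layout is an assumption.
FIELDS = ["image", "line_no", "text", "confidence"]

def write_ocr_rows(rows, handle):
    """Write one CSV row per recognised text line to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    for row in rows:
        writer.writerow(row)

# Made-up OCR results for illustration.
rows = [
    {"image": "page1.png", "line_no": 1, "text": "Pay to: A, B", "confidence": 0.97},
    {"image": "page1.png", "line_no": 2, "text": "Amount 1500", "confidence": 0.91},
]

# An in-memory buffer keeps the example self-contained; to write a real file,
# use open("ocr_output.csv", "w", newline="") instead.
buffer = io.StringIO()
write_ocr_rows(rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Text that contains commas is quoted automatically by the csv module, which matters for OCR output because recognised lines often contain punctuation.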

Literature review:
The method of categorizing optical patterns in relation to alphanumeric or other characters is known as optical character recognition. The process also includes segmentation, feature extraction, and classification [1]. This literature review explores Optical Character Recognition (OCR) technology, focusing on its implementation using Python and its diverse applications. The primary objective is to create searchable PDFs from images [2]. OCR technology converts printed text into editable text. It is an extremely useful and widely adopted method that is used in many different applications.
Techniques for text preparation and segmentation can affect OCR accuracy [3]. Various OCR methods, especially those employing Python, are discussed, including the use of OCR software such as PaddleOCR. Much work has already been done by software companies such as Google, Microsoft, and Amazon, among others, whose OCR services are paid and available on their cloud platforms [4]. Virtually generated 3D worlds have recently grown in prominence as a way to reduce the need for manually tagged photographs. Unfortunately, producing realistic 3D content is itself difficult and time-consuming [6]. Future research includes optimizing the system for mobile phone implementation with limited CPU and memory resources, and geotagging images using GPS coordinates and an online database for various mobile applications [11]. Edges convey texture and a concise idea of every object's shape. Over the past few decades, many edge detection algorithms have been published and used. A novel edge detection method has been put forth that aims to handle varying light luminosity on colour images, operate under various lighting conditions, and offer the highest accuracy, maximum effectiveness, maximum signal-to-noise ratio (SNR), and minimum mean squared error (MSE) [12]. There is also open-source OCR software provided by some companies, such as Tesseract OCR by Google (Google has done excellent work on Tesseract OCR, but the open-source part of this software does not provide efficient results). Other such software includes lanyocr, mmlabs, and paddleocr. The review concludes with insights into creating Docker images for Python applications and deploying them using Kubernetes.
There are many ways to perform OCR: (i) OCR software, (ii) Adobe Acrobat Pro, (iii) Google Drive, (iv) Microsoft Word, and (v) online OCR tools. In this work we use the first method, OCR software. In this method, we choose an OCR model to extract text from the image of our choice using software and hardware resources. There is a great deal of OCR software on the market, developed in different programming languages. Some of it is open source, and some is paid. Here we give some examples of OCR software written in Python. The one we use for OCR is PaddleOCR, which is itself written in Python. In the end, we discuss how to create a Docker image of the Python application we have created, how to publish it on Docker Hub, and how to run it on-premises using Kubernetes.
It is most acceptable to group statistics, database technology, information discovery, pattern recognition, machine learning, business, natural disasters, and other fields under the umbrella of data mining [15]. Data mining introduces the computing strategies and techniques needed to retrieve applicable and convenient information from the combined large databases known as big data [16]. To do data mining, we need to collect at least a certain amount of data, and some of that data may be collected from images and videos. To process this type of data, the content of the images must be converted into computer-readable text, which can be done using OCR technology.

Methodology Using Python
OCR can be done either directly with a Python library or with software built upon such libraries.

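1) OCR using Python libraries:
In Python, OCR can be done using the pytesseract library, a third-party wrapper around the Tesseract OCR engine. The steps are:
i) Import the cv2 and pytesseract libraries.
ii) Open an image using the imread function from the cv2 library.
iii) Read the text using the image_to_string function from the pytesseract library.
Algorithm:
    import cv2
    import pytesseract

    # Load the image from disk (the file name is a placeholder)
    img = cv2.imread("image.png")
    # Recognise English and Portuguese; --psm 6 assumes a single uniform block of text
    custom_config = r'-l eng+por --psm 6'
    txt = pytesseract.image_to_string(img, config=custom_config)
    print(txt)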
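The configuration string passed to image_to_string above combines a language list (-l) with a page segmentation mode (--psm). As a small sketch, a helper can assemble that string from its parts. The helper name build_tesseract_config is hypothetical and not part of pytesseract.

```python
def build_tesseract_config(languages, psm=6):
    """Build the option string passed to pytesseract via its `config` argument.

    languages: list of Tesseract language codes, e.g. ["eng", "por"]
    psm: page segmentation mode; 6 treats the image as one uniform text block
    """
    if not languages:
        raise ValueError("at least one language code is required")
    if not 0 <= psm <= 13:
        raise ValueError("Tesseract page segmentation modes range from 0 to 13")
    # Tesseract expects several languages joined with "+".
    return f"-l {'+'.join(languages)} --psm {psm}"

custom_config = build_tesseract_config(["eng", "por"], psm=6)
print(custom_config)
```

The resulting string can be passed as config=custom_config in the call to pytesseract.image_to_string. Each language listed must also have its trained data installed for Tesseract to use it.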

2) OCR using software built upon Python libraries:
There

Results and Discussion
Results obtained by applying the OCR methodology

Searchable PDF:
In the era of digitalization, the importance of searchable PDFs has increased dramatically. A searchable PDF is a type of document that allows users to search for specific words, phrases, or characters within the text of the document. In this paper, we present a comprehensive study on the techniques and applications of searchable PDFs. We discuss the advantages and disadvantages of searchable PDFs over other types of documents. We also present various techniques for creating searchable PDFs, including Optical Character Recognition (OCR) and automated indexing. Finally, we explore the applications of searchable PDFs in various fields, including education, healthcare, legal, and business.
The PDF (Portable Document Format) is one of the most widely used document formats in the world. It is a file format that preserves the layout, fonts, and graphics of a document, regardless of the software or hardware used to view it. A searchable PDF is a type of PDF that includes an OCR (Optical Character Recognition) layer, which allows the text within the PDF to be searched, selected, and copied. The OCR technology recognizes text within the image of the document and then converts it into searchable and editable text.
Advantages of Searchable PDFs:
Searchable PDFs have numerous advantages over other types of documents. For instance, they are much easier to search and navigate, which makes them more convenient for users. Additionally, they allow users to copy and paste text, which can save time and effort. Furthermore, searchable PDFs are more accessible to individuals with visual impairments, as they can use screen readers to read the text within the PDF.

Techniques for Creating Searchable PDFs:
There are several techniques for creating searchable PDFs. One of the most common techniques is OCR, which involves using software to recognize the text within an image of a document and then converting it into searchable text. OCR technology has advanced significantly in recent years, and it can now recognize various fonts, languages, and even handwriting. Another technique for creating searchable PDFs is automated indexing, which involves automatically extracting and indexing the text within a document. This technique can be useful for large-scale document processing, such as digitizing archives and libraries.
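Automated indexing can be sketched as an inverted index that maps each word to the pages on which it occurs. The page texts below are made-up stand-ins for OCR output, and the function names are hypothetical. This is one simple way to index extracted text, not necessarily the method used by any particular tool.

```python
import re
from collections import defaultdict

def build_index(pages):
    """Map each lower-cased word to the sorted list of page numbers containing it."""
    index = defaultdict(set)
    for page_no, text in pages.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(page_no)
    return {word: sorted(nums) for word, nums in index.items()}

def search(index, query):
    """Return the pages that contain every word of the query."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return []
    hits = set(index.get(words[0], []))
    for word in words[1:]:
        hits &= set(index.get(word, []))
    return sorted(hits)

# Hypothetical OCR output, keyed by page number.
pages = {
    1: "Invoice number 1042 issued to Acme Traders",
    2: "Payment terms: invoice due within 30 days",
    3: "Delivery receipt for Acme Traders",
}
index = build_index(pages)
print(search(index, "acme invoice"))
```

Because the index is built once and queried many times, this structure suits the large-scale archive and library digitization mentioned above.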
Applications of Searchable PDFs:
Searchable PDFs have numerous applications in various fields, including education, healthcare, legal, and business. In education, searchable PDFs can be used to create digital textbooks that allow students to search for specific concepts and keywords. In healthcare, searchable PDFs can be used to create patient records that can be searched and shared with other healthcare providers. In legal practice, searchable PDFs can be used to create electronic legal documents that can be searched and shared with other attorneys. In business, searchable PDFs can be used to create electronic contracts, invoices, and receipts that can be searched and shared with other business partners.
In conclusion, searchable PDFs are becoming increasingly important in the digital age, and their advantages over other types of documents are clear. By using OCR and automated indexing techniques, searchable PDFs can be created quickly and efficiently. Furthermore, their applications in various fields make them essential for modern document processing and management.
To create a searchable PDF from an image, we follow these steps:
i) Convert the given image into a PDF.
ii) Extract the text from the image, i.e., perform OCR on it. Any OCR software can be used; here PaddleOCR is chosen. Important: writing the OCR text onto a blank PDF depends on the output of the OCR software. PaddleOCR returns, in a tuple, the bounding boxes of the text lines and the text itself, and that information is used to draw the characters on the blank PDF.
iii) Create a second PDF with the same width and height as the image, with the text written at the same positions as in the image (that is, the words in the image and in the second PDF should be at the same coordinates).
iv) Merge the image PDF and the text PDF, taking care that the image PDF is overlaid on top of the text PDF.
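The coordinate matching between the image and the text PDF can be sketched as follows. Image coordinates usually start at the top-left corner with y growing downwards, whereas PDF coordinates start at the bottom-left corner with y growing upwards, so the y axis has to be flipped. The input layout used here (a box of four corner points paired with a text and confidence tuple) is an assumption modelled on the description above, and the function name is hypothetical.

```python
def to_pdf_placements(ocr_lines, image_height, scale=1.0):
    """Convert OCR boxes (image pixels, origin top-left) to PDF text placements
    (origin bottom-left).

    ocr_lines: list of (box, (text, confidence)) where box holds four [x, y]
               corner points. This layout is an assumption for illustration.
    Returns a list of dicts with the text position, font size, and text.
    """
    placements = []
    for box, (text, _confidence) in ocr_lines:
        xs = [pt[0] for pt in box]
        ys = [pt[1] for pt in box]
        left, top, bottom = min(xs), min(ys), max(ys)
        placements.append({
            "x": left * scale,
            # Flip the y axis: PDF y grows upwards from the bottom edge.
            "y": (image_height - bottom) * scale,
            # Use the box height as a rough font size so the text fills the line.
            "font_size": (bottom - top) * scale,
            "text": text,
        })
    return placements

# Hypothetical OCR output for a 200-pixel-tall image.
ocr_lines = [
    ([[10, 20], [150, 20], [150, 40], [10, 40]], ("Hello world", 0.98)),
    ([[10, 60], [120, 60], [120, 85], [10, 85]], ("Second line", 0.95)),
]
placements = to_pdf_placements(ocr_lines, image_height=200)
for p in placements:
    print(p)
```

Each placement can then be drawn on the blank page with a PDF library, for example with ReportLab's Canvas.setFont and Canvas.drawString, before the two PDFs are merged. The scale parameter is only needed if the PDF page size differs from the image size in pixels.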
Our code has separate parts that each perform one task on the given image: i) convert to PDF, ii) OCR, iii) draw on PDF, iv) merge pages.
[Figure: output image]
4. After performing the template matching, we found the number of digits using the measure label function, which identifies the connected areas of the image.
5. We then resized each digit image to 28×28 and used our trained KNN model to predict the digits from the image.
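Steps 4 and 5 can be illustrated with a small pure-Python sketch. The first function counts connected regions in a binary image, standing in for a library labelling routine (here with 4-connectivity, which is an assumption). The second is a plain k-nearest-neighbours vote. The tiny grid and the 4-element vectors are made-up stand-ins for a real cheque image and flattened 28×28 digit images.

```python
from collections import Counter, deque

def count_components(binary):
    """Count 4-connected regions of 1s in a binary grid (one region per digit)."""
    rows, cols = len(binary), len(binary[0])
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if binary[r][c] != 1 or seen[r][c]:
                continue
            count += 1                      # found a new, unvisited region
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:                    # breadth-first flood fill
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and binary[ny][nx] == 1 and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
    return count

def knn_predict(sample, train_vectors, train_labels, k=3):
    """Return the majority label among the k training vectors nearest to sample."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(sample, vec)), label)
        for vec, label in zip(train_vectors, train_labels)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Made-up binary strip containing three separate blobs ("digits").
strip = [
    [1, 0, 0, 1, 1, 0, 0],
    [1, 0, 0, 1, 0, 0, 1],
    [1, 0, 0, 0, 0, 0, 1],
]
print(count_components(strip))

# Made-up training data: short vectors stand in for flattened 28x28 images.
train_vectors = [[0, 0, 0, 0], [0, 1, 0, 0], [1, 0, 0, 0],
                 [9, 9, 9, 9], [9, 8, 9, 9], [8, 9, 9, 9]]
train_labels = [0, 0, 0, 1, 1, 1]
print(knn_predict([8, 8, 8, 9], train_vectors, train_labels))
```

In practice a trained model from a machine learning library would replace knn_predict; the sketch only shows the voting principle behind it.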
inference. Another category of artificial intelligence that can automate tasks that the human visual system can perform is computer vision. A combination of highly trained machine learning models and computer vision engines enables handwriting OCR to mimic the way humans read handwriting. PaddleOCR is an easy-to-use, open-source OCR repository that provides ultra-lightweight OCR systems and more than 80 multilingual recognition models. We use PaddleOCR to read both handwritten and printed Hindi text.

Conclusion:
In conclusion, OCR technology has changed the way we deal with and communicate textual information. Its efficiency, accuracy, availability, cost-effectiveness, and ability to integrate with other technologies make it a valuable tool across various industries, simplifying operations, improving accessibility, and opening new possibilities for data analysis and automation. In this paper, by using OCR technology, we have successfully created searchable PDFs and implemented cheque processing, which is beneficial for many types of companies (there are many more things that can be done using OCR). We have also described how to create a Docker image of a Python application and demonstrated how to run it. Searchable PDFs are becoming increasingly important in the digital age, and their advantages over other types of documents are clear. By using OCR and automated indexing techniques, searchable PDFs can be created quickly and efficiently. Furthermore, their applications in various fields make them essential for modern document processing and management.
There are many open-source libraries that can be used to do OCR. Some examples of such software are OpenMMLab, lanyocr, EasyOCR, PaddleOCR, and OCRopus. OpenMMLab is an open-source platform that aims to promote research and development in the field of multimedia machine learning. It is an initiative launched by the Multimedia Laboratory (MMLab) of The Chinese University of Hong Kong (CUHK) and is currently maintained by a team of developers from MMLab and other organizations.
1. Update the package index:
    sudo apt update
2. Install the required packages:
    sudo apt install apt-transport-https ca-certificates curl software-properties-common
3. Add the official Docker GPG key to ensure the authenticity of the Docker repository:
    curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg
4. Add the Docker repository to APT sources:
    echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
5. Update the package index again:
    sudo apt update
6. Install Docker:
    sudo apt install docker-ce
7. Start and enable the Docker service:
    sudo systemctl start docker
    sudo systemctl enable docker
8. Verify that Docker is running:
    sudo systemctl status docker
9. Optionally, add your user to the "docker" group to use Docker without sudo:
    sudo usermod -aG docker $USER
i) OCR using OpenMMLab: