Skip to content
This repository has been archived by the owner on May 10, 2021. It is now read-only.

Latest commit

 

History

History
180 lines (147 loc) · 8.45 KB

README.md

File metadata and controls

180 lines (147 loc) · 8.45 KB

SlideOCR: Automated lecture slide recognition via OCR

This project focuses on OCR pre processing. Its goal is to extract text from a video stream of lecture slides given specific timestamps. It is part of a bigger project called Gewinnung von Meta-Daten aus multimedialen Inhalten that aims at completely automated lecture content recognition via slides and audio.

Where to start

Installation

The code is written in Python 2.7. Apart from that we need several libraries and tools. Luckily they are all available for free.

If you are on Windows 7 (64 bit) you can just use the install script. The script may also runs on other Windows platforms. Before running the script, you will have to set the pythonInstallDir in slideocr/conf/Paths.py. If you don't do so and the script can't find the python install directory then it will stop its execution and politely ask you for help. The script requires administrator privileges.

A manual installation requires the following:

  1. requests (via pip)
  2. ftfy (via pip)
  3. django (via pip)
  4. OpenCV for Python (included in install directory)
  5. Python Imaging Library (PIL)
  6. NumPy
  7. python-dateutil
  8. MySQLdb
  9. FFmpeg (included in install directory)
  10. Tesseract (included in install directory)

FFmpeg and Tesseract have to be in the systems path variable.

Configuration

Part of the OCR works via ABBYY Cloud OCR SDK so we need to configure the necessary account details. We will need a developer account. If you don't want to use the ABBYY Cloud OCR SDK, then you should consider specifying the execution parameter --skip-abbyy.

If you want to use the MySQL extractor you will also have to specify your MySQL details. The programm connects to localhost on the default MySQL port (3306) and uses the mydb database.

To set all the details, create a Secrets.py in SlideOCR/SlideOcrCode/slideocr/conf and insert the following code:

class Secrets:
	
	ABBYY_APP_ID = "<your-app-id-goes-here>"
	ABBYY_PWD = "<your-app-password-goes-here>"

	MYSQL_USER = "<your-db-user-goes-here>"
	MYSQL_PWD = "<your-db-password-goes-here>"

Execution

The most convenient way to use the program is via executing process.py.

usage: process.py [-h] [--configFile CONFIGFILE]
                  [--workingDirectory WORKINGDIRECTORY] --sourceFile
                  SOURCEFILE [-e EXTRACTION] [--videoId] [--skipAbbyy]
                  [--skipTesseract]
                  [--preProcessingBounding PREPROCESSINGBOUNDING [PREPROCESSINGBOUNDING ...]]
                  [--preProcessingOCR PREPROCESSINGOCR [PREPROCESSINGOCR ...]]
                  [--skipCleanup] [--sigmaX SIGMAX] [--sigmaColor SIGMACOLOR]
                  [--thresh THRESH] [--blockSize BLOCKSIZE] [--C C] [--px PX]
                  [--interpolationMode {nearest,bicubic,bilinear,antialias}]
                  [--minAreaSize MINAREASIZE] [--maxAreaHeight MAXAREAHEIGHT]
                  [--mergeThreshold MERGETHRESHOLD]
                  [--boxWideningValue BOXWIDENINGVALUE]
                  [--heightOffset HEIGHTOFFSET]
                  [--tesseractLanguage TESSERACTLANGUAGE]
                  [--abbyyLanguage ABBYYLANGUAGE]
                  [--abbyyBatchSize ABBYYBATCHSIZE]

Extract text from a video stream of lecture slides

optional arguments:
  -h, --help            show this help message and exit

General:
  --configFile CONFIGFILE
                        Path to config-file.
  --workingDirectory WORKINGDIRECTORY
                        Path to a directory that will be used as temporary
                        workspace
  --sourceFile SOURCEFILE
                        Path to an image, a zipped set of images or a video
                        file that will be processed. Video files require the
                        option -e
  -e EXTRACTION, --extraction EXTRACTION
                        Path to a file that contains the frame extraction data
  --videoId             Set if sourceFile is an ID of a video that contains
                        the frame extraction data
  --skipAbbyy           skips ABBYY Cloud OCR processing
  --skipTesseract       skips Tesseract OCR processing
  --preProcessingBounding PREPROCESSINGBOUNDING [PREPROCESSINGBOUNDING ...]
                        List of pre processing steps that are executed to
                        enhance image quality for bounding box algorithms.
  --preProcessingOCR PREPROCESSINGOCR [PREPROCESSINGOCR ...]
                        List of pre processing steps that are executed to
                        enhance image quality for OCR runs.
  --skipCleanup         Keep the temporary files. Defaults to false

gaussianBlurring:
  --sigmaX SIGMAX       Gaussian kernel standard deviation in X direction. The
                        higher sigmaX, the stronger the blur effect. Should be
                        a value between 0 and 3.

bilateralFiltering:
  --sigmaColor SIGMACOLOR
                        Filter sigma in the color space. A larger value of the
                        parameter means that farther colors within the pixel
                        neighborhood (see sigmaSpace ) will be mixed together,
                        resulting in larger areas of semi-equal color.The
                        higher sigmaColor, the stronger the blur effect.

simpleThresholding:
  --thresh THRESH       threshold value.

adaptiveThresholding:
  --blockSize BLOCKSIZE
                        Size of a pixel neighborhood that is used to calculate
                        a threshold value for the pixel: 3, 5, 7, and so on.
  --C C                 Constant subtracted from the mean or weighted mean
                        (see the details below). Normally, it is positive but
                        may be zero or negative as well. Reduces Noises.

opening:
  --px PX               Discarded pixel.

Interpolation:
  --interpolationMode {nearest,bicubic,bilinear,antialias}
                        Interpolationsmodus.

BoundingBoxes:
  --minAreaSize MINAREASIZE
                        Minimal size of a text area. For detecting small
                        characters use a small value, but you will get quite
                        more boxes as result.
  --maxAreaHeight MAXAREAHEIGHT
                        Maximal height of a text area. For detecting large
                        characters, use a large value, but than lines with
                        small characters will be combined in one box, if the
                        sum of their heights is smaller than this
                        maxAreaHeight.
  --mergeThreshold MERGETHRESHOLD
                        Merging threshold to combine bounding boxes.
  --boxWideningValue BOXWIDENINGVALUE
                        Padding for bounding boxes.

textClassification:
  --heightOffset HEIGHTOFFSET
                        Distance of caption and footing to the average text
                        height.

ocrOptions:
  --tesseractLanguage TESSERACTLANGUAGE
                        Tesseract recognition language. Languages are set as
                        three character words (eng, deu, fra etc). Default is
                        eng.
  --abbyyLanguage ABBYYLANGUAGE
                        ABBYY recognition language. Languages are set as full
                        word (English, German, French etc). Default is
                        English.
  --abbyyBatchSize ABBYYBATCHSIZE
                        Number of ABBYY tasks that are processed in parallel.