This page (revision-61) was last changed on 19-Sep-2022 11:15 by Arnab Ghosh Chowdhury

This page was created on 10-May-2022 15:30 by Arnab Ghosh Chowdhury

Page revision history

|| Version || Date Modified || Size || Author || Changes ... || Change note
| 61 | 19-Sep-2022 11:15 | 5 KB | Arnab Ghosh Chowdhury | to previous |

At line 1 changed one line
[{Image src='Data Extractor/MDE_1_Home.png' width=600}]
[{Image src='Data Extractor/Di-Plast_IndexPage.PNG' width=600}]
At line 8 changed one line
__%%( color: #003399; font-size: 16px;)Type of tool:__ Web application to be deployed on a computer running a Linux operating system.
__%%( color: #003399; font-size: 16px;)Type of tool:__ Web application
At line 10 changed one line
__%%( color: #003399; font-size: 16px;)Short description of the tool: __ Extracts tabular and textual data from product technical datasheets (PDF documents)
__%%( color: #003399; font-size: 16px;)Short description of the tool: __ Extracts tabular data from product technical datasheets (PDF documents)
At line 12 changed one line
Matrix Data Extractor (MDE) is a web-based application that can be deployed on your computer. It identifies table regions in PDF documents using computer-vision-based deep learning, specifically transfer learning and an object detection algorithm. It extracts all textual data into text files by applying Optical Character Recognition (OCR) and extracts tabular data separately into Excel files using the Camelot Python package. It can also transfer manufacturer names and the corresponding technical datasheet names (PDF filenames) to a MongoDB database for further processing.
Matrix Data Extractor (MDE) is a web-based application that identifies table regions in PDF documents using a computer-vision-based deep learning algorithm and extracts data to text files by applying Optical Character Recognition (OCR). It can transfer the extracted data to MongoDB database tables. A search function is also provided to retrieve data in the user interface by keyword matching (e.g. manufacturer name, technical datasheet name, keyword for table data).
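The keyword-based search described above can be illustrated with a minimal sketch. It is not the tool's actual implementation: the record fields (manufacturer, datasheet, table_text), the search function, and the sample data are all hypothetical, and a plain in-memory list stands in for the MongoDB database.

```python
from dataclasses import dataclass


@dataclass
class DatasheetRecord:
    """One extracted datasheet. All field names are hypothetical."""
    manufacturer: str
    datasheet: str        # PDF filename of the technical datasheet
    table_text: str = ""  # text taken from the detected table regions


def search(records, keyword):
    """Return the records in which the keyword occurs in any field.

    Matching is a case-insensitive substring test; an empty or
    whitespace-only keyword matches nothing.
    """
    needle = keyword.strip().lower()
    if not needle:
        return []
    return [
        r for r in records
        if needle in r.manufacturer.lower()
        or needle in r.datasheet.lower()
        or needle in r.table_text.lower()
    ]


# Made-up sample data, standing in for records stored in the database.
records = [
    DatasheetRecord("Acme Polymers", "acme_pp_100.pdf", "Density 0.905 g/cm3"),
    DatasheetRecord("Example Plastics", "ex_hdpe_20.pdf", "Melt flow rate 2.0 g/10 min"),
]

hits = search(records, "melt flow")
print([r.datasheet for r in hits])
```

The same function serves all three search types mentioned above (manufacturer name, datasheet name, table keyword), because it checks every field of a record. With a real database, the substring test would typically be replaced by a query on the corresponding fields.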
At line 14 removed one line
At line 16 changed 3 lines
- Linux OS (operating system)\\
- Elementary (Normal) User: No programming\\
- Advanced User: Python, Basic Deep Learning (PyTorch), Shell scripting\\
- Elementary User: No programming\\
- Advanced User: Python, Basic Deep Learning (PyTorch)\\
At line 20 changed one line
\\__%%( color: #003399; font-size: 16px;)Required programs %%( color: #003000; font-size: 14px;)(step-by-step guide and links provided in GitHub and user guideline below): __
\\__%%( color: #003399; font-size: 16px;)Required programs %%( color: #003000; font-size: 14px;)(step-by-step guide and links provided in user guideline below): __
At line 22 changed one line
\\- Shell script
\\- Java
At line 24 changed 2 lines
\\- Linux
\\- Code from GitHub ([https://github.com/cslab-hub/MatrixDataExtractor])
\\- Tool files from GitHub (link below)
At line 28 changed one line
No support for providing a table detection model will be available after project completion. The accuracy of the table detection model depends on factors such as the volume and variety of the annotated datasets and the model's hyperparameters. You can run your own experiments to improve the accuracy of your table detection model. To obtain the table detection model weights trained on the Di-Plast dataset, you can send a request to the Semantic Information Systems Research Group, Osnabrueck University, Osnabrueck, Germany ( [https://www.informatik.uni-osnabrueck.de/arbeitsgruppen/semantische_informationssysteme.html|https://www.informatik.uni-osnabrueck.de/arbeitsgruppen/semantische_informationssysteme.html] ).
No support for providing a table detection model will be available after project completion. The accuracy of the table detection model depends on factors such as the volume and variety of the annotated datasets and the model's hyperparameters. You can run your own experiments to improve the accuracy of your table detection model.
At line 30 removed one line
At line 33 changed 6 lines
[Data Extractor/MDE_Home.png]
[Data Extractor/MDE_SyncData.png]
[Data Extractor/MDE_DataInfoExt_1.png]
[Data Extractor/MDE_DataInfoExt_2.png]
[Data Extractor/MDE_DataInfoExt_3.png]
[Data Extractor/MDE_TableDataExt.png]
At line 42 changed 2 lines
\\ - Extract textual and tabular information from PDF documents.
\\ - ⚠️ For a brief overview of the tool, we recommend opening and saving the presentation before proceeding: [Data Extractor/Di-Plast_MDE_UI.pdf]
- (Text)
At line 35 added one line
At line 46 changed one line
\\ - Open-source document table detection tools are not adequate for extracting tabular information from PDF documents across all possible document and table templates. Because these templates are so diverse, document table detection based on computer vision and transfer learning has gained significant traction. This tool helps to extract textual and tabular data (in Excel files) from your domain-specific dataset. The extracted data can be used with Big Data technologies and Natural Language Processing (NLP).
- (Text)
At line 50 changed one line
\\ - Get the code from GitHub [https://github.com/cslab-hub/MatrixDataExtractor], copy it onto your computer, prepare your annotated dataset, build or request the table detection model weights and model description file, and start using the tool\\
\\ - The tool can be accessed through the following link: [https://share.streamlit.io/cslab-hub/data_validation/main/main.py]
\\- Get the code/installation files from GitHub [https://cslab-hub-data-validation-main-bx6ggw.streamlitapp.com/] and start using the app by browsing through the pages.
At line 45 added 4 lines
Get the code from GitHub [https://github.com/cslab-hub/MatrixDataExtractor], copy
it onto your computer, prepare your annotated dataset, and start using it\\