Table scraping from PDF
Project details
We have a number of PDF files (similar in format) and would like to extract data from a table that appears on a page in each document. The page differs by document depending on what else the document contains (it's typically between pages 3 and 5), but there are some identifying characteristics. We will also need to capture the timestamp from the first page.
Data should end up in a pandas dataframe, and the timestamp from the document should appear in each row. One field may have blanks; these should be populated with the value from the preceding row.
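The forward-fill and timestamp requirements above can be sketched with pandas. This is a minimal illustration, not the final extraction code: the column names (`item`, `qty`) and the timestamp value are hypothetical placeholders; the real values would come from the PDF.

```python
import pandas as pd

# Hypothetical extracted rows: one column has blanks that should
# inherit the value from the preceding row (forward fill).
rows = [
    {"item": "A", "qty": 10},
    {"item": None, "qty": 12},  # blank 'item' -> takes "A" from the row above
    {"item": "B", "qty": 7},
]
df = pd.DataFrame(rows)

# Forward-fill the sparse column.
df["item"] = df["item"].ffill()

# Stamp every row with the timestamp captured from page 1.
timestamp = "2024-01-15 09:30"  # placeholder; real value is parsed from the PDF
df["timestamp"] = timestamp
```

`ffill()` only fills gaps that follow a populated row, which matches the "value from the preceding row" requirement.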
New documents will be placed in an "unprocessed" folder, between 1 and 40 documents at a time; once the code has run, they should be moved by the application to a "processed" folder.
We will need to rerun this as new documents are created.
The solution should use only open-source components. Our preferred IDE is PyCharm, and we'll need to agree on any modules used in the solution. We will add the code to push the extracted data to our database.
The code will be run manually on Windows initially but may be operationalized in a Docker container.
Three to four new documents are created per day; it currently takes 30 minutes weekly to manually extract the data, and the purpose of this project is to save that time.
Awarded to:

M. Ali Masyhur K.
(4.7)