A Rails gem to identify and extract tables from PDFs. Toolbox9 teamed up with Kwanso to build it.
The goal was to write a gem that identifies tables in a PDF document and extracts their contents in a structured form. The challenge is that PDF documents don’t have elements representing tables. A page contains only lines and text. The lines are not necessarily straight, and the table is not necessarily a standard grid: it can have merged cells.
Step one was to read the PDF and detect horizontal and vertical lines. Next, we combined the detected segments to build the longest possible horizontal and vertical lines. We combined segments only if they touched each other, i.e. they visually looked like the same line. After that, all vertical lines next to each other, together with all horizontal lines enclosed by those vertical lines, were treated as a single region. The algorithm worked on invisible lines too: labels were treated as boxes within regions, which helped us with tables whose lines are invisible.
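The line-combining step can be sketched roughly as follows. This is only an illustration for horizontal segments, not the gem's actual code: the `HSegment` struct, the `merge_horizontal` method and the `TOLERANCE` value are all assumptions made for this example.

```ruby
# Hypothetical segment type: a horizontal stroke at height y spanning x1..x2.
HSegment = Struct.new(:y, :x1, :x2)

# Assumed tolerance (in PDF points) for deciding that two segments touch.
TOLERANCE = 1.0

# Merge horizontal segments that sit at roughly the same height and whose
# ends touch or overlap, so each visual line becomes one long segment.
def merge_horizontal(segments, tol = TOLERANCE)
  # Bucket by height so slightly crooked pieces of one line land together.
  # (A simplification: two pieces that straddle a bucket boundary are missed.)
  segments.group_by { |s| (s.y / tol).round }.flat_map do |_, row|
    row.sort_by(&:x1).each_with_object([]) do |seg, merged|
      last = merged.last
      if last && seg.x1 <= last.x2 + tol
        # Touches or overlaps the previous segment: extend it.
        last.x2 = [last.x2, seg.x2].max
      else
        # A gap wider than the tolerance starts a new line.
        merged << HSegment.new(seg.y, seg.x1, seg.x2)
      end
    end
  end
end

pieces = [
  HSegment.new(10.0, 0.0, 50.0),
  HSegment.new(10.2, 50.0, 100.0),  # touches the first piece
  HSegment.new(10.0, 120.0, 150.0), # separated by a gap
  HSegment.new(40.0, 0.0, 100.0)    # a different line altogether
]
lines = merge_horizontal(pieces)
```

Vertical segments would be handled the same way with the axes swapped.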
An algorithm intensive project. Totally based on heuristic techniques. Felt like Rick (from Rick and Morty) trying to invent his portal gun.
- ASIM SIKKA, DEVELOPER, KWANSO
For each region, we went through the horizontal lines that span the entire width of the region. We considered two adjacent horizontal lines at a time, since each such pair bounds a row. We then did the same with vertical lines to find the columns within a row.
If a column itself contained rows, we marked it as a sub-region. This helped us identify merged cells.
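A rough sketch of the row-splitting and sub-region checks is shown below. The struct names, the `rows_for` and `sub_region?` methods, and the one-point tolerance are hypothetical and chosen only for illustration; the gem's real heuristics are not described in enough detail to reproduce.

```ruby
# Hypothetical types: a horizontal line, a region bounded by left/right
# x-coordinates, and a row bounded by two y-coordinates.
HLine  = Struct.new(:y, :x1, :x2)
Region = Struct.new(:left, :right, :h_lines)
Row    = Struct.new(:y_start, :y_end)

# Rows are bounded by consecutive pairs of lines spanning the full width.
def rows_for(region, tol = 1.0)
  full_width = region.h_lines.select do |l|
    l.x1 <= region.left + tol && l.x2 >= region.right - tol
  end
  full_width.sort_by(&:y).each_cons(2).map { |a, b| Row.new(a.y, b.y) }
end

# A column is a sub-region if some horizontal line crosses its full width
# strictly inside the row, i.e. the column itself contains rows.
def sub_region?(col_left, col_right, row, h_lines, tol = 1.0)
  h_lines.any? do |l|
    l.y > row.y_start + tol && l.y < row.y_end - tol &&
      l.x1 <= col_left + tol && l.x2 >= col_right - tol
  end
end

region = Region.new(0.0, 100.0, [
  HLine.new(0.0, 0.0, 100.0),
  HLine.new(20.0, 0.0, 100.0),
  HLine.new(30.0, 50.0, 100.0), # partial line: splits only the right column
  HLine.new(40.0, 0.0, 100.0)
])
rows = rows_for(region)
```

Here the second row's right-hand column (x from 50 to 100) would be flagged as a sub-region, while its left-hand column would remain a single cell.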
During this process, we built a tree of cells. This tree was later traversed to produce JSON containing the extracted information. We wrote automated tests to check that the correct regions were marked and the correct information was extracted.
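As a minimal sketch of the traversal, assume each node is either a leaf cell holding text or a sub-region holding rows of child cells. The `CellNode` struct, the `cell_to_hash` method and the JSON shape are assumptions for this example; the gem's real output format is not specified here.

```ruby
require "json"

# Hypothetical tree node: a leaf carries text, a sub-region carries rows,
# where each row is an array of child nodes.
CellNode = Struct.new(:text, :rows) do
  def leaf?
    rows.nil? || rows.empty?
  end
end

# Depth-first traversal that turns the cell tree into nested hashes.
def cell_to_hash(node)
  return { "text" => node.text } if node.leaf?

  { "rows" => node.rows.map { |row| row.map { |cell| cell_to_hash(cell) } } }
end

# Sample data invented for the example: the last cell is a sub-region.
table = CellNode.new(nil, [
  [CellNode.new("Name"), CellNode.new("Phone")],
  [CellNode.new("Ada"),
   CellNode.new(nil, [[CellNode.new("home")], [CellNode.new("work")]])]
])

json = JSON.pretty_generate(cell_to_hash(table))
puts json
```

Representing a merged or split cell as a nested `rows` entry keeps the JSON the same shape at every level, so a consumer can walk it with the same recursion.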
More than 96% accuracy in extracting tabular data