
[python] Extracting Text from a PDF (Extract elements from a PDF using Python)

by 병진들 2021. 3. 10.

Library Name

pdfminer.six 

 

Document | Source

https://pdfminersix.readthedocs.io/en/latest/index.html

 

How to Install

# pip install pdfminer.six

 

1. Extract everything, including the PDF layout elements

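from pdfminer.high_level import extract_pages

# extract_pages yields one layout object per page
for page_layout in extract_pages("test.pdf"):
    # Iterating over a page yields its top-level layout elements
    for element in page_layout:
        print(element)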

The element types and the layout analysis algorithm that classifies them are described in the pdfminer.six documentation.
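
To get a quick feel for which element types a PDF actually contains, the loop above can be turned into a counter. This is only a minimal sketch: the helper names `count_element_types` and `count_pdf_element_types` are made up for this example and are not part of pdfminer.six, and the class names you see will depend on the PDF.

```python
from collections import Counter


def count_element_types(pages):
    """Count layout elements by class name across an iterable of pages.

    `pages` is anything that yields iterable page layouts, for example
    the result of pdfminer's extract_pages(). (Hypothetical helper.)
    """
    counts = Counter()
    for page_layout in pages:
        for element in page_layout:
            counts[type(element).__name__] += 1
    return counts


def count_pdf_element_types(pdf_path):
    """Run the counter on a real PDF file (requires pdfminer.six)."""
    # Imported here so the helper above can be tried without pdfminer installed
    from pdfminer.high_level import extract_pages

    return count_element_types(extract_pages(pdf_path))
```

Calling `count_pdf_element_types("test.pdf")` returns a `Counter` keyed by class name, which makes it easy to see whether a file is mostly text boxes, figures, or drawing primitives.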

 

2. Extract text only

from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer

for page_layout in extract_pages("test.pdf"):
    for element in page_layout:
        # Keep only elements that carry text (skips figures, lines, etc.)
        if isinstance(element, LTTextContainer):
            print(element.get_text())
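
If you want the text as data rather than printed output, the same filter can collect one string per page. This is a rough sketch under a few assumptions: `collect_page_texts` and `pdf_page_texts` are hypothetical names invented here, and the text class is passed in as a parameter only so the helper can be tried without a PDF at hand.

```python
def collect_page_texts(pages, text_class):
    """Return one string per page, joining the text of every text element.

    `text_class` is the class used to recognise text elements, e.g.
    pdfminer's LTTextContainer. (Hypothetical helper, not a pdfminer API.)
    """
    texts = []
    for page_layout in pages:
        parts = [
            element.get_text()
            for element in page_layout
            if isinstance(element, text_class)
        ]
        texts.append("".join(parts))
    return texts


def pdf_page_texts(pdf_path):
    """Collect per-page text from a real PDF file (requires pdfminer.six)."""
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    return collect_page_texts(extract_pages(pdf_path), LTTextContainer)
```

`pdf_page_texts("test.pdf")` would then give a list with one entry per page. When only the full text of the document is needed, pdfminer.six also ships `pdfminer.high_level.extract_text`, which returns everything as a single string.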

 

 
