apache/tika
The Apache Tika toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF).
GitHub repository with 3,792 stars and 935 forks.
Language: Java
Topics: java, tika, metadata, extraction, content