emiruz/textextract
textextract is a tiny library (87 lines of Go) that identifies where the article content is in an HTML page (as opposed to navigation, headers, footers, ads, etc.), extracts it, and returns it as a string. It is similar to Boilerpipe, but written in Go for Go programs.
GitHub repository with 11 stars and 2 forks.
Language: Go
Topics: text-mining, nlp
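
To illustrate the general idea of separating article text from navigation and other boilerplate, here is a minimal sketch that uses only the Go standard library. It is not textextract's actual code or API: the function name `ExtractText`, its parameters, and the word-count and link-density thresholds are all assumptions made for this example, loosely following the shallow text features that Boilerpipe-style extractors rely on. It splits the page at block-level tags and keeps blocks that have enough words and few of them inside links.

```go
package main

import (
	"fmt"
	"html"
	"regexp"
	"strings"
)

var (
	// Script and style elements are dropped together with their contents.
	noiseRe = regexp.MustCompile(`(?is)<script\b.*?</script\s*>|<style\b.*?</style\s*>`)
	// Block-level tags serve as split points between candidate text blocks.
	blockRe = regexp.MustCompile(`(?i)</?(?:p|div|li|ul|ol|td|tr|table|h[1-6]|br|nav|header|footer|section|article|aside)\b[^>]*>`)
	// Anchor elements, used to measure how much of a block is link text.
	linkRe = regexp.MustCompile(`(?is)<a\b[^>]*>(.*?)</a\s*>`)
	// Any remaining tag.
	tagRe = regexp.MustCompile(`(?s)<[^>]*>`)
)

// ExtractText is a hypothetical helper (not part of textextract).
// It keeps blocks with at least minWords words whose share of words
// inside links does not exceed maxLinkDensity.
func ExtractText(page string, minWords int, maxLinkDensity float64) string {
	page = noiseRe.ReplaceAllString(page, " ")
	var kept []string
	for _, block := range blockRe.Split(page, -1) {
		// Count the words that appear inside <a> elements.
		linkWords := 0
		for _, m := range linkRe.FindAllStringSubmatch(block, -1) {
			linkWords += len(strings.Fields(tagRe.ReplaceAllString(m[1], " ")))
		}
		// Strip tags and collapse whitespace to get the block's plain text.
		text := strings.Join(strings.Fields(tagRe.ReplaceAllString(block, " ")), " ")
		words := len(strings.Fields(text))
		if words == 0 || words < minWords {
			continue // too short to be article content
		}
		if float64(linkWords)/float64(words) > maxLinkDensity {
			continue // mostly links, likely navigation
		}
		kept = append(kept, html.UnescapeString(text))
	}
	return strings.Join(kept, "\n")
}

func main() {
	page := `<html><body>
<nav><a href="/">Home</a> <a href="/about">About us</a> <a href="/contact">Contact</a></nav>
<div><p>The quick brown fox jumps over the lazy dog near the quiet river bank today.</p></div>
<footer>Copyright 2020</footer>
</body></html>`
	// Thresholds are illustrative and would need tuning on real pages.
	fmt.Println(ExtractText(page, 8, 0.3))
}
```

A regular-expression approach like this is fragile on malformed markup; a real extractor would more likely walk a parsed DOM, which is outside the standard library.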