Harvest News Like a Pro: Introducing News-Please, Your Open-Source Solution for News Extraction and Archiving
news-please is an open-source news crawler that extracts structured information from news websites. It uses libraries like scrapy, Newspaper, and readability, and can follow internal hyperlinks and read RSS feeds to fetch both recent and archived articles.
It also features a library mode for Python developers and can extract articles from the large news archive at commoncrawl.org.
Features
- works out of the box: install with pip, add URLs of your pages, run
- run news-please conveniently using its CLI mode
- use it as a library within your own software
- extract articles from commoncrawl.org's news archive
- store extracted results in JSON files, PostgreSQL, Elasticsearch, or your own storage
- simple but extensive configuration (if you want to tweak the results)
- revisions: crawl articles multiple times and track changes
- library mode: crawl and extract information from a list of article URLs within your own Python code
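
As a rough illustration of the library mode mentioned above, the sketch below fetches a list of article URLs and collects their headlines. It assumes the `NewsPlease.from_urls` entry point of the `newsplease` package, which is expected to return a dict mapping each URL to an extracted article. The functions `fetch_articles` and `headlines_by_url` are hypothetical names used only for this example, and the `None` handling is a defensive assumption rather than documented behavior.

```python
from typing import Any, Dict, Iterable


def fetch_articles(urls: Iterable[str]) -> Dict[str, Any]:
    """Download and extract several articles (needs network access)."""
    # Imported lazily so the helper below works without news-please installed.
    from newsplease import NewsPlease

    # Assumption: from_urls returns a dict that maps each URL to its article.
    return NewsPlease.from_urls(list(urls))


def headlines_by_url(articles: Dict[str, Any]) -> Dict[str, str]:
    """Hypothetical helper: keep only the URLs whose article has a headline."""
    headlines = {}
    for url, article in articles.items():
        # Tolerate entries that are None or that lack a title.
        title = getattr(article, "title", None)
        if title:
            headlines[url] = title.strip()
    return headlines
```

A typical call would be `headlines_by_url(fetch_articles(my_urls))`, where `my_urls` is your own list of article URLs. Keeping the download step separate from the post-processing step makes the second function easy to test without network access.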
Extracted information
news-please extracts the following attributes from news articles. An example JSON file as extracted by news-please can be found here.
- headline
- lead paragraph
- main text
- main image
- name(s) of author(s)
- publication date
- language
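
To show how the attributes above might be read back from a stored JSON result, here is a minimal sketch. The field names in `FIELD_NAMES` (`title`, `description`, `maintext`, `image_url`, `authors`, `date_publish`, `language`) are an assumption about the output format and should be checked against a file produced by your installed version. `pick_attributes` is a hypothetical helper, not part of news-please.

```python
from typing import Any, Dict, Optional

# Assumed mapping from the attributes listed above to JSON field names.
FIELD_NAMES = {
    "headline": "title",
    "lead paragraph": "description",
    "main text": "maintext",
    "main image": "image_url",
    "name(s) of author(s)": "authors",
    "publication date": "date_publish",
    "language": "language",
}


def pick_attributes(record: Dict[str, Any]) -> Dict[str, Optional[Any]]:
    """Select the listed attributes from one decoded JSON record.

    Fields that are absent from the record are returned as None.
    """
    return {label: record.get(field) for label, field in FIELD_NAMES.items()}
```

Given a record loaded with `json.load`, `pick_attributes(record)` returns a dict keyed by the attribute labels used in this section, which can help when checking which fields were actually extracted for a given article.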
Install
$ pip3 install news-please
License
Apache-2.0 License