TL;DR: Build the next-generation open source Python Newspaper framework: https://newspaper.readthedocs.io
Delivery time: 60 days
Gigs Completed: 0
Price: Ξ 1.414
I am looking for a student who needs a thesis project for their university and wants to do it with an outside company.
I have a commercial interest in this technology moving forward. The current Python author-detection libraries are bad, getting the author wrong 70% of the time, which is why I am funding an open source project.
1. Extract the author name
2. Detect whether an author name is present
3. Extract the bounds of the article (i.e. get the text of the article without any of the comments or sidebars around it)
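To make features 1 and 2 concrete, here is a minimal baseline sketch that reads the byline from HTML `<meta>` tags using only the Python standard library. It is an illustration, not how Newspaper or any other library does it: the names `extract_author`, `has_author` and the `AUTHOR_META_KEYS` set are hypothetical, and real pages often put the byline in the body instead, which is exactly where the hard part of this project lies.

```python
from html.parser import HTMLParser
from typing import Optional

# Hypothetical list of meta keys that often carry a byline (an assumption, not a standard).
AUTHOR_META_KEYS = {"author", "article:author", "parsely-author", "sailthru.author"}


class _MetaAuthorParser(HTMLParser):
    """Remembers the first non-empty author-like <meta> value it sees."""

    def __init__(self) -> None:
        super().__init__()
        self.author: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.author is not None:
            return
        attr = {key: (value or "") for key, value in attrs}
        # Some sites use name="author", others property="article:author".
        meta_key = (attr.get("name") or attr.get("property") or "").lower()
        content = attr.get("content", "").strip()
        if meta_key in AUTHOR_META_KEYS and content:
            self.author = content


def extract_author(html: str) -> Optional[str]:
    """Feature 1: return the author string, or None if no byline was found."""
    parser = _MetaAuthorParser()
    parser.feed(html)
    return parser.author


def has_author(html: str) -> bool:
    """Feature 2: report whether any author name is present."""
    return extract_author(html) is not None
```

Returning `None` rather than an empty string keeps "no author on this page" distinct from "author found", so the two features can be scored separately.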
Some related publications on author extraction:
We will test the quality of your new library by comparing its output against hand-labeled ground truth data and against the leading paid API: https://www.diffbot.com/dev/docs/article/
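One possible way to score author extraction against hand-labeled ground truth is plain per-article accuracy. The sketch below assumes the labels are a mapping from URL to author name (or `None` when the page has no author). The function names, the data layout, and the normalization rule are all assumptions made for illustration, not the actual test harness.

```python
from typing import Dict, Optional


def normalize_name(name: Optional[str]) -> str:
    """Lowercase and collapse whitespace so trivial differences are not counted as errors."""
    return " ".join((name or "").lower().split())


def author_accuracy(
    predicted: Dict[str, Optional[str]],
    truth: Dict[str, Optional[str]],
) -> float:
    """Fraction of labeled URLs whose predicted author matches the label.

    A label of None means "no author present", so predicting None there is correct.
    URLs missing from `predicted` are treated as a prediction of None.
    """
    if not truth:
        raise ValueError("ground truth is empty")
    correct = sum(
        normalize_name(predicted.get(url)) == normalize_name(label)
        for url, label in truth.items()
    )
    return correct / len(truth)
```

The same function can be run on the output of the new library and on the output of a paid API over the same labeled set, which gives directly comparable numbers.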
Secondary Optional Features:
4. Detect the article's publication timestamp
5. Detect the title of the article
6. Detect language
7. Get Keywords
8. Summarize article
9. Get the list of outbound links inside the article text
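As an illustration of feature 9, the sketch below collects outbound links from an article's HTML fragment using only the standard library. It assumes the article bounds (feature 3) have already been found, and it treats "outbound" as "http(s) link to a different host than the page", which is one possible definition among several. The name `outbound_links` is hypothetical.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin, urlparse


class _LinkParser(HTMLParser):
    """Collects the href of every <a> tag in document order."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def outbound_links(article_html: str, page_url: str) -> List[str]:
    """Return unique http(s) links that point to a different host than page_url."""
    parser = _LinkParser()
    parser.feed(article_html)
    page_host = urlparse(page_url).netloc.lower()
    seen = set()
    result: List[str] = []
    for href in parser.hrefs:
        absolute = urljoin(page_url, href)  # resolve relative links
        parts = urlparse(absolute)
        if parts.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, etc.
        if parts.netloc.lower() == page_host:
            continue  # same-site link, not outbound
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result
```

Comparing full hostnames means a link from `www.example.com` to `blog.example.com` counts as outbound; whether subdomains should be grouped is a design decision left open here.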