What is Purifyr
Purifyr can remove 95% of the noise from web pages, leaving the content ready for further semantic processing and information retrieval tasks.
Give it a try
Some demos: WSJ | Reuters | Guardian | USA Today | BBC | Bloomberg | ReadWriteWeb | VentureBeat | Mashable | Ars Technica | Inc. | ZDNet | CNN | New Yorker
Performance benchmark
- Processing speed: Processing a headline link from Google News takes about 0.086 sec on average per CPU core. On a 16-core server, it takes about 0.0065 sec per link.
- Precision ratio: The cleaning ratio and the retain ratio are both 95% for most websites. Cleaning ratio means how much 'noise' on the web page has been removed, while retain ratio means how much 'content' has been kept in the final result.
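
To make the two ratios concrete, here is a minimal sketch of how they could be computed for a single page. It is not Purifyr's own measurement code: the function name `cleaning_and_retain_ratio`, the block labels, and the choice to count whole text blocks (rather than characters or words) are all assumptions made for illustration.

```python
def cleaning_and_retain_ratio(page_blocks, content_blocks, kept_blocks):
    """Return (cleaning_ratio, retain_ratio) for one page.

    Hypothetical helper, counting at block granularity (an assumption):
      page_blocks    -- every text block found on the original page
      content_blocks -- blocks a human labeled as real content
      kept_blocks    -- blocks that survived cleaning
    """
    page = set(page_blocks)
    content = set(content_blocks) & page
    kept = set(kept_blocks) & page

    noise = page - content          # everything that is not content
    removed_noise = noise - kept    # noise the cleaner dropped
    retained = content & kept       # content the cleaner kept

    # Guard against empty sets: a page with no noise is trivially fully cleaned.
    cleaning = len(removed_noise) / len(noise) if noise else 1.0
    retain = len(retained) / len(content) if content else 1.0
    return cleaning, retain


# Toy page: three noise blocks and three content blocks.
page = ["nav", "ad", "footer", "headline", "para1", "para2"]
content = ["headline", "para1", "para2"]
kept = ["headline", "para1", "ad"]  # one ad slipped through, one paragraph was lost

cleaning, retain = cleaning_and_retain_ratio(page, content, kept)
print(f"cleaning ratio: {cleaning:.2f}, retain ratio: {retain:.2f}")
```

In this toy case two of the three noise blocks are removed and two of the three content blocks are kept, so both ratios come out at about 0.67. Reporting the two numbers separately matters because a cleaner can score well on one by sacrificing the other, for example by deleting everything.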
