news article headlines

@kiwis Apr 27, 2021

I’m pulling news from several sources via RSS feeds and storing them in my database so I can display them in various places.

I’m getting the article URL, source name, headline and sometimes an intro paragraph (although not all feeds provide one).

The issue I have is that some news outlets post the same stories as each other (it’s a specific topic). Obviously I don’t want to store and display the same article 2 or 3 times. It’s annoying, and it looks unprofessional and lazy too.

The obvious thing to do when processing a node in my RSS feed is to check whether the headline is already in my database, perhaps within the last week. If it is, I reject it or send it off for manual review as a possible duplicate; if not, I just insert it.

The issue is that some providers adjust the headline. For example (made up here, of course):

CNN: “Man walks on Mars”
ABC: “BREAKING: Man walks on Mars”
NBC: “Space: Man walks on Mars”
KTLA: “NASA Astronaut takes first steps on Mars”

It’s impossible to predict the prefix, and sometimes the headline is completely different, like the KTLA one.

My question is:

1) How can I match the strings based on similarity, something like:

a) 4 consecutive words
b) 80% of the words being consecutive (4 out of 5)
c) 80% match, not considering order

2) How else would I establish if two stories are on the exact same topic?
a) Could I score them somehow, so that anything above a certain score gets sent for review?


PHP

@NogDog Apr 28, 2021

The "KTLA" case might require adventures in machine learning and a large sample size. ;)

For the others, you might get something at least somewhat usable by leveraging PHP's [similar_text()](https://php.net/similar_text) function, e.g.:

[code=php]
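<?php

// Sample headlines keyed by source (the made-up examples from the question).
$data = [
    'CNN'  => "Man walks on Mars",
    'ABC'  => "BREAKING: Man walks on Mars",
    'NBC'  => "Space: Man walks on Mars",
    'KTLA' => "NASA Astronaut takes first steps on Mars",
];

// Compare every headline against every other headline.
foreach ($data as $source => $headline) {
    foreach ($data as $source2 => $headline2) {
        if ($source == $source2) {
            continue; // skip comparing a headline with itself
        }
        // similar_text() returns the number of matching characters and
        // writes the similarity percentage into $percent (passed by reference).
        similar_text($headline, $headline2, $percent);
        echo "Similarity of\n  $headline\n  $headline2\n    $percent%\n\n";
    }
}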
 [/code]

Output:

[code=text]
 Similarity of
 Man walks on Mars
 BREAKING: Man walks on Mars
 77.272727272727%
 
 Similarity of
 Man walks on Mars
 Space: Man walks on Mars
 82.926829268293%
 
 Similarity of
 Man walks on Mars
 NASA Astronaut takes first steps on Mars
 45.614035087719%
 
 Similarity of
 BREAKING: Man walks on Mars
 Man walks on Mars
 77.272727272727%
 
 Similarity of
 BREAKING: Man walks on Mars
 Space: Man walks on Mars
 74.509803921569%
 
 Similarity of
 BREAKING: Man walks on Mars
 NASA Astronaut takes first steps on Mars
 44.776119402985%
 
 Similarity of
 Space: Man walks on Mars
 Man walks on Mars
 82.926829268293%
 
 Similarity of
 Space: Man walks on Mars
 BREAKING: Man walks on Mars
 74.509803921569%
 
 Similarity of
 Space: Man walks on Mars
 NASA Astronaut takes first steps on Mars
 34.375%
 
 Similarity of
 NASA Astronaut takes first steps on Mars
 Man walks on Mars
 42.105263157895%
 
 Similarity of
 NASA Astronaut takes first steps on Mars
 BREAKING: Man walks on Mars
 41.791044776119%
 
 Similarity of
 NASA Astronaut takes first steps on Mars
 Space: Man walks on Mars
 43.75%
 [/code]
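For option (c) in the question (an 80% match not considering order), a word-level score may be easier to reason about than a character-level one. The following is only a rough sketch under a few assumptions: `headlineWords()` and `wordOverlap()` are hypothetical helper names, the 0.8 cut-off is just the figure from the question, and the normalisation only handles plain ASCII words.

```php
<?php

// Hypothetical helper: lower-case a headline and return its unique words.
function headlineWords(string $headline): array
{
    // Replace anything that is not a-z, 0-9 or a space, so "BREAKING:" becomes "breaking".
    $clean = preg_replace('/[^a-z0-9 ]+/', ' ', strtolower($headline));
    $words = preg_split('/\s+/', trim($clean), -1, PREG_SPLIT_NO_EMPTY);
    return array_values(array_unique($words));
}

// Hypothetical score: the share of the shorter headline's words that also
// appear in the longer one, from 0.0 to 1.0. Word order is ignored.
function wordOverlap(string $a, string $b): float
{
    $wordsA = headlineWords($a);
    $wordsB = headlineWords($b);
    $smaller = min(count($wordsA), count($wordsB));
    if ($smaller === 0) {
        return 0.0; // nothing to compare
    }
    return count(array_intersect($wordsA, $wordsB)) / $smaller;
}

$reviewThreshold = 0.8; // assumed cut-off, tune it against real headlines

$existing = "Man walks on Mars";
$incoming = [
    "BREAKING: Man walks on Mars",
    "Space: Man walks on Mars",
    "NASA Astronaut takes first steps on Mars",
];

foreach ($incoming as $headline) {
    $score = wordOverlap($existing, $headline);
    $action = $score >= $reviewThreshold ? 'review' : 'insert';
    printf("%-42s %.2f  %s\n", $headline, $score, $action);
}
```

With the sample headlines, the two prefixed variants score 1.00 and would go to review, while the KTLA headline scores 0.50 and would be inserted, so it still slips through. Because very short headlines made of common words can over-match, sending hits to manual review is probably safer than rejecting them outright.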
