
news article headlines

I’m pulling news from several sources via RSS feeds and storing them into my database so I can display them in various places.

I’m getting the article URL, source name, headline and sometimes an intro paragraph (although not all feeds include one).

The issue I have is that some news outlets post the same stories as each other (the feeds all cover one specific topic). Obviously I don’t want to store and display the same article 2 or 3 times. It’s annoying, and it looks unprofessional and lazy too.

The obvious thing to do when processing a node in my RSS feed is to check whether the headline is already in my database, perhaps within the last week. If it is, I reject it or send it off for manual review as a possible duplicate; if not, I just insert it.

The issue is that some providers adjust the headline. For example (made up here, of course):

CNN: “Man walks on Mars”
ABC: “BREAKING: Man walks on Mars”
NBC: “Space: Man walks on Mars”
KTLA: “NASA Astronaut takes first steps on Mars”

It’s impossible to predict the prefix, and sometimes the headline is completely different, like the KTLA one.

My question is:

1) How can I match the string based on similarity, something like

a) 4 consecutive words
b) 80% of the words being consecutive (4 out of 5)
c) 80% match not considering order

2) How else would I establish if two stories are on the exact same topic?
a) Could I score them somehow, so that anything above a certain score gets sent for review?

PHP

1 Comment

@NogDog, Apr 28, 2021: The "KTLA" case might require adventures in machine learning and a large sample size. ;)

For the others, you might get something at least somewhat usable by leveraging PHP's [similar_text()](https://php.net/similar_text) function, e.g.:
[code=php]
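<?php

// Sample headlines keyed by source (made-up data from the question).
$data = [
    'CNN'  => "Man walks on Mars",
    'ABC'  => "BREAKING: Man walks on Mars",
    'NBC'  => "Space: Man walks on Mars",
    'KTLA' => "NASA Astronaut takes first steps on Mars",
];

// Compare every headline against every other headline.
foreach ($data as $source => $headline) {
    foreach ($data as $source2 => $headline2) {
        if ($source === $source2) {
            continue; // skip comparing a headline with itself
        }
        // similar_text() writes the similarity percentage into $percent.
        similar_text($headline, $headline2, $percent);
        echo "Similarity of\n $headline\n $headline2\n $percent%\n\n";
    }
}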
[/code]

Output:
[code=text]
Similarity of
Man walks on Mars
BREAKING: Man walks on Mars
77.272727272727%

Similarity of
Man walks on Mars
Space: Man walks on Mars
82.926829268293%

Similarity of
Man walks on Mars
NASA Astronaut takes first steps on Mars
45.614035087719%

Similarity of
BREAKING: Man walks on Mars
Man walks on Mars
77.272727272727%

Similarity of
BREAKING: Man walks on Mars
Space: Man walks on Mars
74.509803921569%

Similarity of
BREAKING: Man walks on Mars
NASA Astronaut takes first steps on Mars
44.776119402985%

Similarity of
Space: Man walks on Mars
Man walks on Mars
82.926829268293%

Similarity of
Space: Man walks on Mars
BREAKING: Man walks on Mars
74.509803921569%

Similarity of
Space: Man walks on Mars
NASA Astronaut takes first steps on Mars
34.375%

Similarity of
NASA Astronaut takes first steps on Mars
Man walks on Mars
42.105263157895%

Similarity of
NASA Astronaut takes first steps on Mars
BREAKING: Man walks on Mars
41.791044776119%

Similarity of
NASA Astronaut takes first steps on Mars
Space: Man walks on Mars
43.75%
[/code]
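
For the "80% match not considering order" idea in the question, a word-based score is another option alongside `similar_text()`. The following is only a rough sketch, not a tested solution: the function names `headlineWords()`, `wordOverlap()` and `needsReview()` are made up for illustration, and the 0.8 cut-off is simply the 80% figure from the question. It divides the number of shared words by the word count of the shorter headline, so an added prefix such as "BREAKING:" does not pull the score down.

```php
<?php

// Hypothetical helper: lower-case a headline and split it into unique words.
function headlineWords(string $headline): array
{
    $words = preg_split('/[^a-z0-9]+/', strtolower($headline), -1, PREG_SPLIT_NO_EMPTY);
    return array_values(array_unique($words));
}

// Hypothetical score: shared words divided by the size of the smaller word set
// (often called the overlap coefficient). Word order is ignored.
function wordOverlap(string $a, string $b): float
{
    $wordsA = headlineWords($a);
    $wordsB = headlineWords($b);
    if (!$wordsA || !$wordsB) {
        return 0.0; // nothing to compare
    }
    $shared = count(array_intersect($wordsA, $wordsB));
    return $shared / min(count($wordsA), count($wordsB));
}

// Assumed cut-off, taken from the 80% figure in the question; tune it on real data.
const REVIEW_THRESHOLD = 0.8;

// Flag a pair of headlines as a possible duplicate for manual review.
function needsReview(string $a, string $b): bool
{
    return wordOverlap($a, $b) >= REVIEW_THRESHOLD;
}
```

With the sample headlines, the CNN/ABC pair shares all four of CNN's words, while the CNN/KTLA pair shares only "on" and "Mars". Common words like "on" inflate the score on short headlines, so dropping stop words or combining this score with the `similar_text()` percentage may be worth trying. The KTLA-style rewrite still will not be caught this way.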