
Can I develop my own standalone search engine?

I have about 80 old school magazines in searchable pdf files. I want to put these on a disc and then add a search facility specifically for the files on the disc.

I do not want to use an external facility, although I would like it to look like those on offer.

A very long-winded way would be to use the Windows Search facility to locate the files and then search each file it finds individually.

Any suggestions? Is this the right forum?


2 Comments

@sohguanh (May 14, 2010):

Do you want to search for a particular PDF file by name, or do you want to search for keywords within each PDF file?

Searching for a particular PDF file should not be very difficult, though the details depend on the computer language and platform you choose.
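
As a rough illustration of the file-name case, here is a minimal Python sketch using only the standard library. The function name `find_pdfs_by_name` and the example folder are my own placeholders, not something from your disc, and it only matches on file names, not on what is inside the PDFs.

```python
from pathlib import Path


def find_pdfs_by_name(root, term):
    """Return PDF paths under `root` whose file name contains `term`.

    The match is case-insensitive and looks only at file names,
    never at the contents of the PDFs.
    """
    term = term.lower()
    return sorted(
        p
        for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() == ".pdf" and term in p.name.lower()
    )


if __name__ == "__main__":
    # "magazines" is a hypothetical folder name; point it at the disc instead.
    for path in find_pdfs_by_name("magazines", "1975"):
        print(path)
```

This is fine when the file names already say which issue is which. It does nothing for finding a word inside an issue.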

Searching for keywords within each PDF file is trickier, but I have explored an index/search engine from Apache called Lucene.

Unfortunately, Lucene does not come with text-extraction features. You may need the sub-project Tika to help you.

Step 1

Use Tika to extract the text from your PDF files.

Step 2

Feed the text extracted in Step 1 into the Lucene engine.

Step 3

Use Lucene to build the index, and then search that index for your keywords.
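
To give a feel for what Steps 2 and 3 amount to, here is a toy inverted index in plain Python. It is not Lucene's API and it does not read PDFs: the `documents` dictionary stands in for text that Step 1 would already have extracted, and the file names and contents in it are invented for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase words made of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each word to the set of file names whose text contains it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for word in tokenize(text):
            index[word].add(name)
    return index


def search(index, query):
    """Return the sorted file names that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return []
    matches = set(index.get(words[0], set()))
    for word in words[1:]:
        matches &= index.get(word, set())
    return sorted(matches)


# Hypothetical output of Step 1: file name -> extracted text.
documents = {
    "magazine_1975.pdf": "Sports day results and the headmaster's report",
    "magazine_1980.pdf": "Cricket results and the school play",
}

index = build_index(documents)
print(search(index, "results"))          # ['magazine_1975.pdf', 'magazine_1980.pdf']
print(search(index, "cricket results"))  # ['magazine_1980.pdf']
```

A real engine such as Lucene adds a great deal on top of this (ranking, phrase queries, stemming, an index stored on disk), but the basic idea of looking words up in a prebuilt table rather than rescanning every file is the same.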

The above assumes you are a developer and comfortable with Java. Lucene is a library/API, not a complete product, so you need to write code to interface with it. If you want an out-of-the-box index/search server built on Lucene, you can try Apache Solr (a search server) or Apache Nutch (a web crawler with search), which are finished products ready for use.

You can visit the websites below to learn more.

http://lucene.apache.org/

http://nutch.apache.org/ - promoted to top-level Apache project 11 May 2010

http://tika.apache.org/ - promoted to top-level Apache project 11 May 2010

In time, all of them should be good open-source alternatives.

@spiresgate (author, May 14, 2010): Thanks so much for the quick reply. There's much food for thought, and I will explore Lucene.