
Strip out content from Microsoft files using PHP

Has anyone tried using PHP to strip out textual content from Microsoft documents, such as Word or PowerPoint files?

We currently use “pdftotext” to strip out textual content from PDF documents, but since we are building a searchable file library it would be nice if we could also strip out and search the content of Microsoft files.
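For context, a PDF step like the one described might look roughly like the sketch below. The function names are hypothetical and the poster's actual setup is not shown; it assumes the `pdftotext` binary is on the PATH and that `shell_exec` is enabled. Passing `-` as the output file asks `pdftotext` to write the text to standard output.

```php
<?php
// Hypothetical helper: build the shell command for one PDF.
// escapeshellarg() quotes the path so spaces or odd characters are safe.
function build_pdftotext_command(string $pdfPath): string
{
    // "-" as the output file means "write the text to stdout".
    return 'pdftotext ' . escapeshellarg($pdfPath) . ' -';
}

// Hypothetical helper: return the extracted text, or null on failure.
function pdf_to_text(string $pdfPath): ?string
{
    $output = shell_exec(build_pdftotext_command($pdfPath));
    // shell_exec() returns a string only when the command produced output.
    return is_string($output) ? $output : null;
}
```

The returned text could then be stored in whatever index the file library searches.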

Anyone?

PHP

5 Comments

@ShrineDesigns, Jul 20, 2004: Don't use Microsoft products and you won't have to deal with this problem.
@iamlucky13, Jul 20, 2004: I thought I had a good idea, but after a quick check, I don't know how easy it would be to make it work.

If you open a Word document in Notepad, you'll see there are a lot of funny characters (e.g. ÕÍÕœ.), while the content appears as regular text. The idea was to search the files for letters and numbers only, and put those into a database. Unfortunately, words like Microsoft, Title, inventory, and table also show up in the file, so you would have to deal with those somehow as well. Plus, I only tried it with plain text. I don't know what you get when your .doc has images, tables, WordArt, etc. in it.
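A rough sketch of that idea in PHP, for what it's worth: scan the raw bytes for runs of letters and digits, then drop words from an ignore list. The function name, the minimum run length, and the ignore list are all made up for illustration, and it assumes the text is stored as single-byte characters, which may not hold for every .doc file.

```php
<?php
// Hypothetical helper: pull alphanumeric runs out of raw file bytes
// and drop any that appear in an ignore list (case-insensitive).
function extract_words(string $bytes, int $minLength = 4, array $ignore = []): array
{
    // No /u flag: the pattern works on raw bytes, so binary junk is fine.
    preg_match_all('/[A-Za-z0-9]{' . $minLength . ',}/', $bytes, $matches);

    $ignoreLower = array_map('strtolower', $ignore);
    $words = [];
    foreach ($matches[0] as $word) {
        if (!in_array(strtolower($word), $ignoreLower, true)) {
            $words[] = $word;
        }
    }
    // Remove duplicates and reindex before storing in a database.
    return array_values(array_unique($words));
}

// Made-up sample bytes standing in for a real .doc file;
// in practice you would use file_get_contents($path).
$sample = "\xD5\xCD\xD5\x9C.\x00Quarterly sales report\x00\x01Microsoft Word\x00Title";
print_r(extract_words($sample, 4, ['Microsoft', 'Title', 'Word']));
```

The ignore list is the weak point: a word like "table" may be file noise in one document and real content in another, so this is only a stopgap.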

Anyhow, Google does it, so it's obviously possible if you know the ancient secrets of the Perl Mages.
@Kyleva2204, Jul 20, 2004: There's a PHP program on hotscripts.com that can do this. I can't remember what it's called, but just look for it. I'm sure you can find it.
@Kyleva2204, Jul 20, 2004: http://www.hotscripts.com/Detailed/13628.html There's a little something that can strip content from MS Word, but I'm not sure if it can search and strip.
@pingu_letters (author), Jul 21, 2004: Cheers guys. As a temporary measure I did write a quick and dirty regular expression that pulled out letters, numbers and some punctuation, but I will now head over to hotscripts and see what I can find.