RadioBanter

Caveat Lector May 22nd 05 01:40 AM

Question on web spiders
 
This may not be the right place, but there seem to be several web experts
here.

Can web spiders read and harvest e-mail addresses from a PDF file?

Many users and sites like QRZ.com list e-mail addresses as JPEG images
rather than ASCII text, and this seems to work.

So for PDF files that don't resort to a JPEG, are ASCII text addresses
harvestable?

Thanks



--
CL -- I doubt, therefore I might be !

Mike Andrews May 22nd 05 03:35 AM

Caveat Lector wrote:
> This may not be the right place, but there seem to be several web
> experts here.
>
> Can web spiders read and harvest e-mail addresses from a PDF file?
>
> Many users and sites like QRZ.com list e-mail addresses as JPEG images
> rather than ASCII text, and this seems to work.
>
> So for PDF files that don't resort to a JPEG, are ASCII text addresses
> harvestable?

Yes, in the sense that Optical Character Recognition (OCR) programs _can_
read text out of an image. In practice, it's not worth the spammers' or
web spider operators' trouble -- or that's been my experience, anyway.

YMMV.

--
Mike Andrews, W5EGO

Tired old sysadmin

Paul Rubin May 22nd 05 03:43 AM

(Mike Andrews) writes:
>> So for PDF files that don't resort to a JPEG, are ASCII text addresses
>> harvestable?
>
> Yes, in the sense that Optical Character Recognition (OCR) programs _can_
> read text out of an image. In practice, it's not worth the spammers' or
> web spider operators' trouble -- or that's been my experience, anyway.

PDF files contain the underlying text strings and search engines index
them without OCR'ing. Whether spammers bother, I don't know.
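
To illustrate the point about underlying text strings, here is a minimal Python sketch of how a harvester could pull addresses out of a text-based PDF with no OCR at all. It is only a sketch under assumptions: the names `harvest_emails` and `fake_pdf_fragment` and the address cl@example.org are made up for illustration, the fragment is not a complete PDF, and only Flate-compressed or uncompressed streams with plain ASCII strings are handled. Real PDFs can use other filters, font-specific encodings, or strings split into pieces, none of which this covers.

```python
import re
import zlib

# Loose pattern for e-mail-like byte strings (an approximation, not RFC 5322).
EMAIL_RE = re.compile(rb"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Everything between the "stream" and "endstream" keywords of a PDF object.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def harvest_emails(pdf_bytes):
    """Return sorted e-mail-like strings from the raw bytes and inflated streams."""
    chunks = [pdf_bytes]
    for match in STREAM_RE.finditer(pdf_bytes):
        try:
            # decompressobj tolerates the trailing newline before "endstream".
            chunks.append(zlib.decompressobj().decompress(match.group(1)))
        except zlib.error:
            pass  # stream is not Flate-compressed; its raw bytes are already scanned
    found = set()
    for chunk in chunks:
        found.update(m.decode("ascii") for m in EMAIL_RE.findall(chunk))
    return sorted(found)


def fake_pdf_fragment(text, compress):
    """Build a hypothetical, incomplete PDF fragment with one text-showing operator."""
    content = b"BT /F1 12 Tf 72 720 Td (" + text.encode("ascii") + b") Tj ET"
    if compress:
        content = zlib.compress(content)
    header = b"%PDF-1.4\n4 0 obj\n<< /Length " + str(len(content)).encode() + b" >>\n"
    return header + b"stream\n" + content + b"\nendstream\nendobj\n"


if __name__ == "__main__":
    for compress in (False, True):
        fragment = fake_pdf_fragment("cl@example.org", compress)
        print(compress, harvest_emails(fragment))
```

Either way the address comes back as ordinary text: compression only adds an inflate step, which is cheap compared with OCR. Whether spammers actually run something like this is a separate question.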

Mike Andrews May 23rd 05 02:46 AM

Paul Rubin wrote:
> (Mike Andrews) writes:
>>> So for PDF files that don't resort to a JPEG, are ASCII text addresses
>>> harvestable?
>>
>> Yes, in the sense that Optical Character Recognition (OCR) programs _can_
>> read text out of an image. In practice, it's not worth the spammers' or
>> web spider operators' trouble -- or that's been my experience, anyway.
>
> PDF files contain the underlying text strings and search engines index
> them without OCR'ing. Whether spammers bother, I don't know.

Hi, Paul. Long time no see.

Depends on whether they're text-based PDF or image-based PDF. If I scan
a page into a JPEG or TIFF and then convert that to PDF, it may not have
any of the text as text, and I think it's improbable that it will.
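
The text-based versus image-based distinction can be checked roughly in code. The following Python sketch is a heuristic under assumptions, not a real PDF parser: it inflates any Flate-compressed streams and looks for the PDF text-showing operators `Tj` and `TJ`. The function names and both sample fragments are hypothetical, and the "scan" here is just placeholder bytes standing in for an embedded JPEG.

```python
import re
import zlib

# Everything between the "stream" and "endstream" keywords of a PDF object.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)
# A string or array operand followed by a text-showing operator.
TEXT_OP_RE = re.compile(rb"\)\s*Tj|\]\s*TJ")


def inflate_streams(pdf_bytes):
    """Yield each stream body, inflated when it is Flate-compressed."""
    for match in STREAM_RE.finditer(pdf_bytes):
        raw = match.group(1)
        try:
            yield zlib.decompressobj().decompress(raw)
        except zlib.error:
            yield raw  # uncompressed, or another filter such as DCTDecode (JPEG)


def looks_text_based(pdf_bytes):
    """Heuristic: True if any stream contains a text-showing operator."""
    return any(TEXT_OP_RE.search(body) for body in inflate_streams(pdf_bytes))


def wrap_stream(dictionary, body):
    """Wrap bytes in a hypothetical, incomplete PDF stream object."""
    return b"%PDF-1.4\n4 0 obj\n" + dictionary + b"\nstream\n" + body + b"\nendstream\nendobj\n"


if __name__ == "__main__":
    text_page = wrap_stream(
        b"<< /Filter /FlateDecode >>",
        zlib.compress(b"BT /F1 12 Tf 72 720 Td (cl@example.org) Tj ET"),
    )
    scanned_page = wrap_stream(
        b"<< /Subtype /Image /Filter /DCTDecode >>",
        b"\xff\xd8\xff\xe0 placeholder for JPEG data \xff\xd9",
    )
    print(looks_text_based(text_page))
    print(looks_text_based(scanned_page))
```

A scanned page that was later run through OCR software may carry a hidden text layer, in which case it would count as text-based here as well.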

--
Mike Andrews, W5EGO

Tired old sysadmin

Paul Rubin May 23rd 05 02:50 AM

(Mike Andrews) writes:
> Depends on whether they're text-based PDF or image-based PDF. If I scan
> a page into a JPEG or TIFF and then convert that to PDF, it may not have
> any of the text as text, and I think it's improbable that it will.


Oh, I see, yes that would be about the same as a TIFF, but I don't
understand why you'd bother.


All times are GMT +1.