Old stamp magazines - scan - OCR - web images - any ideas?

#1 February 28th 05, 05:10 AM

Does anyone have any experience and suggestions on how to better scan
in and process old stamp collecting newspapers for the web?

Questions:

a. What dpi resolution is needed for decent scans of 6 point text? (Old
newspapers use smaller type)

b. What is the best way to process the grayscale image down from 256
shades to monochrome?

c. Is decent quality OCR possible on a gray background newsprint 6
point text at anything below 300 dpi?

d. What is the best way to scan in pages larger than your scanner
(e.g., 11x13 inches on an 8.5x11 inch flatbed scanner)?

e. Is there a better way to get straight scans? I line up the edge of
the paper with the edge of the scanner glass and close the lid but keep
getting scans tilted. This is troublesome because old newspapers are
usually not printed exactly straight horizontally or vertically due to
bending of the paper during printing.

For example, I processed a public domain 1911 issue of Mekeel's Weekly
Stamp News as follows:

a. Scan top half of a page in 300 dpi grayscale with de-screen turned
on
b. Scan bottom half of a page - same settings
c. Rotate both scans using GIMP (GNU Image Manipulation Program)
d. Join images together by hand:
d1. Open top half image
d2. Double its height
d3. Open bottom half image
d4. Copy bottom half image into blank area below top half image
d5. (troublesome) Join the two halves by trimming off the top part of
the bottom image. I tried a photo stitch program but it failed unless
each image was exactly straight.
e. Smooth background colors (grays)
f. Adjust histogram
g. Adjust curve
h. Threshold at about 160 out of 255 to get most of the black colors
i. Save as a monochrome bitmap file and a monochrome gif file
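
Step h can be sketched in a few lines. This is only an illustration, not the GIMP procedure itself: it assumes the page is already loaded as rows of 0-255 gray values with dark text on lighter paper, and the names `threshold` and `otsu_cutoff` are made up for the example. Besides the fixed cutoff of 160, it shows Otsu's method, a standard way of picking a cutoff from the histogram, as a swapped-in alternative to choosing the value by eye.

```python
from typing import List

Gray = List[List[int]]  # rows of 0-255 grayscale values (assumed input format)


def threshold(image: Gray, cutoff: int = 160) -> List[List[int]]:
    """Return a 1-bit image: 0 (black) where the pixel is below cutoff, else 1 (white)."""
    return [[0 if px < cutoff else 1 for px in row] for row in image]


def otsu_cutoff(image: Gray) -> int:
    """Pick a cutoff by maximising the between-class variance (Otsu's method)."""
    hist = [0] * 256
    for row in image:
        for px in row:
            hist[px] += 1

    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_cutoff, best_variance = 0, -1.0
    dark_count = 0  # pixels at or below the candidate level
    dark_sum = 0    # sum of their gray values
    for level in range(256):
        dark_count += hist[level]
        dark_sum += level * hist[level]
        light_count = total - dark_count
        if dark_count == 0 or light_count == 0:
            continue
        dark_mean = dark_sum / dark_count
        light_mean = (sum_all - dark_sum) / light_count
        variance = dark_count * light_count * (dark_mean - light_mean) ** 2
        if variance > best_variance:
            best_variance = variance
            best_cutoff = level + 1  # pixels <= level count as black
    return best_cutoff


# Tiny made-up sample: ink around 40, paper around 200.
sample = [[40, 45, 200, 210], [42, 205, 198, 41]]
fixed = threshold(sample)                       # cutoff 160, as in step h
adaptive = threshold(sample, otsu_cutoff(sample))
```

On toned newsprint the paper gray varies from page to page, so a histogram-based cutoff may need less hand tuning than one fixed value; whether it beats 160 on a given scan is something to check by eye.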

Thanks

#2 February 28th 05, 05:40 AM

Whew, some tough questions there

Just a few observations.
Upgrading your scanning equipment is one way,
if you have deep pockets.
I have seen machines that accommodate A3 and automatically
turn the pages of books and magazines.

For us mere mortals, the standard scanner today can usually
fit magazines OK. Mine (Epson Perfection 1670) demands a minimum
of 266 dpi for OCR, and successfully translates even the most toned
newsprint with small type.

For an A4 scan, I usually use 150 dpi for a full page and save as a *.jpg,
which comes to about 200 KB per scan. I never use grayscale;
I just scan all my newspapers as "reflective" document type
and "photo" auto exposure type.
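
To put those numbers in perspective, here is a small worked calculation. It assumes nothing beyond the standard A4 size (210 x 297 mm) and 25.4 mm per inch; the function names are invented for the example, and the 24-bit color figure is an assumption about what the "photo" setting produces.

```python
MM_PER_INCH = 25.4


def pixel_size(width_mm: float, height_mm: float, dpi: int) -> tuple:
    """Pixel dimensions of a page scanned at the given resolution."""
    return (round(width_mm / MM_PER_INCH * dpi),
            round(height_mm / MM_PER_INCH * dpi))


def raw_bytes(width_px: int, height_px: int, channels: int = 1) -> int:
    """Uncompressed size at 8 bits per channel (1 = grayscale, 3 = color)."""
    return width_px * height_px * channels


a4_width, a4_height = pixel_size(210, 297, 150)   # (1240, 1754)
gray_size = raw_bytes(a4_width, a4_height, 1)     # 2,174,960 bytes
color_size = raw_bytes(a4_width, a4_height, 3)    # 6,524,880 bytes
```

So if the scan is 24-bit color, a 200 KB JPEG implies compression of roughly 30:1. That is worth keeping in mind for 6 point text, since JPEG artifacts tend to gather around sharp edges such as small type.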

Keep an eye out for new machine releases. Fujitsu has a new model called
"ScanSnap", I believe.

Not much help, just a few observations

Good luck.


#3 February 28th 05, 02:18 PM

| a. What dpi resolution is needed for decent scans of 6 point text? (Old
| newspapers use smaller type)

150 will do, you might try 200
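
A quick way to compare these resolutions is to work out how many pixels tall 6 point type comes out. The sketch below uses only the definition of the point (72 points per inch); `type_height_px` is a name made up for the example, and the figure it gives is the full body height of the type, so the lowercase letters themselves are noticeably smaller.

```python
POINTS_PER_INCH = 72


def type_height_px(point_size: float, dpi: int) -> float:
    """Body height of type, in pixels, at a given scan resolution."""
    return point_size * dpi / POINTS_PER_INCH


heights = {dpi: type_height_px(6, dpi) for dpi in (150, 200, 266, 300)}
for dpi, px in heights.items():
    print(f"{dpi} dpi: 6 pt type is about {px:.1f} px tall")
```

At 150 dpi the body height is 12.5 px, at 300 dpi it is 25 px. Whether 12.5 px is enough depends on the goal: it can be readable on screen, while OCR software generally does better with more pixels per character.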

| b. What is the best way to process the grayscale image down from 256
| shades to monochrome?

Don't. Leave it grayscale, or you lose too much info. You may be able to
shrink the files by reducing to, e.g., 64 gray levels; it's worth a try.
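
Reducing the number of gray levels can be sketched as follows. This is a minimal illustration that assumes the image is held as rows of 0-255 values; `quantize` is a made-up helper, and a real image editor would normally do this through its posterize or indexed-color command.

```python
from typing import List


def quantize(image: List[List[int]], levels: int = 64) -> List[List[int]]:
    """Map 0-255 gray values onto `levels` evenly spaced shades."""
    if not 2 <= levels <= 256:
        raise ValueError("levels must be between 2 and 256")
    step = 256 / levels
    result = []
    for row in image:
        new_row = []
        for px in row:
            index = min(int(px / step), levels - 1)   # which shade bucket
            new_row.append(round(index * 255 / (levels - 1)))
        result.append(new_row)
    return result


ramp = [list(range(256))]           # one row holding every gray value
reduced = quantize(ramp, 64)
distinct_shades = len(set(reduced[0]))
```

The saving only materializes in formats that benefit from fewer distinct values, such as palette-based GIF or PNG; it is worth comparing file sizes on one page before converting a whole issue.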

| c. Is decent quality OCR possible on a gray background newsprint 6
| point text at anything below 300 dpi?

Again, try. Automatic contrast settings for most scanners do a decent job,
but the question is too general for a specific answer.

| d. What is the best way to scan in pages larger than your scanner
| (e.g., 11x13 inches on an 8.5x11 inch flatbed scanner)?

Dunno.
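
One possible approach to the joining step (d5 in the first post) is to scan the two halves with a generous overlap and let a small script find where they match. The sketch below is an assumption-laden illustration, not a tested stitching tool: it assumes both halves are already straightened, have the same width and scale, and are held as rows of gray values. The names `find_overlap` and `join_halves` are invented for the example.

```python
from typing import List, Optional

Gray = List[List[int]]  # rows of 0-255 grayscale values


def row_difference(a: List[int], b: List[int]) -> float:
    """Mean absolute difference between two rows of equal width."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def find_overlap(top: Gray, bottom: Gray,
                 min_rows: int = 1, max_rows: Optional[int] = None) -> int:
    """Return the overlap height (in rows) where the two halves agree best."""
    limit = min(len(top), len(bottom))
    if max_rows is None or max_rows > limit:
        max_rows = limit
    best_rows, best_score = 0, float("inf")
    for k in range(min_rows, max_rows + 1):
        # Compare the last k rows of the top half with the first k of the bottom.
        score = sum(row_difference(top[len(top) - k + i], bottom[i])
                    for i in range(k)) / k
        if score < best_score:
            best_rows, best_score = k, score
    return best_rows


def join_halves(top: Gray, bottom: Gray, min_rows: int = 1) -> Gray:
    """Drop the duplicated rows from the bottom half and stack the halves."""
    overlap = find_overlap(top, bottom, min_rows=min_rows)
    return top + bottom[overlap:]


# Made-up 20-row "page", scanned as two halves sharing rows 8-11.
page = [[(r * 37 + c * 11) % 256 for c in range(8)] for r in range(20)]
top_half, bottom_half = page[:12], page[8:]
rejoined = join_halves(top_half, bottom_half)
```

On real scans the match is never exact, and blank margins can match each other by accident, so setting `min_rows` to a few dozen rows and checking the seam by eye is advisable.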

| e. Is there a better way to get straight scans? I line up the edge of
| the paper with the edge of the scanner glass and close the lid but keep
| getting scans tilted. This is troublesome because old newspapers are
| usually not printed exactly straight horizontally or vertically due to
| bending of the paper during printing.

If you're willing to buy a sheet feeder and loosen the leaves of the
magazines, you could try that. Otherwise it's a lot of manual labor.
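
For the manual route, the tilt can at least be measured rather than guessed. A minimal sketch, assuming you read off the pixel coordinates of two points on something that should be horizontal (a column rule or a line of text) in your image editor; `skew_degrees` is a made-up name, and the coordinates use the usual image convention of y increasing downward.

```python
import math


def skew_degrees(x1: float, y1: float, x2: float, y2: float) -> float:
    """Angle of the line through two points, relative to horizontal.

    With y increasing downward, a positive result means the right-hand
    end of the line sits lower on the screen than the left-hand end.
    """
    return math.degrees(math.atan2(y2 - y1, x2 - x1))


# Example: a printed rule that drops 10 px over a run of 1000 px.
tilt = skew_degrees(0, 0, 1000, 10)
correction = -tilt   # rotate by the opposite angle to level the line
```

Here the tilt is a little under 0.6 degrees. Sign conventions for rotation differ between programs, so try the correction on one page and flip the sign if the tilt gets worse. Since the printing itself is often crooked, measuring along the text rather than the paper edge is what matters.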

Good luck
Jan

#4 February 28th 05, 07:56 PM

| For example, I processed a public domain 1911 issue of Mekeel's Weekly
| Stamp News as follows:

You seem to be aware of the copyright issue, which is just as well:
publishers are likely to come down on you like the proverbial ton of bricks
if you place anything still in copyright on the web without permission.

Best of luck, Roger

#5 March 1st 05, 02:33 AM

If you have Adobe Acrobat or PaperPort, you can save them in PDF format.
You can extract the text from these as well.

Al


#6 March 1st 05, 06:16 AM

That's news! How do you do that?

| If you have Adobe Acrobat or PaperPort, you can save them in PDF format.
| You can extract the text from these as well.
| Al

#7 March 1st 05, 10:49 PM

To extract text from an Adobe document, choose the "select text"
tool, select the text you want, press CTRL+C, open the program where
you want to save the text, and press CTRL+V to paste it. Formatting
usually doesn't hold up well, but it works.

If it's a protected Acrobat page on display, you can use a screen
image capture program like Kleptomania: draw a rectangle around the
portion of the screen you want and copy it as an image to any
image software. Resolution is limited to screen resolution, but it's
good enough to OCR if the text isn't too small. Most of these programs
capture only what's visible, so you may wind up pasting the parts
together if the Acrobat page extends beyond the screen width or
height.

LN


#8 March 2nd 05, 01:05 AM

You purchase either software product and then perform the capture using
the scanner. Both programs create a PDF file, from which you can extract
text. The text comes from OCR performed on the scan when the PDF is
created, so it is not always 100% accurate.

Acrobat lets you extract text from any non-protected PDF, but there are
shareware/freeware programs that can do it as well. Acrobat is pricey
in any version. I am sure other PDF generation programs can do this as
well.
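
As one illustration of the free route, the `pdftotext` command-line tool (part of Xpdf and Poppler) can be driven from a short script. This is a sketch under assumptions: the tool has to be installed separately, the PDF must be unprotected and already contain a text layer (a plain scan needs OCR first), and the file names are placeholders, not real files.

```python
import shutil
import subprocess
from typing import List


def pdftotext_command(pdf_path: str, txt_path: str,
                      keep_layout: bool = True) -> List[str]:
    """Build the pdftotext command line for one PDF file."""
    command = ["pdftotext"]
    if keep_layout:
        command.append("-layout")   # try to preserve the column layout
    command += [pdf_path, txt_path]
    return command


def extract_text(pdf_path: str, txt_path: str) -> bool:
    """Run pdftotext if it is installed; return False if it is missing."""
    if shutil.which("pdftotext") is None:
        return False
    subprocess.run(pdftotext_command(pdf_path, txt_path), check=True)
    return True


# Placeholder file names, for illustration only.
example = pdftotext_command("clipping.pdf", "clipping.txt")
```

As with copy and paste, formatting tends to suffer, especially with multi-column newspaper pages, so expect some cleanup afterward.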

For example the Scott Catalogue in PDF format is protected so you cannot
readily extract the text.

The advantage of this capture is that you can take your clippings
or web pages and save them electronically on your PC rather than
leaving them to yellow (like clippings from newsprint) in a folder in
your desk. The other advantage is that you can search the text, etc.

I plan to write an article on this in an upcoming issue of "The
Compulatelist", the quarterly newsletter of PCSG (www.pcsg.org).

Al


#9 March 2nd 05, 02:23 AM

I'll retain your response.

I had a feeling the route was to OCR the capture, which to my
mind is not really "extracting text" per se.
I now understand that some PDF files are not protected, but they must be
in the minority.
I was reading a recent thread on this topic in the Win98se.discuss NG,
and none of its vast readership offered a solution for extracting text
(normally) from PDF.

Thanks for the extended reply.


#10 March 3rd 05, 03:47 AM

These are the specifications of the Fujitsu "ScanSnap",
if you have approx. $800 burning a hole in your pocket.

Description
The ScanSnap fi-5110EOX is a true one-touch solution which enables you to scan directly to PDF, email or file with the
touch of a button. This is a 50-page Automatic Document Feeder (ADF) with fast monochrome and color scan rates of up to
15 pages per minute (ppm) / 30 images per minute (IPM) and 600dpi optical resolution. It features automatic color, page
size and length detection. ScanSnap automatically straightens and aligns text and images into their correct orientation
and also automatically detects the following paper sizes: A4, B5, A5, B6, A6, Business Card, Legal and Letter. Since
ScanSnap automatically recognizes and eliminates blank pages, each scanning job progresses smoothly, even when scanning
combinations of one-sided and two-sided documents. ScanSnap automatically separates color documents from black and white
ones and saves this information in highly compressed files, thus saving storage space.

Key Features
  Type: Pass-Through Scanner
  Interface: USB 2.0
  Optical Resolution: 1200 dpi
  Max. Resolution (Hardware): 1200 x 1200 dpi
  Max. Resolution (Interpolated): 600 x 600 dpi
  Platform: PC

Technical Features
  Form Factor: Desktop
  Scan Element Type: CCD
  Input Type: Color
  Special Features: OCR Capability
  Automatic Document Feeder Capacity: 50 Pages

Media
  Supported Media Type: Business Cards, Plain Paper
  Media Load Type: Automatic Document Feeder
  Max. Supported Media Size: Legal (216 x 356 mm)

System Requirements
  Operating System: Microsoft Windows 98, Windows 98 SE, Windows 2000,
    Windows XP Home Edition, Windows XP Professional

Dimensions
  Width: 11.2 in.
  Depth: 5.7 in.
  Height: 5.9 in.
  Weight: 5.9 lb.

Warranty
  Warranty: 1 Year

Miscellaneous
  Included Accessories: Automatic Document Feeder, USB Cable, Adobe Acrobat,
    ScanSnap Scanning Software, PFU Business CardMinder,
    ScanSnap Specific drivers

