#1
Old stamp magazines - scan - OCR - web images - any ideas?
Does anyone have any experience and suggestions on how to better scan in and process old stamp collecting newspapers for the web?

Questions:

a. What dpi resolution is needed for decent scans of 6 point text? (Old newspapers use smaller type)
b. What is the best way to process the grayscale image down from 256 shades to monochrome?
c. Is decent quality OCR possible on a gray background newsprint 6 point text at anything below 300 dpi?
d. What is the best way to scan in pages larger than your scanner (e.g., 11x13 inches on an 8.5x11 inch flatbed scanner)?
e. Is there a better way to get straight scans? I line up the edge of the paper with the edge of the scanner glass and close the lid but keep getting scans tilted. This is troublesome because old newspapers are usually not printed exactly straight horizontally or vertically due to bending of the paper during printing.

For example, I processed a public domain 1911 issue of Meekel's Weekly Stamp News as follows:

a. Scan top half of a page in 300 dpi grayscale with de-screen turned on
b. Scan bottom half of a page - same settings
c. Rotate both scans using GIMP (GNU Image Manipulation Program)
d. Join images together by hand:
   d1. Open top half image
   d2. Double its height
   d3. Open bottom half image
   d4. Copy bottom half image into blank area below top half image
   d5. (troublesome) Join the two halves by trimming off the top part of the bottom image. I tried a photo stitch program but it failed unless each image was exactly straight.
e. Smooth background colors (grays)
f. Adjust histogram
g. Adjust curve
h. Threshold at about 160 out of 255 to get most of the black colors
i. Save as a monochrome bitmap file and a monochrome gif file

Thanks
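Steps d and h above can be sketched in a few lines of Python. This is only an illustration under assumptions that are not in the post: images are plain lists of rows of 0-255 gray values, both halves are already straightened and equally wide, and the number of repeated rows (`overlap_rows`) is known. The function names are hypothetical and have nothing to do with GIMP's own tools.

```python
def join_halves(top, bottom, overlap_rows):
    """Stack two scans vertically, dropping the rows at the top of
    `bottom` that repeat the end of `top` (step d5)."""
    if len(top[0]) != len(bottom[0]):
        raise ValueError("both halves must have the same width")
    return top + bottom[overlap_rows:]


def threshold(gray, cutoff=160):
    """Reduce 0-255 gray values to 1-bit (step h):
    0 (black) below the cutoff, 1 (white) at or above it."""
    return [[0 if value < cutoff else 1 for value in row] for row in gray]


# Tiny 4-pixel-wide stand-ins for real scans (0 = black ink, 255 = white paper).
top_half = [
    [200, 40, 60, 210],
    [190, 150, 170, 205],
]
bottom_half = [
    [190, 150, 170, 205],  # repeats the last row of top_half
    [30, 220, 215, 100],
]

page = join_halves(top_half, bottom_half, overlap_rows=1)
mono = threshold(page, cutoff=160)

# Print the monochrome page: '#' for black, '.' for white.
for row in mono:
    print("".join("#" if px == 0 else "." for px in row))
```

Doing the join on the grayscale data and thresholding last keeps the cutoff identical across the seam, which a per-half threshold would not guarantee.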
#2
Whew, some tough questions there
Just a few observations. Upgrading your scanning equipment can be one way, if you have deep pockets. I have seen machines that will accommodate A3 and automatically turn the pages of books/magazines.

For us mere mortals, the standard scanner today can usually fit magazines OK. Mine (Epson Perfection 1670) demands a minimum of 266 dpi for OCR, and successfully translates even the most toned newsprint with small type. For an A4 scan, I usually use 150 dpi for a full page and save as a *.jpg, which translates to about 200 KB per scan. I never use grayscale; I just scan all my newspapers as "reflective" document type and "photo" auto exposure type.

Keep an eye out for new machine releases. Fujitsu has a new model called "ScanSnap", I believe.

Not much help, just a few observations. Good luck.
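To put the 150 dpi and 266 dpi figures in perspective, here is a rough sketch of the arithmetic for an A4 page (210 x 297 mm). The helper names are made up for illustration, and the 24-bit color assumption is a guess based on the "photo" setting mentioned above, not something the post states.

```python
MM_PER_INCH = 25.4


def scan_pixels(width_mm, height_mm, dpi):
    """Pixel dimensions of a page scanned at the given resolution."""
    return (
        round(width_mm / MM_PER_INCH * dpi),
        round(height_mm / MM_PER_INCH * dpi),
    )


def raw_bytes(width_px, height_px, bytes_per_pixel):
    """Uncompressed image size in bytes."""
    return width_px * height_px * bytes_per_pixel


A4_MM = (210, 297)

for dpi in (150, 266, 300):
    w, h = scan_pixels(*A4_MM, dpi)
    color = raw_bytes(w, h, 3)  # assumption: 24-bit color, 3 bytes per pixel
    print(f"{dpi} dpi: {w} x {h} px, {color / 1_000_000:.1f} MB uncompressed")
```

Under those assumptions a 150 dpi page is about 6.5 MB before compression, so a 200 KB JPEG implies heavy compression, which may matter if the same file is later fed to OCR.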
#3
| a. What dpi resolution is needed for decent scans of 6 point text? (Old newspapers use smaller type)

150 will do; you might try 200.

| b. What is the best way to process the grayscale image down from 256 shades to monochrome?

Don't. Leave it grayscale, or you lose too much info. Maybe you can bring the file size down by going to, e.g., 64 gray shades; you can try.

| c. Is decent quality OCR possible on a gray background newsprint 6 point text at anything below 300 dpi?

Again, try. Automatic contrast settings on most scanners do a decent job, but the question is too general for a specific answer.

| d. What is the best way to scan in pages larger than your scanner (e.g., 11x13 inches on an 8.5x11 inch flatbed scanner)?

Dunno.

| e. Is there a better way to get straight scans? I line up the edge of the paper with the edge of the scanner glass and close the lid but keep getting scans tilted. This is troublesome because old newspapers are usually not printed exactly straight horizontally or vertically due to bending of the paper during printing.

If you're willing to buy a sheet feeder and loosen the leaves of the magazines, you could try that. Otherwise it's a lot of manual labor.

Good luck
Jan
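Two of the suggestions above lend themselves to a quick check: how many pixels tall 6 point type is at a given dpi, and what reducing 256 gray shades to 64 looks like. The sketch below uses hypothetical helper names and treats an image as a list of rows of 0-255 values. Note that the point size is the full body height of the type, so the letters themselves are smaller than the figure computed here.

```python
POINTS_PER_INCH = 72


def text_height_px(point_size, dpi):
    """Body height of type, in pixels, at the given scan resolution."""
    return point_size * dpi / POINTS_PER_INCH


def quantize(gray, levels=64):
    """Reduce 0-255 gray values to `levels` evenly spaced shades
    by rounding each value down to the start of its bucket."""
    step = 256 // levels
    return [[(value // step) * step for value in row] for row in gray]


for dpi in (150, 200, 300):
    print(f"6 pt at {dpi} dpi: {text_height_px(6, dpi):.1f} px")

sample = [[0, 3, 4, 130, 255]]
print(quantize(sample, levels=64))
```

Whether the pixel heights at 150 or 200 dpi are enough for a particular OCR program is something only a trial scan will show, as the reply says.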
#4
| For example, I processed a public domain 1911 issue of Meekel's Weekly
| Stamp News as follows:

You seem to be aware of the copyright issue, since publishers are likely to be on to you like the proverbial ton of bricks if you place anything still in copyright on the web without permission.

Best of luck,
Roger
#5
#6
That's news! How do you do that?
| If you have Adobe Acrobat or PaperPort, you can save them in PDF format.
| You can extract the text from these as well.
| Al
#7
To extract text from an Adobe document, choose the "select text" tool, select the text you want, press CTRL+C, open the program where you want to save the text, and press CTRL+V to paste it. Formatting usually doesn't hold well, but it works.

If it's a protected Acrobat page on display, you can use a screen image capture program like Kleptomania: draw a rectangle around the portion you want on the screen and copy it as an image to any image software. Resolution is limited to screen resolution, but it's good enough to OCR if the text isn't too small. With most of these programs you can only capture what's visible, so you may wind up pasting the parts together if the Acrobat image goes beyond the screen width or height.

LN

On Tue, 1 Mar 2005 14:16:49 +0800, "Rodney" wrote:
| That's news! How do you do that?
#8
You purchase either software product and then perform the capture using the scanner. Both programs create a PDF file. From this you can extract text. Creating the PDF involves running OCR on the scan, so the result is not always 100% perfect.

Acrobat allows you to extract text from any non-protected PDF, but there are shareware/freeware programs that can do it as well. Acrobat is pricey in any version. I am sure other PDF generation programs can do this as well.

For example, the Scott Catalogue in PDF format is protected, so you cannot readily extract the text.

The advantage of this capture is that it is a way to take your clippings or web pages and save them electronically on your PC rather than leaving them to yellow (like clippings from newsprint) in a folder in your desk. The other advantage is that you can search on text, etc.

I plan to write an article on this in an upcoming issue of "The Compulatelist", the quarterly newsletter of PCSG (www.pcsg.org).

Al

Rodney wrote:
| That's news! How do you do that?
#9
I'll retain your response.

I had a feeling the route was to OCR the capture, which to my mind is not really "extracting text" per se. I now understand some PDF files are not protected, but they must be in the minority. I was reading a recent topical thread on the Win98se.discuss NG, and none of their vast readership offered a solution for extracting text (normally) from a PDF.

Thanks for the extended reply.
#10
These are the specifications of the Fujitsu "ScanSnap", if you have approx. $800 burning a hole in your pocket.

Description

The ScanSnap fi-5110EOX is a true one-touch solution which enables you to scan directly to PDF, email or file with the touch of a button. This is a 50-page Automatic Document Feeder (ADF) with fast monochrome and color scan rates of up to 15 pages per minute (ppm) / 30 images per minute (IPM) and 600dpi optical resolution. It features automatic color, page size and length detection.

ScanSnap automatically straightens and aligns text and images into their correct orientation and also automatically detects the following paper sizes: A4, B5, A5, B6, A6, Business Card, Legal and Letter. Since ScanSnap automatically recognizes and eliminates blank pages, each scanning job progresses smoothly, even when scanning combinations of one-sided and two-sided documents. ScanSnap automatically separates color documents from black and white ones and saves this information in highly compressed files, thus saving storage space.

Key Features
Type: Path-Through Scanner
Interface: USB 2.0
Optical Resolution: 1200 dpi
Max. Resolution (Hardware): 1200 x 1200 dpi
Max. Resolution (Interpolated): 600 x 600 dpi
Platform: PC

Technical Features
Form Factor: Desktop
Scan Element Type: CCD
Input Type: Color
Special Features: OCR Capability
Automatic Document Feeder Capacity: 50 Pages

Media
Supported Media Type: Business Cards, Plain Paper
Media Load Type: Automatic Document Feeder
Max. Supported Media Size: Legal (216 x 356 mm)

System Requirements
Operating System: Microsoft Windows 2000, Microsoft Windows 98, Microsoft Windows 98 SE, Microsoft Windows XP Home Edition, Microsoft Windows XP Professional

Dimensions
Width: 11.2 in.
Depth: 5.7 in.
Height: 5.9 in.
Weight: 5.9 lb.

Warranty
Warranty: 1 Year

Miscellaneous
Included Accessories: Automatic Document Feeder, USB Cable, Adobe Acrobat, ScanSnap Scanning Software, PFU Business CardMinder, ScanSnap Specific drivers