Showing posts with label image.

9.25.2011

Extract text from PDFs and images with gImageReader


gImageReader

gImageReader is a graphical GTK frontend to tesseract-ocr, a free software optical character recognition (OCR) engine.

Tesseract is a raw OCR engine, with no document layout analysis, no output formatting and no graphical user interface (GUI).

gImageReader takes an image or PDF file and turns it into text. It can open multipage PDF files and images in common formats, lets you select columns or individual parts of the document, sends the selected area to Tesseract for recognition, and spell-checks the output.


Optional: Install Tesseract OCR 3.0 SVN in Ubuntu Lucid and Maverick

Tesseract OCR 3.0 is still in development, but in my tests it worked much better than the current stable version. Furthermore, the PPA below comes with a lot of extra Tesseract language files, so I suggest installing the latest Tesseract OCR 3.0 SVN. This is, however, optional!

Warning: you must add the PPA, install the latest Tesseract and then disable the PPA as it contains a lot of bleeding edge packages!

Add the PPA and install Tesseract OCR 3.0 SVN:
sudo add-apt-repository ppa:alex-p/notesalexp
sudo apt-get update
sudo apt-get install tesseract-ocr tesseract-ocr-eng

Tesseract OCR Romanian

You can install some extra languages from this PPA, such as Bulgarian, Catalan, Czech, Danish, German, Greek, Finnish, Indonesian, Hungarian, Italian, Dutch, Polish, Romanian, Spanish and so on. Simply search for "tesseract-ocr" in Synaptic and you should easily find all these packages - install the ones you'll need later on.
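If you'd rather not open Synaptic, the same thing can be done from a terminal. The snippet below is only a sketch: `install_tesseract_langs` is a made-up helper name, and it assumes the language packages in the PPA are named `tesseract-ocr-<code>` with three-letter codes (for example `ron` for Romanian or `deu` for German). Run `apt-cache search tesseract-ocr` first to see the actual package names on your system.

```shell
# Hypothetical helper: install Tesseract language packs by language code.
# Assumption: packages are named tesseract-ocr-<code>, e.g. tesseract-ocr-ron.
install_tesseract_langs() {
    # Rewrite each argument in place: "ron" becomes "tesseract-ocr-ron".
    for code in "$@"; do
        shift
        set -- "$@" "tesseract-ocr-$code"
    done
    sudo apt-get install "$@"
}
```

For example, `install_tesseract_langs ron deu` would ask apt-get to install the Romanian and German packs, provided those package names exist in your sources.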


Now you must disable the PPA: press ALT + F2 and enter:
gksu software-properties-gtk

Then, on the "Other Software" tab, look for the line(s) that say "http://ppa.launchpad.net/alex-p/notesalexp" and either disable or delete them.
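If you prefer the terminal, disabling a PPA comes down to commenting out its `deb` lines. PPAs added with add-apt-repository get their own file under /etc/apt/sources.list.d/. The exact file name depends on the PPA and your Ubuntu release, so the sketch below takes the path as an argument rather than guessing it. `disable_ppa_file` is a made-up helper name, not something shipped with Ubuntu.

```shell
# Hypothetical helper: comment out every active entry in one APT source file.
# Usage: disable_ppa_file /etc/apt/sources.list.d/<file>.list
disable_ppa_file() {
    # "deb" and "deb-src" lines both start with "deb".
    # A backup of the original file is kept next to it as <file>.bak.
    sudo sed -i.bak 's/^deb/# deb/' "$1"
}
```

To find the right file, list /etc/apt/sources.list.d/ and look for the one with "notesalexp" in its name, pass it to the helper, then run `sudo apt-get update` so apt forgets the PPA's package lists.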

gImageReader

gImageReader is available for Linux and Windows and can be downloaded from HERE (.deb, .rpm and .exe files are available).

To use gImageReader, select the PDF or image you want to extract the text from and click "Recognize all" for the whole page or use your mouse to draw a selection and then click "Recognize selection" to extract only a part of the document.

If you've installed the Tesseract OCR language package for the language of the PDF or image you're trying to open, gImageReader will automatically detect the language.
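Since gImageReader is only a frontend, you can also try the same recognition with Tesseract directly, which is handy for checking that a language pack is really installed. The sketch below wraps the usual `tesseract IMAGE OUTPUTBASE -l LANG` call. `ocr_image` is a made-up helper name and scan.png is a placeholder file. Note that this works on images only, and without gImageReader's area selection or spell check.

```shell
# Hypothetical helper: OCR one image with the tesseract command-line tool.
# Usage: ocr_image IMAGE [LANG]   (LANG defaults to "eng")
ocr_image() {
    image="$1"
    lang="${2:-eng}"
    # Strip the file extension; tesseract appends ".txt" to the output base itself.
    base="${image%.*}"
    # On success, print the path of the text file that was written.
    tesseract "$image" "$base" -l "$lang" && printf '%s\n' "$base.txt"
}
```

For example, `ocr_image scan.png ron` should leave the recognized Romanian text in scan.txt, assuming the Romanian language data is installed.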

Thanks to LFFL for the gImageReader tip!


8.23.2011

Twitter Enhances User Profiles With Image Galleries



Twitter will start rolling out its 'user galleries' from Monday. Galleries will display the 100 most-recent images the user has tweeted — dating back to January 1, 2010 — from supported photo-sharing services.

Photos can be posted to Twitter via its new photo-uploading tool or through a third-party photo-sharing service such as yFrog, TwitPic or Instagram.

Galleries will live on a user’s profile and highlight a few recent images. A visitor can click the “view all” button to see even more images in either a grid view showing image thumbnails or a detail view highlighting the most-recent image and the text of the tweet that was shared along with it.

The update ties into Twitter’s photo-sharing push and will dramatically change the appearance of Twitter profiles. Galleries will provide equal billing to images shared via third-party app makers but also serve to remind users that Twitter is no longer a place just for 140 characters — it’s for photos, too. The update is likely designed to entice Twitter users to add more photos to their tweets.

Galleries, at launch, will be image-only. Twitter Communications Manager Carolyn Penner said in a tweet that users can expect to see the update Monday. “We’re rolling out one of my fave features today: user galleries! View photos an account has shared on Twitter. Sit tight – it’s coming soon,” she tweeted.
