Foreign language OCR?

    I'm good at starting projects but never really completing them. So I think I'll start another one. Back in April 2024, a book was published, for the first time, in English ("Passcode to the Third Floor Secretariat"). So I don't need to use OCR and online translation portals for that one. But another book, "Fully Controlled Areas", ISBN 978-89-90959-28-7, has not been translated to English and probably never will be. (And I have got the only copy I will ever have, trust me). Lots of crude, hand-drawn illustrations of people starving and getting executed in North Korea. The text should be quite revealing. I neither read, speak nor understand Korean (or I would not be making this post).

    If I get a flatbed scanner, and hold the book down on it, should I use "Tesseract"? If the latest version, version 5, needs the latest libraries (i.e. won't work with Win7), I can always pull out an unused 500 GB SSD and install the latest Ubuntu. (AMD 8150, 64 GIGS of RAM, Radeon 6770 completely passive-cooled video, that should be enough.)

    Using the command line for each page does not bother me. It will be just like using the command line to convert "Corel PhotoCD" files to a JPG. After I get the text files (Korean characters can be viewed in Notepad for Win7, not sure about Office 97/03), then I can drop the text into an online translation web portal.
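A minimal sketch of how the page-by-page command line step could be scripted instead of typed by hand. It assumes the `tesseract` binary is on the PATH and that the Korean language data (`kor`) is installed; the folder names, the `*.png` pattern, and the helper names `build_command` and `ocr_folder` are made up for illustration.

```python
import subprocess
from pathlib import Path


def build_command(image: Path, out_dir: Path, lang: str = "kor") -> list[str]:
    """Return the tesseract command line for one scanned page."""
    # tesseract takes an output *base* name and appends ".txt" itself
    out_base = out_dir / image.stem
    return ["tesseract", str(image), str(out_base), "-l", lang]


def ocr_folder(scan_dir: Path, out_dir: Path,
               pattern: str = "*.png", lang: str = "kor") -> None:
    """Run tesseract once per scanned page, in file-name order."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for image in sorted(scan_dir.glob(pattern)):
        # check=True stops the batch on the first page tesseract rejects
        subprocess.run(build_command(image, out_dir, lang), check=True)
```

Calling `ocr_folder(Path("scans"), Path("text"))` would leave one UTF-8 `.txt` file per page in `text/`, ready to paste into a translation portal.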

    Does anybody have any experience in this area? The documentation seems to suggest Tesseract will recognize the text well.
    Last edited by Hondaman; 07-17-2024, 05:15 AM.

    #2
    I would use Tesseract, and if that won't work, try CuneiForm.



      #3
      I have used Tesseract on Ubuntu before, from the command line as well as in some scanner software (there are known but undocumented bugs I need to fix).
      E.g. multi-page scans somehow broke for my mom. There was another bug, but rather than mention it I went and fixed it, because that seemed like less work.

      Sometimes I have to crop or upscale an image to get good output. Sometimes just a basic upscale in GIMP is enough, though.
      * https://github.com/nagadomi/waifu2x
      Last edited by evilkitty; 07-17-2024, 11:57 AM.



        #4
        Here's an update to this project. I got a brand new CanoScan 8400 for just under $51 from eBay. It seems to work under Win7 only when you turn the scanner on AFTER Win7 boots up. And if Win7 reports problems with the TWAIN driver, this problem may survive rebooting. Also note that Windows <CTRL> <C>, <CTRL> <V>, etc might not work properly as long as the Korean keyboard is installed, even if I am typing in English. So use the mouse for Windows cut and paste.

        My workflow:

        The Canon software scans the file as a .BMP, then I use PS Elements 5 to cut out the pencil drawings of people getting executed, and the footers, and save a .JPG of the cropped text. Then I upload it to online dot easyscreenocr dot com. Sadly, this returns one big long string of Hangul characters. You still need the physical book to insert carriage returns at the end of paragraphs.

        It is okay to move a line or two of Korean characters from one page to the next, so I can drop each paragraph into Google Translate. Google Translate seems to do a better job than it did in November 2022. (Back then I shot pictures with the Galaxy S10+ instead of using the flatbed scanner). Today, Google might even handle Chinese / Japanese characters that sometimes come up in Korean text.
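A rough sketch of the "drop each paragraph into the translator" step, assuming the paragraph breaks have already been restored by hand and are marked with blank lines. The 4000-character limit and the function name `batch_paragraphs` are placeholders, not a documented limit of any translation site.

```python
def batch_paragraphs(text: str, limit: int = 4000) -> list[str]:
    """Group blank-line-separated paragraphs into batches of at most `limit` characters.

    A single paragraph longer than `limit` is kept whole as its own batch.
    """
    batches: list[str] = []
    current = ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue  # skip empty chunks left by extra blank lines
        candidate = para if not current else current + "\n\n" + para
        if current and len(candidate) > limit:
            # adding this paragraph would overflow, so close the batch
            batches.append(current)
            current = para
        else:
            current = candidate
    if current:
        batches.append(current)
    return batches
```

Because paragraphs are never split, text that runs over a page boundary stays together as long as the page files are concatenated before batching.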

        I am attaching a copy of what I have so far -- the preface / foreword and some of the "juicy stuff" (Microsoft Word 97). I don't think copyright laws apply here, as the book is VERY out of print, and will probably never be in my language, and I'm doing this for scholarly or research purposes. Here is the cover:

        [Image: fully-controlled-areas-book.jpg]

        I am also including the output that TESSERACT gave me for page 10 (Linux Mint 18.3 I think). I'm sorry it did so badly, but this is a difficult task and I'm sure the software authors worked very hard on it.

        To quote stj from thread #63446, post number 171, on this forum:

        "Your knowledge of Korean history is terrible"
        I simply MUST disagree. I collect and enjoy reading books by NK defectors. Riots at prison camps? 5,000 people killed in one incident in the early 1970s? I'll bet you never knew that! (Read the attached files.)


        [Image: page-121-massacre.jpg]


        (I think Shin Dong-Hyuk only mentioned 1,500 labor camp prisoners herded into a coal mine and then the opening was closed by blowing it up with dynamite.)

        I also enjoy books from innocent exonerated prisoners (USA/UK/ROK, some are even AUTOGRAPHED). And books written by Scientology cult defectors. And if you get the chance, you should see "1987: When the Day Comes". Great movie (DVD Region 3 or Blu-Ray Region A). The deaths of Park Jong-chul (waterboarding) and Lee Han-yeol (hit by tear gas canister) no doubt helped the democratization process. If South Korea had not become a democracy, I would probably not have a Samsung cell phone or multiple, spare Samsung 850/860/870 EVO SSDs lying around the house. And for an inside explanation of current events, please check out "Passcode to the Third Floor" for the "straight story" from the man who ran the NK Embassy in London, Copenhagen and possibly other cities.

        I would offer to e-mail the translation to anyone who wants it, but I will probably NOT be making consistent, steady progress on this project.
        Last edited by Hondaman; 07-27-2024, 02:05 AM.



          #5
          your idea of "history" is different from mine,
          i meant not just before the u.s. split the country but hundreds of years back.



            #6
            Samsung and Hyundai were built up under President Park Chung-hee, and now the leftist party, which grew out of student political movements such as those around the deaths of Park Jong-chul and Lee Han-yeol, is starting to ruin the legacy of economic prosperity left by Park Chung-hee. They are fixated on depriving the upper class of its income in the name of taxes. Now many CEOs in South Korea are leaving for the U.S.A., Australia, or Singapore, where conditions favor running a company.
            Last edited by chth96; 07-27-2024, 09:21 PM.

