7/25/2019 Python Quebrar Captch Python Ocr
http://slidepdf.com/reader/full/python-quebrar-captch-python-ocr 1/4
Python OCR… or how to break CAPTCHAs
http://blog.c22.cc/2010/10/12/python-ocr-or-how-to-break-captchas/
After my little stint writing the scr.im PoC script, a few people on Twitter reminded me of a
blog post that Andreas Riancho from Bonsai-sec wrote back in February. Andreas (the creator
of the excellent W3AF tool) wrote a short Python script to take a CAPTCHA image and perform
OCR on it. As a geek, this piqued my interest, but the one problem I had with it was that
the script relied on the pytesser Python library, which is Windows only!
There were a few issues with that.
1. It's Windows only and I prefer to avoid Windows unless there's no other choice
2. The project only ever reached version 0.0.1
3. The project has been abandoned since May 2007
So, not wanting to give up on something that looked fun, and also useful, I started a search
for an alternative. I quickly found that the pytesser Python library is a wrapper around
the tesseract-ocr project, and that there had been some work on another Python library
called Python-Tesseract that looks like it does the job (and isn't platform dependent).
After installing tesseract-ocr (apt-get install tesseract-ocr on Backtrack) I downloaded the
Python-tesseract files and modified the script from Andreas Riancho a little (the actual
changes to make things work are minimal). I also changed a few things to get the script to
decode scr.im captcha images reasonably accurately.
#!/usr/bin/python
# [PoC] tesseract OCR script - tuned for scr.im captcha
#
# Chris John Riley
# blog.c22.cc
# contact [AT] c22 [DOT] cc
# 12/10/2010
# Version: 1.0
#
# Changelog
# 0.1 > Initial version taken from Andreas Riancho's
#       example script (bonsai-sec.com)
# 1.0 > Altered to use Python-tesseract, tuned image
#       manipulation for scr.im specific captchas
#
from PIL import Image
im = Image.open('captcha.jpg')  # Your image here!
im = im.convert("RGBA")
pixdata = im.load()
# Make the letters bolder for easier recognition
for y in range(im.size[1]):
    for x in range(im.size[0]):
        if pixdata[x, y][0] < 90:
            pixdata[x, y] = (0, 0, 0, 255)
for y in range(im.size[1]):
    for x in range(im.size[0]):
        if pixdata[x, y][1] < 136:
            pixdata[x, y] = (0, 0, 0, 255)
for y in range(im.size[1]):
    for x in range(im.size[0]):
        if pixdata[x, y][2] > 0:
            pixdata[x, y] = (255, 255, 255, 255)
im.save("input-black.gif", "GIF")
# Make the image bigger (needed for OCR)
im_orig = Image.open('input-black.gif')
big = im_orig.resize((1000, 500), Image.NEAREST)
ext = ".tif"
big.save("input-NEAREST" + ext)
# Perform OCR using tesseract-ocr library
from tesseract import image_to_string
image = Image.open('input-NEAREST.tif')
print(image_to_string(image))
The majority of this code is preparation; the actual OCR job is performed in the final lines using
the image_to_string call. Simple, isn't it!
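To make the preparation step easier to follow, here is a minimal sketch of what the three colour passes in the script amount to for a single pixel. It is an illustration rather than part of the original script: the names filter_pixel, filter_pixels, BLACK and WHITE are made up for this example, and the thresholds (90 for red, 136 for green) are simply the values used in the script above. Because every pass looks at one pixel at a time, running the three loops in sequence should give the same result as this single function.

```python
# Illustrative only: a per-pixel view of the script's three colour passes.
# filter_pixel / filter_pixels are hypothetical helper names.
BLACK = (0, 0, 0, 255)
WHITE = (255, 255, 255, 255)


def filter_pixel(pixel, red_max=90, green_max=136):
    """Return the value one RGBA pixel ends up with after the three passes."""
    red, green, blue, _alpha = pixel
    # Passes 1 and 2: dark red or dark green values are treated as letter ink.
    if red < red_max or green < green_max:
        return BLACK
    # Pass 3: anything left with some blue in it is treated as background.
    if blue > 0:
        return WHITE
    # Pixels that match none of the rules are left untouched.
    return pixel


def filter_pixels(pixels):
    """Apply filter_pixel to a flat list of RGBA tuples."""
    return [filter_pixel(p) for p in pixels]


sample = [(50, 200, 200, 255), (200, 100, 200, 255), (200, 200, 10, 255)]
print(filter_pixels(sample))
```

With PIL, the same function could be applied to pixdata[x, y] inside a single pair of loops instead of three.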
The above script is tuned to the scr.im captcha image, as can be seen in the examples
below:
As you can see, after running it through some filters (thanks Andreas), the CAPTCHA becomes
a lot clearer, and significantly easier to OCR. Even in this case however, tesseract-ocr
sometimes returns the value as W6BHP instead of W68HP. Still, that's an easy mistake to
make, and I'm sure with more tweaking, the preparation could be perfected!
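One cheap way to cope with look-alike characters, short of perfecting the image preparation, might be to generate the alternative readings of an OCR result and try each in turn. The sketch below is only an idea and not something the original script does: the candidates function and the CONFUSABLE table are hypothetical, and the table contains just the B/8 pair because that is the only mix-up observed here.

```python
from itertools import product

# Hypothetical confusion table; only the B/8 mix-up was actually observed.
CONFUSABLE = {"B": "B8", "8": "8B"}


def candidates(text):
    """List every reading of text with confusable characters swapped."""
    # OCR output often carries trailing newlines or stray spaces.
    cleaned = "".join(text.split())
    # Each position offers either the character itself or its look-alikes.
    options = [CONFUSABLE.get(ch, ch) for ch in cleaned]
    return ["".join(combo) for combo in product(*options)]


print(candidates("W6BHP\n"))
```

The first entry is always the OCR result as returned, so a script could try that one first and fall back to the others only if it is rejected.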
So, next time somebody says "we implemented a CAPTCHA to prevent scripted attacks", you
can take it with a pinch of salt!
Links:
• [PoC] scr.im.tesseract.py script -> here
• Breaking Weak CAPTCHA in 26 Lines of Code -> bonsai-sec.com
• Pytesser -> here
• Tesseract-OCR -> here
• Python-Tesseract -> here