How to scrape images?

jp1

Client
Registered
23.01.2011
Messages
234
Thanks
2
Points
0
Does anyone have an example template of how to do this?

The problem with things like Facebook profiles or Google Images is that the images seem to be hidden behind body onload handlers, which is JavaScript I'm not familiar with. The alternative for a novice like me would be to click through every link to reach the .jpg, but I would really love to learn the short way: scraping directly off the Facebook or Google page, what-I-see-is-what-I-get style.

Helping hand, anyone? Thanks :-)
 

darkdiver

Administrator
Forum staff
Registered
13.01.2009
Messages
2 284
Thanks
2 728
Points
113
If you do not know the image path and can't download the image, try this:
Right-click the image.
Select "This is a captcha".
Select the recognition module CaptchaSaver.dll.
Set the name of the picture as the parameter for the recognition module.
 

Stereomike

Client
Registered
29.03.2011
Messages
221
Thanks
30
Points
0
It's all in the search result of, e.g., a Google image search. Send the DOM text to the regex builder and look for the image URLs. I don't remember what settings I used, but getting the images works reliably. I let Zenno write all the regex matches into a file (a list of image URLs) and pass it to wget (a free external command-line tool that can download images from a URL list you provide); you launch wget from the 'own script' branch.
 

jp1

Client
Registered
23.01.2011
Messages
234
Thanks
2
Points
0
Stereomike wrote: "It's all in the search result of, e.g., a Google image search. Send the DOM text to the regex builder and look for the image URLs. I don't remember what settings I used, but getting the images works reliably. I let Zenno write all the regex matches into a file (a list of image URLs) and pass it to wget (a free external command-line tool that can download images from a URL list you provide); you launch wget from the 'own script' branch."

I'm not sure I follow, sorry. When typing 'stuff' into Google Images, I try with the first image. I get
if I left-click on the image on the page. But when I look for that in the DOM or the source HTML, it's nowhere to be found.

Even if I reduce it to

I still can't find it. Clicking on all the images to open them in tabs would be very messy for a template, and it wouldn't work on Facebook, where the profile images lead to a profile and not to the picture. So I'm afraid I'm missing something here.
 

jp1

Client
Registered
23.01.2011
Messages
234
Thanks
2
Points
0
darkdiver wrote: "If you do not know the image path and can't download the image, try this:
Right-click the image.
Select "This is a captcha".
Select the recognition module CaptchaSaver.dll.
Set the name of the picture as the parameter for the recognition module."

That would be a good solution for me if it were one image or a few, but I'm afraid there wouldn't be any regexp to parse a number of images on varying pages.
 

Stereomike

Client
Registered
29.03.2011
Messages
221
Thanks
30
Points
0
To get the images, read the page source text (sorry, it wasn't the DOM), then parse it with the following macro:

In the macro builder, go to Regular Expression -> Parse with regular expression.
In the 'input string' field, put the id of the step that read the page source.
In the 'regular expression' field, put (?<=imgurl\=).*?(?=&amp)
In the '# of match' field, put 0;end (this tells the macro to fetch every match it finds, from the first (0) to the last).

About the expression:
It matches anything that has 'imgurl=' before it and '&amp' after it.

<A href="/imgres?imgurl=http:/ landscaping.savvy-cafe.com/wp-content/uploads/2007/03/irish-landscape.jpg&amp;imgrefurl=http://landscaping.savvy-cafe.com/category/landscaping-photos/&amp;usg=__AcFkrCS4tc73b1bLe0rshqUzpRI=&amp;h=375&amp;w=500&amp;sz=114&amp;hl=en&amp;start=7&amp;zoom=1&amp;tbnid=0g_5fmzm73ep4M:&amp;tbnh=98&amp;tbnw=130&amp;ei=QAW9TaKXGJLG8QPt9NTABg&amp;prev=/search%3Fq%3Dlandscape%26hl%3Den%26gbv%3D1%26tbm%3Disch&amp;itbs=1">

(I had to delete a '/' in the code above; otherwise it would get auto-formatted as a link.)

Here's the finished expression, by the way:
{-RegExp.RegExp-|-YOUR SOURCE HERE-|-(?<=imgurl\=).*?(?=&amp)-|-0;end-}
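
For anyone who wants to try the expression outside the macro builder, here is a minimal Python sketch of the same lookbehind/lookahead idea. The sample markup and the example.com URLs are made up for illustration, and Python's re module stands in for the regex engine the macro uses, so treat it as an approximation of what the macro does rather than its actual implementation.

```python
import re
from typing import List

# Same idea as the macro's expression: capture whatever sits between
# 'imgurl=' and the next '&amp'. The backslash before '=' is dropped here
# because '=' needs no escaping.
IMGURL_PATTERN = re.compile(r"(?<=imgurl=).*?(?=&amp)")


def extract_image_urls(page_source: str) -> List[str]:
    """Return every imgurl value found in the page source, in order."""
    return IMGURL_PATTERN.findall(page_source)


# Hypothetical page source with two result links (not real Google output).
sample_source = (
    '<A href="/imgres?imgurl=http://example.com/a.jpg&amp;'
    'imgrefurl=http://example.com/page1&amp;h=375">'
    '<A href="/imgres?imgurl=http://example.com/b.png&amp;'
    'imgrefurl=http://example.com/page2&amp;h=500">'
)

urls = extract_image_urls(sample_source)
print(urls)
```

Returning all matches at once plays the same role as the 0;end setting in the '# of match' field.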

Afterwards, write everything to a file.
Pass this file to wget with this command:
d:\yourApplicationPath\Wget\bin\wget.exe -i "d:\yourInputFilePath\googleImg.txt"

I had to put that in a .bat file because the 'binary path' field of the 'own program' object gets confused by the double quotes. So write the line above to a file named fetchimages.bat and start that file with the 'own program' object.

(You have to install wget, by the way.)

The images will download into your zenno folder.
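
The file-plus-wget step can also be sketched in Python. This is only an illustration under a few assumptions: the helper names, the temporary directory holding googleImg.txt, and the example.com URLs are all hypothetical, and wget is assumed to be installed and on the PATH. The only wget option used is -i, the same one as in the command above.

```python
import subprocess
import tempfile
from pathlib import Path
from typing import List


def write_url_list(urls: List[str], list_path: Path) -> None:
    """Write one URL per line, the format wget -i reads."""
    list_path.write_text("\n".join(urls) + "\n", encoding="utf-8")


def build_wget_command(list_path: Path, wget_exe: str = "wget") -> List[str]:
    """Build the argument list equivalent to: wget -i "<list file>"."""
    return [wget_exe, "-i", str(list_path)]


def download_all(list_path: Path, wget_exe: str = "wget") -> None:
    """Run wget; by default it saves files into the current directory."""
    subprocess.run(build_wget_command(list_path, wget_exe), check=True)


# Hypothetical URLs standing in for the regex matches.
urls = ["http://example.com/a.jpg", "http://example.com/b.png"]
list_file = Path(tempfile.mkdtemp()) / "googleImg.txt"
write_url_list(urls, list_file)
command = build_wget_command(list_file)
print(command)
# download_all(list_file)  # uncomment to actually fetch the files
```

Passing the arguments as a list means no manual quoting of the file path is needed in this sketch.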

Hope it gets you off the ground.
Once you understand how the expression builder and the logic branches work, it all makes sense and nothing seems impossible :-)
 

Hungry Bulldozer

Moderator
Registered
12.01.2011
Messages
3 441
Thanks
831
Points
113

pink

Client
Registered
21.04.2011
Messages
54
Thanks
3
Points
8

Hungry Bulldozer

Moderator
Registered
12.01.2011
Messages
3 441
Thanks
831
Points
113

Stereomike

Client
Registered
29.03.2011
Messages
221
Thanks
30
Points
0
Works here too. Nice site, bookmarked it :-)
 

jp1

Client
Registered
23.01.2011
Messages
234
Thanks
2
Points
0
Thanks, guys, that helped a lot.

I think I'll go with the CaptchaSaver example for now, because I need the images small anyway and it's simpler :-)
 

SeRf*X

Client
Registered
02.04.2011
Messages
35
Thanks
4
Points
8
I have a problem here. I was trying to download photos using the CaptchaSaver.dll method, but when I right-click a photo there is no "This is a captcha" option, so I can't select it.
Only after I click through and the photo is enlarged to a single image does the right-click menu show "This is a captcha". But when I run debug, it saves a .jpeg in which all I can see is a small black square.

I also tried wget, but on Facebook the photo URLs don't show up when I apply the regex to the DOM source. Only the smaller 150x150 images in the sidebar can be found; the profile pictures can't.

Has anyone tried this on Facebook yet?
 

username

Newbie
Registered
24.10.2011
Messages
13
Thanks
1
Points
3
I'm having a hard time with another image-based task. I need to grab data displayed in Flash by running OCR on it. How can I send a notification, with some sort of unique code, saying that I need a verification code for a specific account? I need every thread to get the right result, in case I buy Zenno one day. I can't do OCR in ZennoPoster, but there are plenty of OCR solutions, including free and online ones. I just can't figure out how to trigger another party to do the OCR and how to get the result back into Zenno. It's brilliant software, but it can cause headaches :-).

BR,

Elsa
 
