Most elegant way to parse this kind of page?

jp1

Client | Registered: 23.01.2011 | Posts: 234 | Thanks: 2 | Points: 0

Because this website has this in the DOM:

href="main.php?g2_itemId=3809"><IMG alt

I happily parse

main.php?g2_itemId=3809

with

(?<=href\=\").*(?=\"\>\<IMG alt)

which leads me to:

http://mywebsite.com/page/{-RegExp.RegExp-|-{-FieldData.FieldData-|-scrape run-|-Dom text-}-|-(?<=href\=\").*(?=\"\>\<IMG alt)-|-1;10-}

This way I should have all the links. Except, of course, I don't.
http://mywebsite.com/page/ is only prepended to the first match. The other matches are left on their own, as main.php?g2_itemId=xxxx

I'd rather have a more elegant solution than the only one I can think of right now, which is:

http://mywebsite.com/page/{-RegExp.RegExp-|-{-FieldData.FieldData-|-scrape run-|-Dom text-}-|-(?<=href\=\").*(?=\"\>\<IMG alt)-|-1-}
http://mywebsite.com/page/{-RegExp.RegExp-|-{-FieldData.FieldData-|-scrape run-|-Dom text-}-|-(?<=href\=\").*(?=\"\>\<IMG alt)-|-2-}
http://mywebsite.com/page/{-RegExp.RegExp-|-{-FieldData.FieldData-|-scrape run-|-Dom text-}-|-(?<=href\=\").*(?=\"\>\<IMG alt)-|-3-}

and adding all of that into the macro. What if I have a hundred links? This is clearly not practical. What would you do? Thanks.
 

bigcajones

Client | Registered: 09.02.2011 | Posts: 1,216 | Thanks: 681 | Points: 113

jp1

bigcajones wrote:
http://mywebsite.com/page/{-RegExp.RegExp-|-{-FieldData.FieldData-|-scrape run-|-Dom text-}-|-(?<=href\=\").*(?=\"\>\<IMG alt)-|-all-} ... doesn't this work?

Here's an example with Google Image search.

View attachment 390

Unfortunately this does not cover my aim, because something like http://organic-food-blog.com/wp-content/uploads/organic-birds.jpg can be found complete, in one piece, inside your DOM text. However, when scraping 10 or 100 links from a forum or a blog, you will find that most of the time (always, as far as my experience goes) the DOM only provides the final part of the link. In forums, for example, this will be something like showthread?=234, while the beginning of the link sits in some common base address that all the showthreads follow from, without showing fully in the DOM. The problem, as I said above, is that I'd have to write long-ass macros to cover each link, and I'm sure there's got to be a shorter and more elegant way of doing this.
 

lydra

Client | Registered: 16.05.2011 | Posts: 17 | Thanks: 1 | Points: 0

In PHP, I would store the base address in a variable ("http://mywebsite.com/page/"). Then I would save all the relative links from the DOM (those without http://). Then I would concatenate the two to create absolute URLs. Having used ZennoPoster for all of one day, I can't actually help you implement that there.
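
A rough illustration of the approach described above (base address in a variable, relative links pulled from the DOM, one concatenation per link). This is only a sketch, written in Python rather than PHP purely for illustration; the sample DOM_TEXT, the function name absolute_links, and the non-greedy variant of the regex are assumptions, not something taken from the thread.

```python
import re

# Hypothetical sample of the "Dom text"; the real page will differ.
DOM_TEXT = '''
<a href="main.php?g2_itemId=3809"><IMG alt="a" src="a.jpg"></a>
<a href="main.php?g2_itemId=3810"><IMG alt="b" src="b.jpg"></a>
<a href="main.php?g2_itemId=3811"><IMG alt="c" src="c.jpg"></a>
'''

BASE = "http://mywebsite.com/page/"

# Same lookbehind/lookahead idea as the regex in the first post, but with a
# non-greedy .*? (an assumption) so two image links on one line stay separate.
LINK_RE = re.compile(r'(?<=href=").*?(?="><IMG alt)')


def absolute_links(dom_text, base=BASE):
    """Return base + match for every relative link found in dom_text."""
    return [base + rel for rel in LINK_RE.findall(dom_text)]


if __name__ == "__main__":
    for url in absolute_links(DOM_TEXT):
        print(url)
```

The prefix is applied inside the loop, once per match, which is exactly what the single macro line does not do.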
 

jp1

Well, I've been fumbling around with the built-in JS and regex in the macro builder, and nowhere can I find a concatenate function. There is an option to add one's own code, so while I fumble around and actually find out how to add it, could you be so kind as to tell me how concatenation works in PHP? Perhaps I can make a field value out of it and ram it into the results I've already got.

They say PHP is best for scraping, and that ZP's only strength might be the proxies, but I'm no expert.
 

gemini

Client | Registered: 10.03.2011 | Posts: 160 | Thanks: 31 | Points: 28

A concat function is just about joining strings. You don't need a joining function in Zenno; you can freely join arbitrary strings there.
What you need instead is a URL normalization function that takes a base URL and a relative URL and merges them correctly, to avoid garbage when you have, e.g.,
http://yahoo.com/sub/ and ../sub2/
A simple concat will give you http://yahoo.com/sub/../sub2/, while you need http://yahoo.com/sub2/, and so on.
There is no such function in Zenno, of course. It might be possible to write one in JS, but it is far simpler to use an external PHP script.
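
To make the difference between plain concatenation and URL normalization concrete, here is a minimal sketch. It uses Python's standard urllib.parse.urljoin as a stand-in for the "URL normalization function" mentioned above; the choice of Python instead of an external PHP script is an assumption made only for illustration.

```python
from urllib.parse import urljoin

base = "http://yahoo.com/sub/"
relative = "../sub2/"

# Plain string joining keeps the "../" segment in the result.
naive = base + relative

# urljoin resolves the relative reference against the base URL.
normalized = urljoin(base, relative)

print(naive)       # http://yahoo.com/sub/../sub2/
print(normalized)  # http://yahoo.com/sub2/
```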
 

jp1

Hi there. Once again, I'm pretty much new at this, a bit like an annoying mosquito, but hell, here goes.

Are we talking about the PHP concatenation operator described on w3schools.org?

<?php
$txt1="http://myforum.com/";
$txt2="showthread.php?=4235";
echo $txt1 . "" . $txt2;
?>

If so, how does this differ from the problem I'm having in ZP? I could use field data:

$txt1="http://myforum.com/";
$txt2="{field data of the appended file string with all my showthread.php links and magic numbers}";
echo $txt1 . "" . $txt2;

But even if I selected the "append all field results" strings, I'd just have

hxxp://myforum.com/showthread.php?=4235showthread.php?=4236showthread.php?=4237

The only difference from ZP's macro would be that they'd all be on the same line, whereas in my ZP example I was getting them as separate strings, with myforum.com only prefixed to the first one.
 

gemini

You misunderstood me.
You can join strings in Zenno, no problem.
With PHP you can do more advanced URL joining that checks for the different types of relative URLs.
There are two types of relative URL:
showthread.php?sth, or ../../showthread.php?sth - relative to the current directory
/path/showthread.php - relative to the root of the domain.

However, if you are making a template for one specific site, you might not need that, since you adjust your macro to the specific type of URL that site uses.
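
A small sketch of how the relative URL types listed above resolve against the page they were scraped from. The PAGE address is hypothetical, and Python's urljoin is used here only as an illustration of what a more advanced joining script would have to do.

```python
from urllib.parse import urljoin

# Hypothetical page the links were scraped from
PAGE = "http://myforum.com/forum/sub/index.php"

examples = [
    "showthread.php?sth",        # relative to the current directory
    "../../showthread.php?sth",  # current directory, then two levels up
    "/path/showthread.php",      # relative to the root of the domain
]

# Resolve each relative form against the page URL
resolved = [urljoin(PAGE, rel) for rel in examples]

for rel, url in zip(examples, resolved):
    print(rel, "->", url)
```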
 

jp1

Sorry, I meant that, from what I know, all PHP offers is the simple concatenation I described above. How do I make it more complex?

Is it as simple as setting this up on my local domain:

<?php
$txt1="get field data of the domain name";
$txt2="get field data of all the showthreads which I have just found in a previous step";
echo $txt1 . "" . $txt2;
?>

But how do I get all the results, all the showthreads, into PHP?

I want

hxxp://myforum.com/showthread.php?=4235
hxxp://myforum.com/showthread.php?=4236
hxxp://myforum.com/showthread.php?=4237

but this php code will give me:

hxxp://myforum.com/showthread.php?=4235showthread.php?=4236showthread.php?=4237
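
One way to get one absolute URL per line is to prefix each scraped item separately instead of prefixing the whole blob once. A minimal sketch in Python (not PHP, purely for illustration), assuming the matches were saved one per line; the function name prefix_each and the sample data are hypothetical.

```python
BASE = "http://myforum.com/"

# Assumption: the scraped matches are separated by newlines.
scraped = "showthread.php?=4235\nshowthread.php?=4236\nshowthread.php?=4237"


def prefix_each(base, blob):
    """Prefix every non-empty line of blob with base; return newline-joined text."""
    lines = [line.strip() for line in blob.splitlines()]
    return "\n".join(base + line for line in lines if line)


print(prefix_each(BASE, scraped))
```

Concatenating BASE with the whole blob in one go would reproduce the single run-together line shown above, because the prefix is then applied only once.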
 

bigcajones

Here's another try at it, jp1. This time I got it to join the strings and save them to a text file. I'm trying to find a regex to remove the duplicate URLs from the file, but I'm not having much luck. Even with RegexBuddy, when you put the regular expression it gives you into ZP, the output is not as predicted. I'll let you know if I find a workaround, or maybe someone here has experience with it.

View attachment forumimg.xml
 

jp1

Thanks. Nice, I didn't think about counters!

It might do for now, but it is still a very big workaround, and it might even be time-intensive if there are hundreds of URLs, and so hundreds of counts.

Deduping in ZP is easier to do with the special macro they have for it than with regex:

{-String.RemoveDuplicates-|-{-FieldData.FieldData-|-dedupe-|-readfile-}-|-{-String.Enter-}-}
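
For anyone doing the same step outside ZP, a line-based dedupe that does not depend on duplicates being adjacent could look like the sketch below. It is a Python illustration only; I am not claiming it matches how the ZP macro works internally, and the function name remove_duplicates is my own.

```python
def remove_duplicates(text):
    """Drop repeated lines, keeping the first occurrence and the original order."""
    lines = text.splitlines()
    # dict.fromkeys keeps insertion order (Python 3.7+), so it acts as an ordered set
    return "\n".join(dict.fromkeys(lines))


sample = "a.php\nb.php\na.php\nc.php\nb.php"
print(remove_duplicates(sample))
```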

I'm still looking for a solution in PHP, and I'm reading through a manual for it because it's about time I did so anyway. In PHP the concatenation operator is the simplest thing in the world:

print("hello" . " world");

gives

hello world

But right now it seems a PHP script (or any other program) can only feed data into ZP, not the other way round, since PHP is unlikely to understand field data, so I'm looking for a workaround. Perhaps I can have ZP write the results to a file and get PHP to read that file. I'll see if that's possible.
 

bigcajones

The way I wrote it, it doesn't matter how many matches there are. When it outputs an empty string, the program stops.
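
The counter idea can be sketched roughly as follows. This is only my reading of the description, not the contents of the attached template: nth_match is a hypothetical stand-in for the regex macro with a match number, the 0-based index is an assumption, and Python is used purely for illustration.

```python
import re

# Non-greedy variant of the thread's regex (an assumption)
LINK_RE = re.compile(r'(?<=href=").*?(?="><IMG alt)')


def nth_match(dom_text, n):
    """Return the n-th match (0-based), or "" when there is no such match."""
    matches = LINK_RE.findall(dom_text)
    return matches[n] if n < len(matches) else ""


def collect(dom_text, base):
    """Ask for match 0, 1, 2, ... and stop at the first empty result."""
    urls = []
    counter = 0
    while True:
        rel = nth_match(dom_text, counter)
        if rel == "":  # empty string means no more matches
            break
        urls.append(base + rel)
        counter += 1
    return urls


# Hypothetical sample input
dom = '<a href="main.php?g2_itemId=1"><IMG alt="x">\n<a href="main.php?g2_itemId=2"><IMG alt="y">'
print(collect(dom, "http://mywebsite.com/page/"))
```

The loop needs no knowledge of how many links the page has, which is why the number of matches does not matter.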
 

jp1

Yeah, thanks, it works fine.

I don't really think extra steps make for a more resource-intensive template; that's just silly. At least no more than calling a PHP script on my WampServer.

After reading up on PHP, it is indeed very easy to get the results into a PHP variable. Just not directly as field data, but indirectly, probably via a file. I'm currently figuring this out because I have to learn PHP anyway.

Did the string remove-duplicates macro work well for you?
 

bigcajones

Nah, didn't have much luck with it. That worked well for lines that came one right after the other, but it didn't match lines that were further down. I did get one to work, though. I had to set up a loop to check whether there were any matches. If there were, the line I was trying to match got deleted and the next line was parsed. Eventually there would be no matches, so that line was written to a file. I've got it if you want it.
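
One possible reading of that loop, sketched in Python for illustration: take each line in turn, check whether the same line still appears further down, drop it if so, and write it out once no match is left. The function name is hypothetical and the actual template may differ.

```python
def dedupe_by_lookahead(lines):
    """Keep a line only if the same line does not appear again further down."""
    kept = []
    remaining = list(lines)
    while remaining:
        current = remaining.pop(0)
        if current in remaining:
            continue          # a later copy exists: drop this one, move on
        kept.append(current)  # no match left: this line gets written out
    return kept


print(dedupe_by_lookahead(["a", "b", "a", "c", "b"]))
```

Note that this variant keeps the last occurrence of each line rather than the first, so the output order can differ from a first-seen dedupe.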
 

jp1


Yes please, it's always good to expand one's regex library :cool:


However, the macro I posted seems to work for me. You have to put a file get block before it, like this:

View attachment duplicates.xml

But if it still doesn't work, give me a shout. I'm curious which situations it might not work in.
 

bigcajones

I like the one you did. Couldn't figure that one out so thanks. I didn't think of putting the String.Enter inside.
 

jp1

OK, good, it works :-)

Could you please give me the regex for it? I might need it later on with other languages.

I believe you said you had the regex for removing duplicates?

Thanks :-)
 

bigcajones
