Scraping inurls and / in Google

jp1

Client
Registered
23.01.2011
Messages
234
Thanks
2
Points
0
Hi :-)

I have this scraper for Google, and it works fine in normal cases. The thing is, I still haven't mastered regular expressions, and I would greatly appreciate knowing what value you'd insert when scraping things like:

inurl:domain.com/example

View attachment google.xml

The thing is, when I replace 'zennoposter' with anything containing characters like ':' or '/', it fails.

I believe these characters break the value, and the scraper gets confused because of its regular expression.

Greatly appreciate any help, thanks.
 

Stereomike

Client
Registered
29.03.2011
Messages
221
Thanks
30
Points
0
I didn't have a look at your xml, but here are some tips for your regex: if your search term changes, leave these form fields empty and define the 'comes before' and 'comes after' fields. I don't put anything by hand into the regex field; I just take care to test the search parameters thoroughly.
 

ziavra

Client
Registered
26.06.2009
Messages
116
Thanks
4
Points
0
You can use {-RegExp.Escape-|-expression-}, e.g.
Code:
{-RegExp.Escape-|-inurl:domain.com/example-}
or whatever you need to parse later.

I could not find a macros section in the English version of the help, so you can try reading the Russian version with the help of Google Translate.
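
To illustrate what escaping does in general, here is a minimal sketch in Python. It uses the standard `re.escape` function as a stand-in; that the ZennoPoster macro behaves the same way is an assumption, and the sample `text` string is made up for the demonstration.

```python
import re

# The search term from this thread; '.' is a regex metacharacter.
term = "inurl:domain.com/example"

# re.escape backslash-escapes metacharacters so the term is matched literally.
escaped = re.escape(term)

# Made-up sample text containing the real term and a near miss.
text = "q=inurl:domain.com/example q=inurl:domainXcom/example"

# Unescaped: '.' matches any character, so the near miss is found as well.
unescaped_hits = re.findall(term, text)

# Escaped: only the literal term is found.
escaped_hits = re.findall(escaped, text)

print(escaped)
print(unescaped_hits)
print(escaped_hits)
```

The exact set of characters that gets a backslash differs between regex engines and versions, so the escaped string may look noisier than expected while still matching the same literal text.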
 

jp1

Client
Registered
23.01.2011
Messages
234
Thanks
2
Points
0
Stereomike wrote:
I didn't have a look at your xml, but here are some tips for your regex: if your search term changes, leave these form fields empty and define the 'comes before' and 'comes after' fields. I don't put anything by hand into the regex field; I just take care to test the search parameters thoroughly.

Funnily enough, this kind of method has given me disastrous results. I haven't managed to get lookaheads and lookbehinds working in regex yet, so I still have to refine my regex, but whenever I tried that approach because it seemed easy, it was anything but. All I saw was a massive number of escapes appearing for little to no reason, which confused me and sent me to Expresso, where I learned a bit more about regexp anyway. What I mean is that classic bit-by-bit matching works as usual, but lookaheads and lookbehinds don't, for some reason. Leaving the field empty gives no match in a case I'm just about to post here, and putting in .?* or W\w has usually given me no results either.
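
For reference, a 'comes before' / 'comes after' pair can be pictured as a lookbehind plus a lookahead built from escaped literals, which would also explain the extra backslashes. The sketch below is in Python rather than ZennoPoster; the helper name `extract_between` and the sample HTML are hypothetical, and that the tool generates a pattern of this shape is an assumption.

```python
import re
from typing import List


def extract_between(text: str, before: str, after: str) -> List[str]:
    """Return every substring found between the two literal markers."""
    # Escape both markers so characters such as '.' or '?' are literal,
    # then wrap them in a lookbehind and a lookahead so they are not
    # part of the returned match.
    pattern = "(?<=" + re.escape(before) + ").*?(?=" + re.escape(after) + ")"
    return re.findall(pattern, text)


# Made-up sample markup with two links.
html = ('<a href="http://domain.com/example/a">A</a> '
        '<a href="http://domain.com/example/b">B</a>')

links = extract_between(html, 'href="', '"')
print(links)
```

The lazy `.*?` in the middle stops at the first occurrence of the 'comes after' marker, so each link is returned separately instead of one long match.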
 

jp1

Client
Registered
23.01.2011
Messages
234
Thanks
2
Points
0
ziavra wrote:
You can use {-RegExp.Escape-|-expression-}, e.g.
Code:
{-RegExp.Escape-|-inurl:domain.com/example-}
or whatever you need to parse later.

Does this turn the search criteria into a regular expression? I'm not sure I understand.

Thanks to both of you for your responses.
 

ziavra

Client
Registered
26.06.2009
Messages
116
Thanks
4
Points
0
If you want to generate a regular expression on the fly, you have to prepare it with the help of a special macro:
1st step: generate the regexp
2nd step: {-RegExp.Escape-|-regexp-}
3rd step: {-RegExp.RegExp-|-string-|-{-RegExp.Escape-|-regexp-}-|-all-}
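
The three steps above can be sketched outside ZennoPoster as follows. This is a Python analogy, not the macro implementation: the function name `find_urls_containing`, the sample `page` string, and the reading of the `all` argument as "return every match" are assumptions.

```python
import re
from typing import List


def find_urls_containing(page: str, term: str) -> List[str]:
    """Find every occurrence of a literal term plus the rest of its URL."""
    # Step 1: take the changing part of the pattern (the search term).
    dynamic_part = term
    # Step 2: escape it so metacharacters such as '.' are literal.
    escaped = re.escape(dynamic_part)
    # Step 3: apply the pattern to the string and collect all matches.
    # The hand-written \S* extends each match to the next whitespace.
    return re.findall(escaped + r"\S*", page)


# Made-up page text: two real hits and one near miss.
page = ("http://domain.com/example/a "
        "http://domainXcom/example/b "
        "http://domain.com/example/c")

matches = find_urls_containing(page, "domain.com/example")
print(matches)
```

Only the dynamic part is escaped here; the hand-written `\S*` stays a real regex, which is one way to combine a changing search term with a fixed pattern.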
 
