Software for caching web content and thereby reducing traffic costs

morpheus93

Client
Registration: 25.01.2012 · Messages: 1,039 · Thanks: 237 · Points: 63
I use luminati proxies and they charge about $15/GB. So if I create e.g. 1k gmx accounts, it would cost me about $45 if I load all the elements on their sign-up page and cache nothing (not to mention the shitty ads and "news" on their main site if you want to make it look more legit). I already tried disabling all the images and unnecessary scripts on the site, but the traffic per account still stays at about 1 MB.

I think that if I can cache all the elements that stay the same on every sign-up (while still reloading the fingerprinting/tracking ones that have to be fetched fresh every time), I could cut the cost to about 25%. But I don't know how to correctly cache the needed elements and reload the other ones in ZP...

Maybe there is software out there that works like AdGuard (an intercepting "proxy" server that filters advertisements out of your browsing traffic), but acting as an intermediary that caches most of the elements of specified websites and lets you (re-)load only the things that really need to be fetched, saving most of the bandwidth that would normally be used. I read something about Squid, but it seems too complicated to configure, so an easy-to-set-up solution would be preferred.
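To illustrate the kind of intermediary I have in mind (just a rough, untested sketch from my side, not a finished solution): mitmproxy can run a small caching addon that replays static assets from memory instead of pulling them through the paid proxy again. The file-extension list and the caching rule below are only my assumptions, and the browser has to trust the mitmproxy CA for HTTPS sites.

# cache_addon.py - rough sketch of an intercepting caching proxy.
# Run with:  mitmproxy -s cache_addon.py
# Static-looking assets are cached in memory after the first download and
# replayed locally on later sign-ups; sign-up POSTs and tracking/fingerprinting
# requests still go out through the proxy as usual.
from mitmproxy import http

STATIC_EXTENSIONS = (".js", ".css", ".png", ".jpg", ".gif", ".svg", ".woff", ".woff2")
_cache = {}  # url -> (status_code, content_type, body)

def _cacheable(flow: http.HTTPFlow) -> bool:
    # Only GET requests for static-looking files; everything else passes through.
    return (flow.request.method == "GET"
            and flow.request.path.split("?")[0].endswith(STATIC_EXTENSIONS))

def request(flow: http.HTTPFlow) -> None:
    # Answer from the local cache instead of going out through the paid proxy.
    cached = _cache.get(flow.request.pretty_url) if _cacheable(flow) else None
    if cached:
        status, content_type, body = cached
        flow.response = http.Response.make(status, body, {"Content-Type": content_type})

def response(flow: http.HTTPFlow) -> None:
    # Remember the decoded body and content type after the first real download.
    if _cacheable(flow) and flow.response.status_code == 200:
        _cache[flow.request.pretty_url] = (
            200,
            flow.response.headers.get("content-type", "application/octet-stream"),
            flow.response.content,
        )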

Thanks in advance for your help, guys!


Aronax

Client
Registration: 29.01.2015 · Messages: 201 · Thanks: 59 · Points: 28
I don't have the perfect solution for what you need, but maybe these ideas will help you:

1. Stop using the browser and use only HTTP requests, emulating only the ones you actually need (excluding the adverts, images, CSS etc.). This will drastically reduce bandwidth consumption, increase the speed of your bots and allow you to run the template in 100x more simultaneous threads on the same hardware config (there is a first rough sketch of this after the list). However, on sites with serious anti-spam protection you will also need to emulate the tracking/fingerprinting requests, and this can be a real pain, because at some point you might need to reverse engineer some complex/obfuscated JavaScript.

2. This is more of a hybrid solution: you keep using the browser but check which HTTP requests are not necessary (with Fiddler or in the Traffic tab of PM). You then block the domains in zenno (use a content policy block) from which unnecessary things are downloaded (for example: the domain the ads are served from, or the domain the images are served from, which is usually cloud storage, etc.). You can also block the domains/URLs from which various JavaScript files are sent to the page (useful when you don't want your browser to execute certain scripts and want to save processing resources; however, not executing the scripts related to tracking/fingerprinting will almost certainly raise a red flag). After you finish blocking all the unnecessary domains in your template, check how your task works on that specific website; some websites will throw captchas or block your activity if too many things are blocked. A second sketch of this idea, outside zenno, follows after the list.

3. If possible, load that particular page with a profile on your home IP (or with a proxy that has no bandwidth limitations), save the profile (make sure cache and cookies are saved too) and then use that profile with the proxy that has the bandwidth limitations. This doesn't really work on websites with powerful tracking methods. A related method would be to "copy" the page you load frequently and re-create it on a different domain, making sure all the identical elements are in your new page, then load that page on your own domain before going to the target domain.
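Here is the first sketch I mentioned at point 1, outside ZennoPoster, using Python's requests library. Every URL and form field below is an invented placeholder; in a real template you copy the exact requests you captured with Fiddler or in the Traffic tab.

# Sketch of the "http requests only" idea from point 1. All URLs and form
# fields are hypothetical placeholders - replace them with the requests you
# actually captured in Fiddler / ProjectMaker's Traffic tab.
import requests

PROXIES = {
    "http": "http://user:pass@proxy-host:8080",
    "https": "http://user:pass@proxy-host:8080",
}

session = requests.Session()
session.proxies.update(PROXIES)

# 1) Download only the sign-up page itself (HTML) - no images, ads or CSS.
page = session.get("https://example.com/signup", timeout=30)

# 2) In a real template you would parse hidden form fields / CSRF tokens out
#    of page.text here, and possibly replay the tracking requests the site
#    expects to see before it accepts a registration.

# 3) Send only the registration POST.
payload = {"email": "test@example.com", "password": "secret"}  # placeholders
result = session.post("https://example.com/signup/submit", data=payload, timeout=30)
print(result.status_code)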
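And the second sketch, for point 2: inside ZennoPoster you would do this with the content policy settings, but the same idea outside zenno can be shown with Selenium and the Chrome DevTools Protocol command Network.setBlockedURLs. The blocked patterns are examples only; take the real ones from the traffic you captured.

# Sketch of point 2 outside ZennoPoster: block ad/image/CDN requests at the
# browser level via the Chrome DevTools Protocol. The patterns are examples.
from selenium import webdriver

options = webdriver.ChromeOptions()
driver = webdriver.Chrome(options=options)

driver.execute_cdp_cmd("Network.enable", {})
driver.execute_cdp_cmd("Network.setBlockedURLs", {"urls": [
    "*doubleclick.net*",        # example ad domain
    "*googlesyndication.com*",  # example ad domain
    "*.jpg", "*.png", "*.gif",  # images, if the site still works without them
]})

driver.get("https://example.com/signup")
# ... fill in the form as usual, then:
driver.quit()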


I would also like to know if there is an "out of the box" solution that does this kind of caching, because none of the above ideas is perfect.
 

morpheus93

Client
Registration: 25.01.2012 · Messages: 1,039 · Thanks: 237 · Points: 63

Hi Aronax,

thank you very much for sharing your ideas and the detailed explanation.

Unfortunately I mostly work with browser-based bots and don't have much experience with HTTP-request ones. But I plan to go deeper into this area and hope it will solve some of my issues.

As for the hybrid solution, I have already tested this approach on a few projects with mixed results: some work without problems, while other sites instantly block the accounts or reject the sign-up.

So I would mainly move towards your 3rd approach. I thought about making a "temp" folder within the project where all "non-fingerprinting" elements are saved and then reusing them for every new account I create. But since I'm not very skilled with the different browser caching methods and how they handle specific elements (images, JS and so on), I considered using some "caching proxy" software to take over this part. I'm currently checking whether "WinGate" could work, or maybe a local Squid proxy to cache the static elements of the target site...
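Roughly how I picture the "warm the cache once, reuse it afterwards" part outside of ZP (only an illustration of the idea, not something I actually have running): a persistent Chrome profile with Selenium, filled once over the home connection and then reused through the paid proxy. The profile path and proxy address are placeholders.

# Sketch of "warm the cache once, reuse it later" with a persistent Chrome
# profile (outside ZennoPoster). Profile dir and proxy are placeholders.
from selenium import webdriver

PROFILE_DIR = r"C:\bots\signup_profile"  # Chrome's disk cache + cookies live here

def make_driver(proxy=None):
    options = webdriver.ChromeOptions()
    options.add_argument(f"--user-data-dir={PROFILE_DIR}")
    if proxy:
        options.add_argument(f"--proxy-server={proxy}")
    return webdriver.Chrome(options=options)

# Run 1: fill Chrome's own disk cache over the home connection (unmetered).
warm = make_driver()
warm.get("https://example.com/signup")
warm.quit()

# Later runs: reuse the same profile through the metered proxy; the static
# elements should now come from the local cache instead of the paid bandwidth.
bot = make_driver(proxy="http://proxy-host:8080")
bot.get("https://example.com/signup")
bot.quit()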
 

henry88

Client
Registration: 31.12.2018 · Messages: 65 · Thanks: 22 · Points: 8
Hello, I have the same problem. I'm trying to reduce residential-proxy traffic by caching the images and CSS locally, but I just can't get started at the moment. I wonder whether you have solved this problem yet? Or do you have a better idea?
 
