What's a recommended way to compare lists and remove duplicate items?

nycdude

Member
Joined
06.05.2018
Messages
57
Reaction score
1
Points
8
Hello all,

I'm scraping links and data, and sometimes I'll scrape the same links again, which I want to ignore. I keep a running .txt file as a history of all links to compare against.

I know I can loop through the lists to compare and then save or delete, but that seems like too much work. Is there a simpler, faster way?

Thanks.
 

kveldulv

Client
Joined
08.05.2011
Messages
45
Reaction score
16
Points
8
You can call grep from the command line, or within scripts.

Code:
Grep
# keep only the lines of uniq.txt that are NOT in blacklist.txt
# (-F fixed strings, -x whole-line match, -v invert match, -f patterns from file)
grep -Fxvf blacklist.txt uniq.txt >> uniq-new.txt
# the same thing written out long: lines of fileA that are not in fileB
grep -F -x -v -f fileB fileA >> fileC

Awk
# FNR==NR is true only while reading the first file (the filter list);
# store its lines as array keys, then test each line of the second file.
# keep the lines of data.txt that DO appear in filter.txt
awk 'FNR==NR {hash[$0]; next} $0 in hash' filter.txt data.txt > matching.txt
# keep the lines of data.txt that do NOT appear in filter.txt
awk 'FNR==NR {hash[$0]; next} !($0 in hash)' filter.txt data.txt > not_matching.txt

# remove all lines that appear in fileB from fileA
awk 'NR==FNR {a[$0]; next} !($0 in a)' fileB fileA
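
If both lists can be sorted, comm is another option and can be faster on large files. A minimal sketch, assuming history.txt holds the already-scraped links and scraped.txt the new batch (both file names are just placeholders); the process substitution needs bash:

Code:
# comm wants sorted input; -13 hides lines unique to history.txt (column 1)
# and lines common to both files (column 3), leaving only the new links
comm -13 <(sort history.txt) <(sort scraped.txt) > new-links.txt

# append the genuinely new links to the running history for the next run
cat new-links.txt >> history.txt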
 
Last edited:
  • Thanks
Reactions: spyder and nycdude

Grapidly

Newbie
Joined
16.10.2018
Messages
9
Reaction score
2
Points
3
nycdude said:
Hello all,

I'm scraping links and data, and sometimes I'll scrape the same links again, which I want to ignore. I keep a running .txt file as a history of all links to compare against.

I know I can loop through the lists to compare and then save or delete, but that seems like too much work. Is there a simpler, faster way?

Thanks.
Did you figure out the best way to do this? To save time, I would think you could just remove duplicates at the end of the project in a list-processing function.

Curious to know what you ended up using.
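
If deduplicating once at the end is enough, the classic awk one-liner does it in a single pass and keeps the original order; a minimal sketch, assuming all-links.txt is the combined output file (the name is just a placeholder):

Code:
# print each line only the first time it is seen, preserving order
awk '!seen[$0]++' all-links.txt > deduped.txt

# or, when order doesn't matter, sort and drop duplicates in one step
sort -u all-links.txt > deduped.txt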
 
