Wednesday, August 12, 2009

Grabbing links with lynx

Lynx is a terminal web browser, but we can also use it to grab all the links from a given webpage. Let's try it on linuxgazette.net:

$ lynx --dump --listonly linuxgazette.net

References

1. http://linuxgazette.net/lg.rss
2. http://linuxgazette.net/lg.rdf
3. http://linuxgazette.net/
4. http://linuxgazette.net/index.html
5. http://linuxgazette.net/
6. http://linuxgazette.net/faq/index.html
7. http://linuxgazette.net/lg_index.html
8. http://linuxgazette.net/mirrors.html
9. http://linuxgazette.net/mirrors.html
10. http://linuxgazette.net/search.html
11. http://linuxgazette.net/archives.html
12. http://linuxgazette.net/authors/index.html
13. http://lists.linuxgazette.net/mailman/listinfo/
14. http://linuxgazette.net/jobs.html
15. http://linuxgazette.net/contact.html
16. http://linuxgazette.net/
17. http://linuxgazette.net/165/index.html
18. http://linuxgazette.net/164/index.html
19. http://linuxgazette.net/163/index.html
20. http://linuxgazette.net/162/index.html
21. http://linuxgazette.net/archives.html
22. http://linuxgazette.net/current/TWDT165.pdb
23. http://linuxgazette.net/cgi-bin/pdb.cgi
24. http://linuxgazette.net/ftpfiles/lg-165.tar.gz
25. http://linuxgazette.net/ftpfiles/lg-base.tar.gz
26. http://linuxgazette.net/ftpfiles/lg-base-new.tar.gz
27. http://linuxgazette.net/ftpfiles/
28. http://linuxgazette.net/mirrors.html
29. http://linuxgazette.net/ftpfiles/README
30. http://linuxgazette.net/faq/general.html#ftp
31. http://linuxgazette.net/ftpfiles.txt
32. http://linuxgazette.net/mirrors.html#Arabic
33. http://linuxgazette.net/mirrors.html#French
34. http://linuxgazette.net/mirrors.html#Indonesian
35. http://linuxgazette.net/mirrors.html#Portuguese
36. http://linuxgazette.net/mirrors.html#Russian
37. http://linuxgazette.net/mirrors.html#Spanish
38. http://linuxgazette.net/mirrors.html#Thai
39. http://linuxgazette.net/lg.rss
40. http://linuxgazette.net/lg.rdf
41. http://linuxgazette.net/current/
42. http://fullhartsoftware.com/
43. http://linuxgazette.net/faq/author.html
44. http://linuxgazette.net/jobs.html
45. http://linuxgazette.net/faq/ask-the-gang.html
46. http://linuxgazette.net/faq/members-faq.html
47. http://www.tldp.org/
48. http://linuxgazette.net/
49. http://linuxgazette.net/copying.html
50. http://lwn.net/Articles/63383/
51. http://www.graphics-muse.com/
52. http://www.isc.tamu.edu/~lewing/linux/

The --dump switch dumps the rendered content of the webpage to standard output, and the --listonly switch restricts that output to the list of links. The list also includes "hidden links": links attached to images and buttons. If we want a plain list of URLs, we can additionally use grep to keep only the lines containing http, cut to drop the leading number (everything up to the first dot), and tr to delete the remaining spaces. We then sort the URLs by name (with sort) and eliminate the duplicates (with uniq). Note: uniq only removes adjacent duplicate lines, so you always have to sort the text before piping it to uniq:

$ lynx --dump --listonly linuxgazette.net | grep http | cut -f2- -d'.' | tr -d ' ' | sort | uniq

http://fullhartsoftware.com/
http://linuxgazette.net/
http://linuxgazette.net/162/index.html
http://linuxgazette.net/163/index.html
http://linuxgazette.net/164/index.html
http://linuxgazette.net/165/index.html
http://linuxgazette.net/archives.html
http://linuxgazette.net/authors/index.html
http://linuxgazette.net/cgi-bin/pdb.cgi
http://linuxgazette.net/contact.html
http://linuxgazette.net/copying.html
http://linuxgazette.net/current/
http://linuxgazette.net/current/TWDT165.pdb
http://linuxgazette.net/faq/ask-the-gang.html
http://linuxgazette.net/faq/author.html
http://linuxgazette.net/faq/general.html#ftp
http://linuxgazette.net/faq/index.html
http://linuxgazette.net/faq/members-faq.html
http://linuxgazette.net/ftpfiles.txt
http://linuxgazette.net/ftpfiles/
http://linuxgazette.net/ftpfiles/README
http://linuxgazette.net/ftpfiles/lg-165.tar.gz
http://linuxgazette.net/ftpfiles/lg-base-new.tar.gz
http://linuxgazette.net/ftpfiles/lg-base.tar.gz
http://linuxgazette.net/index.html
http://linuxgazette.net/jobs.html
http://linuxgazette.net/lg.rdf
http://linuxgazette.net/lg.rss
http://linuxgazette.net/lg_index.html
http://linuxgazette.net/mirrors.html
http://linuxgazette.net/mirrors.html#Arabic
http://linuxgazette.net/mirrors.html#French
http://linuxgazette.net/mirrors.html#Indonesian
http://linuxgazette.net/mirrors.html#Portuguese
http://linuxgazette.net/mirrors.html#Russian
http://linuxgazette.net/mirrors.html#Spanish
http://linuxgazette.net/mirrors.html#Thai
http://linuxgazette.net/search.html
http://lists.linuxgazette.net/mailman/listinfo/
http://lwn.net/Articles/63383/
http://www.graphics-muse.com/
http://www.isc.tamu.edu/~lewing/linux/
http://www.tldp.org/
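
To try the cleanup steps without fetching a page, here is a minimal sketch that feeds a few lines of the listing above through the same filters. The helper name extract_links and the sample variable are made up for illustration, and sort -u is swapped in for sort | uniq (it should give the same result here). The last two lines show why sorting has to come before uniq.

```shell
# A few lines of `lynx --dump --listonly` output, copied from the listing above.
sample='References

   1. http://linuxgazette.net/lg.rss
   2. http://linuxgazette.net/lg.rdf
   3. http://linuxgazette.net/
   4. http://linuxgazette.net/index.html
   5. http://linuxgazette.net/'

# extract_links (hypothetical helper): reads numbered lynx output on stdin
# and prints the unique URLs, one per line.
extract_links() {
    grep http |          # keep only the lines that contain a URL
    cut -d'.' -f2- |     # drop the number in front of the first dot
    tr -d ' ' |          # delete the leftover spaces
    sort -u              # sort and drop duplicates in one step
}

printf '%s\n' "$sample" | extract_links

# uniq only collapses adjacent duplicates:
printf 'b\na\nb\n' | uniq | wc -l          # 3 lines, the two b's are not adjacent
printf 'b\na\nb\n' | sort | uniq | wc -l   # 2 lines
```

With a live page, the same helper could be used as: lynx --dump --listonly linuxgazette.net | extract_links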

1 comment:

  1. just use --nonumbers as argument and it does the same thing as your grep
