Language: Python

Package: BeautifulSoup

Recently, a client asked us to collect social IDs from their public page. I had been using Ruby as my default language for a few years, so I took a few days to learn Python and write a simple script for the task.

from bs4 import BeautifulSoup as soup  # HTML data structure
from urllib.request import urlopen as uReq # Web client
import re

def getLinks(url):
    # Fetch the page and parse the HTML
    uClient = uReq(url)
    page_soup = soup(uClient.read(), "html.parser")
    uClient.close()

    # Collect every <a> tag whose href is an absolute http(s) URL
    links = []
    full_links = page_soup.find_all('a', attrs={'href': re.compile("^http(s)?://")})
    for link in full_links:
        links.append(link.get('href'))

    # Deduplicate with set, then convert back to a list
    return list(set(links))

print(getLinks("https://blog.icmoc.com"))
# ['https://twitter.com/ekohe', 'https://www.facebook.com/ekohe.co', 'https://www.linkedin.com/company/ekohe']
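
Since what the client actually wants are the social IDs rather than the full URLs, a natural next step is to pull the handle out of each profile link. The sketch below is my own rough follow-up on top of getLinks (it reuses the re module imported above); the getSocialIds name and the domain pattern are assumptions based on the example output, not part of the original script.

SOCIAL_PATTERN = re.compile(
    r"^https?://(?:www\.)?"
    r"(twitter\.com|facebook\.com|linkedin\.com/company)/([\w.\-]+)"
)

def getSocialIds(links):
    # Map each supported platform to the handle found in its profile URL
    ids = {}
    for link in links:
        match = SOCIAL_PATTERN.match(link)
        if match:
            ids[match.group(1)] = match.group(2)
    return ids

print(getSocialIds(getLinks("https://blog.icmoc.com")))
# e.g. {'twitter.com': 'ekohe', 'facebook.com': 'ekohe.co', 'linkedin.com/company': 'ekohe'}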

Notes:

  • BeautifulSoup - parses the HTML into a searchable data structure
  • uReq (urlopen) - the web client that downloads the page
  • re - Python's regular expression module, used to match absolute http(s) URLs
  • set - built-in type that removes duplicate links (see the small example after these notes)
  • list - built-in that converts the set back to a list (the Python equivalent of a Ruby Array)

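As a quick illustration of the last two notes, this is all the deduplication step does:

links = ["https://twitter.com/ekohe", "https://twitter.com/ekohe"]
print(list(set(links)))  # ['https://twitter.com/ekohe']
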
Overall, I think it's very easy to implement this in Python.