Scraper

Scraping functions.

ScraperObject

Bases: object

Work In Progress. Do Not Use

A scraper that will parse sections and information from the retrieved files.

Attributes:

    headers: Request headers to avoid bot detection for scraping.

Source code in sec_web_scraper/Scraper.py
class ScraperObject(object):
    """Work In Progress. Do Not Use

    A scraper that will parse sections and information from the retrieved files.

    Attributes:
        headers: Request headers to avoid bot detection for scraping
    """

    def __init__(self):
        self.headers = {
            "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:94.0) Gecko/20100101 Firefox",
            "Accept": "application/json, text/javascript, */*; q=0.01",
        }
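
The class is still marked as work in progress, but for illustration only, here is a minimal sketch of how its headers attribute might be reused with requests. The import path is an assumption based on the source path shown above.

# Illustration only: ScraperObject is marked "Work In Progress. Do Not Use".
# Assumes the class is importable from sec_web_scraper.Scraper.
import requests

from sec_web_scraper.Scraper import ScraperObject

scraper = ScraperObject()
# The headers attribute mimics a regular browser to reduce the chance of bot detection.
response = requests.get("https://www.sec.gov/edgar/search/", headers=scraper.headers)
print(response.status_code)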

bs4_scraping_text(string_inp)

A BeautifulSoup wrapper function for processing the text document we retrieved to utilize the lxml parser.

Work in Progress.

Parameters:

    string_inp (str): An HTML txt file (the output of get_document_given_link). Required.

Returns:

    BeautifulSoup: A BeautifulSoup object.

Source code in sec_web_scraper/Scraper.py
def bs4_scraping_text(string_inp: str) -> BeautifulSoup:
    """A BeautifulSoup wrapper function for processing the text document we retrieved
    to utilize the lxml parser.

    Work in Progress.

    Args:
        string_inp: An HTML txt file (the output of get_document_given_link)

    Returns:
        A BeautifulSoup object
    """

    text = BeautifulSoup(string_inp, "lxml")
    return text
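
For illustration, a minimal usage sketch (assuming the function is importable from sec_web_scraper.Scraper, as the source path suggests):

# Minimal sketch: parse a small HTML string with the lxml-backed wrapper.
from sec_web_scraper.Scraper import bs4_scraping_text

html = "<html><body><p>Sample filing text</p></body></html>"
soup = bs4_scraping_text(html)
print(soup.p.text)  # -> Sample filing text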

create_selenium_browser_headless(sec_link='https://www.sec.gov/edgar/search/')

Creates a Selenium Headless Web Browser for Full Text Search.

The goal of this method is to perform full text search queries by using SEC EDGAR's full text page. There used to be a public API for this but it has been removed.

Parameters:

    sec_link (str): The initial link for the Selenium Web Browser. Default: 'https://www.sec.gov/edgar/search/'.

Returns:

    webdriver.chrome.webdriver.WebDriver: A headless Chrome WebDriver instance.

Raises:

    ConnectionError: requests couldn't get your link, so the Selenium browser was not created.

Source code in sec_web_scraper/Scraper.py
def create_selenium_browser_headless(
    sec_link: str = 'https://www.sec.gov/edgar/search/',
) -> webdriver.chrome.webdriver.WebDriver:
    """Creates a Selenium Headless Web Browser for Full Text Search.

    The goal of this method is to perform full text search queries by using SEC EDGAR's full text page.
    There used to be a public API for this but it has been removed.

    Args:
        sec_link: The initial link for the Selenium Web Browser

    Returns:
        A headless Chrome WebDriver instance.

    Raises:
        ConnectionError: requests couldn't get your link so Selenium browser not created
    """
    r = requests.get(sec_link, headers=headers)
    if r.ok:
        print("Good")
        chrome_options = Options()
        chrome_options.add_argument("--headless=new")
        service = Service(ChromeDriverManager().install())

        driver = webdriver.Chrome(service=service, options=chrome_options)
        # driver = webdriver.Chrome(service=service)
        return driver
    else:
        raise ConnectionError('requests couldn\'t get your link so Selenium browser not created')
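
A possible usage sketch, assuming the function is importable from sec_web_scraper.Scraper and that Chrome is available locally:

from sec_web_scraper.Scraper import create_selenium_browser_headless

driver = create_selenium_browser_headless()  # defaults to the EDGAR full-text search page
try:
    driver.get("https://www.sec.gov/edgar/search/")
    print(driver.title)  # sanity check that the page loaded
finally:
    driver.quit()  # always release the headless browser process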

get_company_filings_given_cik(cik)

Find a company submission history given CIK.

This method will look at SEC EDGAR's official submission history for a company based on the provided CIK. It will then get the JSON document containing this information.

Parameters:

    cik (str): A 10-digit unique string representing each public company. Can be retrieved using Downloader's get_company_info. Required.

Returns:

    dict: A dict representing all the submission history for a particular company (CIK).
        The dict contains keys such as entityType, tickers, exchanges, stateOfIncorporation, addresses, etc.
        An empty dict is returned in the case that the CIK is invalid.

Raises:

    AssertionError: Length of CIK must be 10.

Please see this link for more information: https://data.sec.gov/submissions/CIK0000320193.json

Source code in sec_web_scraper/Scraper.py
def get_company_filings_given_cik(cik: str) -> dict:
    """Find a company submission history given CIK.

    This method will look at SEC EDGAR's official submission history for a company
    based on the provided CIK. It will then get the JSON document containing this information.

    Args:
        cik: A 10-digit unique string representing each public company. Can be retrieved using Downloader
            get_company_info.

    Returns:
        A dict representing all the submission history for a particular company (cik).
        The dict contains keys such as entityType, tickers, exchanges, stateOfIncorporation, addresses, etc.

        Empty dict is returned in the case that the CIK is invalid.
    Raises:
        AssertionError: Length of CIKs must be 10.

        Please see this link for more information: https://data.sec.gov/submissions/CIK0000320193.json
    """
    assert len(cik) == 10
    link = f"https://data.sec.gov/submissions/CIK{cik}.json"
    r = requests.get(link, headers=headers)
    if r.ok:
        company_filling_cik = json.loads(r.text)
        print(company_filling_cik["sicDescription"])

        return company_filling_cik
    else:
        return {}
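
A brief usage sketch, using Apple's CIK (0000320193) from the reference link above; the import path is assumed from the source path:

from sec_web_scraper.Scraper import get_company_filings_given_cik

filings = get_company_filings_given_cik("0000320193")  # CIK must be zero-padded to 10 characters
if filings:
    print(filings["entityType"], filings["tickers"])
else:
    print("Invalid CIK or request failed")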

get_document_given_link(link)

Retrieve a document given its URL.

Get the raw text of a document provided its URL.

Parameters:

    link (str): A URL string for a .txt file found on a document's index page. For example, https://www.sec.gov/Archives/edgar/data/20/0000893220-96-000500.txt, which can be retrieved from the index page: https://www.sec.gov/Archives/edgar/data/20/0000893220-96-000500-index.html. Required.

Returns:

    str: None if the document doesn't exist, or a str containing the document text.

Source code in sec_web_scraper/Scraper.py
def get_document_given_link(link: str) -> str:
    """Retrieve a document given it's URL

    Get the raw text of a document provided its URL

    Args:
        link: A url string for .txt file found on a documents's index page.
            For example, https://www.sec.gov/Archives/edgar/data/20/0000893220-96-000500.txt that can be retrieved
            from the index page: https://www.sec.gov/Archives/edgar/data/20/0000893220-96-000500-index.html
    Returns:
        None if document doesn't exist or a text str contaning the text
    """

    print(link)
    print(headers)
    r = requests.get(link, headers=headers)
    if r.ok:
        print("ok")
        return r.text
    else:
        return None
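
A short usage sketch, fetching the example filing URL from the docstring (import path assumed):

from sec_web_scraper.Scraper import get_document_given_link

url = "https://www.sec.gov/Archives/edgar/data/20/0000893220-96-000500.txt"
raw = get_document_given_link(url)
if raw is None:
    print("Document not found")
else:
    print(f"Retrieved {len(raw)} characters")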

get_document_tags(txt)

Find all document tags inside of a document.

The document tags in a document are very helpful for identifying changes of sections in a document.

Parameters:

    txt (str): An HTML txt file (the output of get_document_given_link). Required.

Returns:

    list[tuple]: A list of tuples of the form (start_tag_index, end_tag_index, tag_name).
        One example: (1385, 176135, <TYPE>10-K) means the 10-K section starts at index 1385 and
        ends at index 176135.

Source code in sec_web_scraper/Scraper.py
def get_document_tags(txt: str) -> list[tuple]:
    """Find all document tags inside of a document.

    The document tags in a document are very helpful for identifying changes of sections
    in a document.

    Args:
        txt: An HTML txt file (the output of get_document_given_link)

    Returns:
        A list of tuples of the form (start_tag_index, end_tag_index, tag_name).
        One example: (1385, 176135, <TYPE>10-K) means the 10-K section starts at index 1385 and
        ends at index 176135.
    """
    try:
        doc_start = re.compile(r"<DOCUMENT>")
        doc_end = re.compile(r"</DOCUMENT>")
        doc_type = re.compile(r"<TYPE>[^\n]+")

        beg_seq = []
        for y in doc_start.finditer(txt):
            beg_seq.append(y.end())
        end_seq = []
        for y in doc_end.finditer(txt):
            end_seq.append(y.start())

        type_list = []
        for y in doc_type.findall(txt):
            type_list.append(y)

        results = []
        for x, y, z in zip(beg_seq, end_seq, type_list):
            results.append((x, y, z))
            print(f'This is x, y, z: {x} , {y} , {z}')
        return results
    except TypeError as t:
        print(f'Error : {t}')
        return None
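
A usage sketch that chains get_document_given_link and get_document_tags to locate each embedded document (import path assumed):

from sec_web_scraper.Scraper import get_document_given_link, get_document_tags

raw = get_document_given_link("https://www.sec.gov/Archives/edgar/data/20/0000893220-96-000500.txt")
if raw:
    for start, end, tag_name in get_document_tags(raw):
        # tag_name looks like "<TYPE>10-K"; raw[start:end] is that document's body
        print(tag_name, end - start, "characters")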

get_filings_by_query(query, driver)

Find the first 100 submissions that contain the specified query. This method will look at SEC EDGAR's official database and perform a full-text search on the provided query. WARNING: sometimes it may return an empty DataFrame.

Parameters:

    query (str): A string query/keyword that you are looking for. Required.
    driver: A Selenium web driver created by create_selenium_browser_headless. Required.

Returns:

    pd.DataFrame: A pandas DataFrame containing the columns
        ['File Type', 'CIK', 'Filename', 'Date Filed', 'Company Name', 'File Link'].
        See pandas.DataFrame to learn more about Pandas DataFrames.
Source code in sec_web_scraper/Scraper.py
def get_filings_by_query(query: str, driver) -> pd.DataFrame:
    """Find the first 100 submissions that contain the specific query
     This method will look at the SEC EDGARs official database and perform a full-text-search on the provided query.
     #WARNING: sometimes it may return an empty DataFrame

    Args:
        query:A string query/keyword that you are looking for.
        driver: A selenium web driver created by create_selenium_browser_headless.

    Returns:
        A pandas DataFrame containing the columns:
           ['File Type','CIK','Filename','Date Filed', 'Company Name','File Link']
        See [pandas.DataFrame][] to learn more about Pandas DataFrames

    Raises:
        An Exception if DataFrame can't be generated after 10 iterations of selenium get.

    """
    # service = Service(ChromeDriverManager().install())
    # chrome_options = Options()
    # chrome_options.add_argument("--headless=new")
    # driver = webdriver.Chrome(service=service,options=chrome_options)
    list_return = []
    found = False
    i = 0
    while found is False and i != 15:
        driver.get(f'https://www.sec.gov/edgar/search/#/q={query}')
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        # search = soup.find_all('tr')
        search = soup.find(id='hits')
        new_search = search.find_all('tr')[1:]
        link_ = 'https://www.sec.gov/Archives/edgar/data/'
        column_names = ['File Type', 'CIK', 'Filename', 'Date Filed', 'Company Name', 'File Link']
        for ele in new_search:
            ele_attr = ele.find_all("td")
            # File name :
            file_type = ele_attr[0].text
            submission_num = ele_attr[0].a['data-adsh'].replace('-', '')
            file_name = ele_attr[0].a['data-file-name']
            cik = ele_attr[4].text.split()[1]
            file_link = f'{link_}/{cik}/{submission_num}/{file_name}'
            company_name = ele_attr[3].text
            date_filed = ele_attr[1].text
            list_return.append([file_type, cik, file_name, date_filed, company_name, file_link])
            # keep only file_name, file_type, cik, file_link,name
        driver.implicitly_wait(5)
        found = len(list_return) != 0
        i += 1

    # if i == 10:
    #    raise Exception("Tried 10 times, could not generate DataFrame")
    return pd.DataFrame(list_return, columns=column_names)
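
A usage sketch tying the headless browser and the full-text query together (the import path and the query string are illustrative):

from sec_web_scraper.Scraper import create_selenium_browser_headless, get_filings_by_query

driver = create_selenium_browser_headless()
try:
    df = get_filings_by_query("climate risk", driver)  # illustrative query
    if df.empty:
        print("No hits returned; full-text search can come back empty")
    else:
        print(df[["File Type", "Company Name", "File Link"]].head())
finally:
    driver.quit()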

iterate_over_filings(filings)

Extract number of filings for a company.

This function will iterate over the filings dictionary and try to extract relevant information for the user. As of now, it only returns the total count of filings.

Parameters:

    filings (dict): A dictionary retrieved from get_company_filings_given_cik, which returns a nested dictionary; one can pass in the "filings" key of the dictionary it returns. Required.

Returns:

    int: A count of all the filings recorded for this company.

Source code in sec_web_scraper/Scraper.py
def iterate_over_filings(filings: dict) -> int:
    """Extract number of filings for a company.

    This function will iterate over the filings dictionary and try to extract
    relevant information for the user. As of now, it only returns the total count
    of filings.

    Args:
        filings: A dictionary retrieved from get_company_filings_given_cik.
            get_company_filings_given_cik will return a nested dictionary.
            One can pass in the "filings" key of the dictionary returned by get_company_filings_given_cik.

    Returns:
        A count of all the filings recorded for this company
    """
    print(filings.keys())
    for k, v in filings["recent"].items():
        print(f"This is the key {k} and the item length: {len(v)} and type : {type(v)}")
    for j in filings["files"]:
        print(j)
        print(type(j))
    print("------")
    return filings["files"][0]["filingCount"]
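
A closing sketch that feeds the "filings" sub-dictionary from get_company_filings_given_cik into iterate_over_filings (the CIK and the import path are assumptions for illustration):

from sec_web_scraper.Scraper import get_company_filings_given_cik, iterate_over_filings

submissions = get_company_filings_given_cik("0000320193")  # Apple's CIK, zero-padded to 10 characters
if submissions:
    total = iterate_over_filings(submissions["filings"])
    print(f"Filings recorded: {total}")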