Scraper
Scraping functions.
ScraperObject
Bases: object
Work in progress. Do not use.
A scraper that parses sections and information from the retrieved files.
Attributes:

Name | Type | Description |
---|---|---|
headers | | Request headers to avoid bot detection for scraping |
Source code in sec_web_scraper/Scraper.py
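A minimal instantiation sketch. The class is marked as a work in progress, and a no-argument constructor is an assumption here rather than something this page guarantees:

```python
from sec_web_scraper.Scraper import ScraperObject

# Sketch only: assumes ScraperObject() can be constructed with no arguments.
scraper = ScraperObject()

# Request headers used to avoid bot detection while scraping.
print(scraper.headers)
```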
bs4_scraping_text(string_inp)
A BeautifulSoup wrapper function that parses the retrieved text document using the lxml parser.
Work in Progress.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
string_inp | str | An HTML txt file (the output of get_document_given_link) | required |
Returns:

Type | Description |
---|---|
BeautifulSoup | A BeautifulSoup object |
Source code in sec_web_scraper/Scraper.py
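A usage sketch, assuming bs4_scraping_text is an instance method and that ScraperObject() takes no constructor arguments. The input string here is a tiny stand-in for the text normally returned by get_document_given_link:

```python
from sec_web_scraper.Scraper import ScraperObject

scraper = ScraperObject()  # assumption: no-argument constructor

# A small stand-in for an EDGAR filing's raw text document.
sample_txt = "<SEC-DOCUMENT><DOCUMENT><TYPE>10-K</TYPE></DOCUMENT></SEC-DOCUMENT>"

# Parse the text with the lxml-backed BeautifulSoup wrapper.
soup = scraper.bs4_scraping_text(sample_txt)
print(type(soup))  # expected: <class 'bs4.BeautifulSoup'>
```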
create_selenium_browser_headless(sec_link='https://www.sec.gov/edgar/search/')
Creates a headless Selenium web browser for full-text search.
The goal of this method is to perform full-text search queries using SEC EDGAR's full-text search page. There used to be a public API for this, but it has been removed.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
sec_link | str | The initial link for the Selenium Web Browser | 'https://www.sec.gov/edgar/search/' |
Returns:

Type | Description |
---|---|
webdriver.chrome.webdriver.WebDriver | None |
Raises:

Type | Description |
---|---|
ConnectionError | requests couldn't get the link, so the Selenium browser was not created |
Source code in sec_web_scraper/Scraper.py
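A hedged sketch of creating the headless browser. It assumes ScraperObject() takes no arguments and that a ChromeDriver compatible with the local Chrome install is available:

```python
from sec_web_scraper.Scraper import ScraperObject

scraper = ScraperObject()  # assumption: no-argument constructor

try:
    # Opens EDGAR's full-text search page in a headless Chrome instance.
    driver = scraper.create_selenium_browser_headless()
except ConnectionError:
    # Raised when requests could not reach the link, so no browser was created.
    driver = None

if driver is not None:
    print(driver.title)
    driver.quit()  # always release the browser when done
```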
get_company_filings_given_cik(cik)
Find a company's submission history given its CIK.
This method looks up SEC EDGAR's official submission history for the company identified by the provided CIK. It then retrieves the JSON document containing this information.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
cik | str | A 10-digit unique string representing each public company. Can be retrieved using the Downloader's get_company_info. | required |
Returns:

Type | Description |
---|---|
dict | A dict representing the full submission history for a particular company (CIK). It contains keys such as entityType, tickers, exchanges, stateOfIncorporation, addresses, etc. An empty dict is returned if the CIK is invalid. |
Raises:

Type | Description |
---|---|
AssertionError | Length of CIKs must be 10. Please see this link for more information. |
Source code in sec_web_scraper/Scraper.py
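A sketch of fetching a submission history. The CIK shown (0000320193, Apple Inc.) is only an illustration, and the no-argument ScraperObject() constructor is an assumption:

```python
from sec_web_scraper.Scraper import ScraperObject

scraper = ScraperObject()  # assumption: no-argument constructor

# CIKs must be zero-padded to 10 characters; other lengths trip the AssertionError.
filings = scraper.get_company_filings_given_cik("0000320193")  # illustrative CIK

if filings:  # an empty dict means the CIK was invalid or not found
    print(filings.get("entityType"))
    print(filings.get("tickers"))
else:
    print("No submission history found for this CIK.")
```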
get_document_given_link(link)
Retrieve a document given its URL.
Gets the raw text of a document from its URL.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
link | str | A URL string for a .txt file found on a document's index page. For example, https://www.sec.gov/Archives/edgar/data/20/0000893220-96-000500.txt, which can be retrieved from the index page https://www.sec.gov/Archives/edgar/data/20/0000893220-96-000500-index.html | required |
Returns:

Type | Description |
---|---|
str | None if the document doesn't exist, otherwise a str containing the document's text |
Source code in sec_web_scraper/Scraper.py
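A short sketch using the example URL above; the no-argument constructor is again an assumption:

```python
from sec_web_scraper.Scraper import ScraperObject

scraper = ScraperObject()  # assumption: no-argument constructor

# The .txt URL comes from a filing's index page, as in the example above.
link = "https://www.sec.gov/Archives/edgar/data/20/0000893220-96-000500.txt"
txt = scraper.get_document_given_link(link)

if txt is None:
    print("Document does not exist.")
else:
    print(f"Retrieved {len(txt)} characters.")
```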
get_document_tags(txt)
Find all document tags inside a document.
The document tags in a document are very helpful for identifying section changes within it.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
txt | str | An HTML txt file (the output of get_document_given_link) | required |
Returns:

Type | Description |
---|---|
list[tuple] | A list of tuples of the form (start_tag_index, end_tag_index, tag_name). One example: (1385, 176135, ...) indicates a tag that started at index 1385 and ended at index 176135. |
Source code in sec_web_scraper/Scraper.py
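A sketch that chains get_document_given_link and get_document_tags, under the same no-argument constructor assumption:

```python
from sec_web_scraper.Scraper import ScraperObject

scraper = ScraperObject()  # assumption: no-argument constructor

txt = scraper.get_document_given_link(
    "https://www.sec.gov/Archives/edgar/data/20/0000893220-96-000500.txt"
)

if txt is not None:
    # Each tuple is (start_tag_index, end_tag_index, tag_name).
    for start, end, name in scraper.get_document_tags(txt):
        print(f"{name}: characters {start}-{end}")
```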
get_filings_by_query(query, driver)
Find the first 100 submissions that contain the specified query.
This method searches SEC EDGAR's official database and performs a full-text search on the provided query. Warning: it may sometimes return an empty DataFrame.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
query | str | A string query/keyword that you are looking for. | required |
driver | | A Selenium web driver created by create_selenium_browser_headless. | required |
Returns:

Type | Description |
---|---|
pd.DataFrame | A pandas DataFrame containing the columns ['File Type', 'CIK', 'Filename', 'Date Filed', 'Company Name', 'File Link']. See pandas.DataFrame to learn more about pandas DataFrames. |
Source code in sec_web_scraper/Scraper.py
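A sketch of a full-text search. The query string is purely illustrative, the browser requires a working ChromeDriver, and the no-argument constructor is an assumption:

```python
from sec_web_scraper.Scraper import ScraperObject

scraper = ScraperObject()  # assumption: no-argument constructor
driver = scraper.create_selenium_browser_headless()

# Full-text search over EDGAR; "climate risk" is only an example query.
df = scraper.get_filings_by_query("climate risk", driver)
driver.quit()

if df.empty:
    # Per the warning above, the search can legitimately come back empty.
    print("No results returned.")
else:
    print(df[["Company Name", "File Type", "Date Filed"]].head())
```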
iterate_over_filings(filings)
Extract the number of filings for a company.
This function iterates over the filings dictionary and tries to extract relevant information for the user. As of now, it only returns the total count of filings.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
filings | dict | A dictionary retrieved from get_company_filings_given_cik, which returns a nested dictionary; pass in its "filings" key. | required |
Returns:

Type | Description |
---|---|
int | A count of all the filings recorded for this company |
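
A closing sketch that ties the two filing helpers together; the CIK is illustrative and the no-argument constructor is an assumption:

```python
from sec_web_scraper.Scraper import ScraperObject

scraper = ScraperObject()  # assumption: no-argument constructor

submissions = scraper.get_company_filings_given_cik("0000320193")  # illustrative CIK
if submissions:
    # Pass the "filings" key of the nested dictionary, as described above.
    total = scraper.iterate_over_filings(submissions["filings"])
    print(f"Total filings on record: {total}")
```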