This is the metadata file for the scraped hyperlink network data of websites belonging to organizations listed as working within the NYC borough of Staten Island in the 2017 NYC Stewardship Mapping and Assessment Project (STEW-MAP) (USDA 2017).   

1. Outline.
   This metadata file contains the following sections:
   - description of data
   - key words
   - date of last update 
   - data dictionary 
   - references

2. Description of data.
The data represent web-scraped hyperlinks among a purposefully selected, geographically bounded set of environmental stewardship organizations. Organizations were selected from the 2017 NYC Stewardship Mapping and Assessment Project (STEW-MAP) (USDA 2017), a publicly available data set about environmental stewardship organizations working in New York City, USA (N = 719). STEW-MAP includes geo-spatial information about where organizations work, what they do, and who they work with.

Since hyperlink web-scraping can be computationally slow (Issuecrawler 2021), we took a subset of the data for analysis. Using the ‘select by location’ feature in ArcGIS version 10.6.1, we selected all organizations that intersected (i.e., worked entirely within or overlapped) the NYC borough of Staten Island for a geographically bounded sample (n = 111). We applied a negative 250 m buffer to omit organizations that would have only been included due to a small overlap, most likely due to inaccuracies in the spatial data.

The STEW-MAP data included organizations’ websites. We verified all websites using Google web searches and looked up alternative sites for non-working, redirected, or missing URLs. This resulted in 86 working websites; however, eight of these could not be scraped and were removed, leaving a final sample of n = 78.

We used the snaWeb package (version 1.0.1, Stockton 2020) in the R computational language environment (R Core Team 2020) to collect hyperlink network data. The snaWeb package is a web-scraper with a set of functions to retrieve URLs from specified websites and build hyperlink networks. snaWeb works as follows: 

snaWeb searches a site to any specified search depth, which can be thought of as the fewest number of clicks a user must make within a website to arrive at a given sub-page of that website. The search URL itself is at depth zero, pages one click away are at depth one, and each further click adds one. We scraped the 78 websites between 09 and 17 June 2020 to a maximum search depth of ten, expecting most, if not all, sites to have a maximum depth below ten (see the appendix of Sayles et al., in review, for additional specification). For each site, data were collected at each incremental depth and compiled.
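
The notion of search depth described above can be illustrated with a breadth-first search over a small link graph. This is a minimal sketch, not snaWeb's implementation: the function name, the toy site, and its page names are hypothetical, and the search URL is placed at depth zero as in the data dictionary.

```python
from collections import deque

def search_depths(links, start, max_depth=10):
    """Return {page: depth}, where depth is the fewest clicks from start."""
    depths = {start: 0}              # the search URL itself is at depth zero
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if depths[page] >= max_depth:
            continue                 # do not follow links past the depth limit
        for target in links.get(page, []):
            if target not in depths:  # first visit is the shortest route
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths

# Hypothetical site structure: page -> pages it links to
toy_site = {
    "home": ["about", "programs"],
    "about": ["staff"],
    "programs": ["staff", "events"],
    "events": ["archive"],
}

depths = search_depths(toy_site, "home")
```

In this toy site, "staff" is reachable through two routes but is recorded once, at the shallower depth at which it is first found.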

snaWeb requests each site from its web server and retrieves basic information: the site’s name, whether it is working (e.g., URL status code 200 vs. 404 or other errors), whether the URL is internal or external to the scraped site (i.e., having the same or different root than the searched site), and any redirected URLs. The ability to find redirects is important for network studies: two sites with hyperlinks to a common third site will both be connected to this third site even if one of them uses an outdated URL, which is a frequent issue on the web (Dellavalle et al. 2003, Duda and Camp 2008, Hennessey and Ge 2013, Jones et al. 2016, Hondula 2020).
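
The effect of redirects on network structure can be sketched as follows. This is an illustration only, not snaWeb code: the redirect table, the site names, and the helper function are hypothetical, and in practice the redirect targets would come from the web server's responses.

```python
def resolve(url, redirects, max_hops=10):
    """Follow a redirect table until a final URL is reached."""
    seen = set()
    while url in redirects and url not in seen and len(seen) < max_hops:
        seen.add(url)                # guards against redirect loops
        url = redirects[url]
    return url

# Hypothetical redirect table: outdated URL -> current URL
redirects = {"http://old.example.org": "https://www.example.org"}

# Two hypothetical sites link to the same organization, one via an outdated URL
raw_edges = [
    ("site-a.example", "http://old.example.org"),
    ("site-b.example", "https://www.example.org"),
]

resolved_edges = [(src, resolve(dst, redirects)) for src, dst in raw_edges]
targets = {dst for _, dst in resolved_edges}
```

Without resolving the redirect, the two links would point to two different nodes; after resolution, both sites connect to the same node.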

snaWeb facilitates searching organizational sub-programs. Many large environmental organizations, such as government agencies or large non-profits, consist of sub-programs that, in many ways, function more like independent programs than a single entity (Newig et al. 2010, Sayles and Baggio 2017, Sayles 2018). For the purpose of understanding environmental governance systems, it often makes sense to treat these sub-programs as different groups. For example, when looking at stakeholders in the Northeastern United States, it is logical to include the U.S. Environmental Protection Agency (EPA) Region One, which works in the region, but not Region Ten, which operates on the other side of the continent. Both regions, however, have the same root URL (www.epa.gov). snaWeb uses the full URL that is entered for the search (e.g., www.epa.gov/aboutepa/epa-region-1-new-england) as the search base. Sub-pages of this base are classified as internal sub-pages and scraped. Pages at the same level or higher (e.g., www.epa.gov/aboutepa/epa-region-10-pacific-northwest, or simply www.epa.gov) are classified as family pages: they share the same root, so they are technically internal, but they are not sub-pages and are not scraped. This search behavior attempts to represent the structure and reality of networked environmental governance more accurately than ignoring sub-strings or recoding them to the root URL, as has been done elsewhere (e.g., Kreakie et al. 2016). In the Staten Island STEW-MAP data, eight organizations self-identified by sub-pages.
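
The sub-page, family, and external distinction described above can be sketched with the Python standard library. This is a hedged illustration, not snaWeb's actual logic: the function names and labels are hypothetical, "root" is assumed here to mean the host name, and the "/contact" path in the usage example is invented.

```python
from urllib.parse import urlsplit

def _parts(url):
    """Split a URL into (host, path segments), tolerating a missing scheme."""
    if "//" not in url:
        url = "//" + url             # lets urlsplit recognise the host
    split = urlsplit(url)
    host = split.netloc.lower()
    segments = [s for s in split.path.split("/") if s]
    return host, segments

def classify(search_url, scraped_url):
    """Label scraped_url relative to the search base (assumed rules)."""
    s_host, s_path = _parts(search_url)
    u_host, u_path = _parts(scraped_url)
    if s_host != u_host:
        return "external"            # different root
    if u_path == s_path:
        return "search base"         # the search URL itself
    if len(u_path) > len(s_path) and u_path[:len(s_path)] == s_path:
        return "sub-page"            # continues the search URL's path
    return "family"                  # same root, but not below the base

base = "www.epa.gov/aboutepa/epa-region-1-new-england"
labels = [
    classify(base, base + "/contact"),
    classify(base, "www.epa.gov/aboutepa/epa-region-10-pacific-northwest"),
    classify(base, "www.epa.gov"),
    classify(base, "https://www.nrs.fs.fed.us/STEW-MAP/data/"),
]
```

Comparing whole path segments, not raw strings, is a design choice in this sketch: it prevents a path such as /parks-and-recreation from being treated as a continuation of /parks.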

For further details about the web-scraping, and for possible limitations of the snaWeb package and its returns, please see the appendix of Sayles et al. (in review), "How deep to dig: effects of web-scraping search depth on hyperlink network analysis of environmental stewardship organizations".


3. Key words.
Social network analysis, SNA, hyperlink networks, web-scraping, environmental governance, decision support tools, environmental stewardship


4. Date of last update.
26 May 2021


5. Data dictionary.
The following are column names in the data set and their definitions.

	X: The row index from the original scrape of sites. Sites were run in batches to reduce computational burden. As such, these row indices are not sequential and have little value for any analysis. They should be ignored.  
	
	searchurl: The URL that was used to initiate the scrape of a given site.
	
	url: Scraped URL
	
	status: Hypertext Transfer Protocol (HTTP) response status code of the URL 
	
	access_date: Date and time the URL was accessed
	
	access_time: Amount of time it took to access the URL
	
	type: URL content type
	
	title: web-page title
	
	internal: (TRUE / FALSE) Whether the web-page is internal to the website hosting the root URL. This is based on the root: if the search and scraped URLs have the same root, the scraped URL is internal (internal = TRUE).
	
	subpage: (TRUE / FALSE) Whether the web-page is a sub-page of the search URL. If the scraped URL is a continuation of the search URL's path/file name, then it is a sub-page (subpage = TRUE). Example: www.website.com/subpage is a sub-page of www.website.com, and www.website.com/subpage/subpage2 is a sub-page of both. A page can share a root URL (i.e., be internal) without being a sub-page. E.g., www.website.com/other is internal to www.website.com/subpage, but is not a sub-page of it.
	
	n_returns: number of hyperlinks that were scraped from a given URL. Only the search URL and its subpages are scraped. 
	
	scrape_time: amount of time it took to scrape a site (in seconds)
	
	depth: the search depth at which the URL was found. Search depth can be thought of as the fewest number of clicks a user must make within a website to arrive at a given sub-page of that website. The search URL is at depth zero, pages one click away are at depth one, and each further click adds one.
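
The columns above can be combined to build a simple hyperlink edge list. The following is a minimal sketch under stated assumptions: the data are assumed to be comma-separated with TRUE / FALSE written as text, the three sample rows are invented for illustration (only a subset of columns is shown), and keeping only working external links is one possible analysis choice, not a recommendation from the data providers.

```python
import csv
import io

# Hypothetical rows using a subset of the column names defined above
sample = """X,searchurl,url,status,internal,subpage,depth
1,www.website.com,www.website.com/subpage,200,TRUE,TRUE,1
2,www.website.com,www.partner.org,200,FALSE,FALSE,1
3,www.website.com,www.gone.org/page,404,FALSE,FALSE,2
"""

def external_edges(handle):
    """Yield (searchurl, url) pairs for working links that leave the scraped site."""
    for row in csv.DictReader(handle):
        if row["internal"] == "FALSE" and row["status"] == "200":
            yield row["searchurl"], row["url"]

edges = list(external_edges(io.StringIO(sample)))
```

To apply the same function to the actual data, the in-memory sample would be replaced by an opened file handle.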

6. References.
Dellavalle, R. P., Hester, E. J., Heilig, L. F., Drake, A. L., Kuntzman, J. W., Graber, M., & Schilling, L. M. (2003). Going, Going, Gone: Lost Internet References. Science, 302(5646), 787–788.

Duda, J. J., & Camp, R. J. (2008). Ecology in the Information Age: Patterns of Use and Attrition Rates of Internet-Based Citations in ESA Journals, 1997-2005. Frontiers in Ecology and the Environment, 6(3), 145–151. http://www.jstor.org/stable/20440844

Hennessey, J., & Ge, S. X. (2013). A cross disciplinary study of link decay and the effectiveness of mitigation techniques. BMC Bioinformatics, 14(Suppl. 14), S5. https://doi.org/10.1186/1471-2105-14-S14-S5

Hondula, K. L. (2020). Shiny App Accessibility, Part 1: Only You Can Prevent Link Rot. SESYNC Cyberhelp for Researchers & Teams Blog. https://cyberhelp.sesync.org/blog/shiny-in-pubs.html#fn:2

Issuecrawler. (2021). Issuecrawler instructions for use. www.govcom.org/Issuecrawler_instructions.htm

Jones, S. M., Van de Sompel, H., Shankar, H., Klein, M., Tobin, R., & Grover, C. (2016). Scholarly context adrift: Three out of four URI references lead to changed content. PLoS ONE, 11(12). https://doi.org/10.1371/journal.pone.0167475

Kreakie, B. J., Hychka, K. C., Belaire, J. A., Minor, E., & Walker, H. A. (2016). Internet-Based Approaches to Building Stakeholder Networks for Conservation and Natural Resource Management. Environmental Management, 57(2), 345–354. https://doi.org/10.1007/s00267-015-0624-8

Newig, J., Günther, D., & Pahl-Wostl, C. (2010). Synapses in the Network: Learning in Governance Networks in the Context of Environmental Management. Ecology and Society, 15(4), 24.

R Core Team. (2020). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/. Version 4.0.3.

Sayles, J. S. (2018). Effects of social-ecological scale mismatches on estuary restoration at the project and landscape level in Puget Sound, USA. Ecological Restoration, 36(1), 62–75. https://doi.org/10.3368/er.36.1.62c

Sayles, J. S., & Baggio, J. A. (2017). Who collaborates and why: Assessment and diagnostic of governance network integration for salmon restoration in Puget Sound, USA. Journal of Environmental Management, 186, 64–78. https://doi.org/10.1016/j.jenvman.2016.09.085

Stockton, T. (2020). snaWeb Package: An R package for finding and building social networks for a website, version 1.0.1.

USDA Forest Service. (2017). Stewardship Mapping and Assessment Project (STEW-MAP). New York City Data Set. Available online at https://www.nrs.fs.fed.us/STEW-MAP/data/.