SharePoint Fetcher Background Information#
The fetcher makes a request against the SharePoint REST API, downloads the specified files and folders from the configured directory, and saves them to the evidence path. The evidence path is set during the execution of a run and is read by the sharepoint-fetcher from an environment variable.
Note
The SharePoint Fetcher currently supports Kerberos authentication for on-premise SharePoint instances and, for cloud instances, authentication via an app registration to access OAuth-protected sites at mycompany.sharepoint.com.
Prerequisites#
For on-premise SharePoint, the user ID that is set for the fetcher in SHAREPOINT_FETCHER_USERNAME must have access to the project site and the defined project directory in order for the fetcher to work.
For the cloud SharePoint, the AAD Registration app needs to have all the permissions in place so that the data can be fetched from the SharePoint sites. For more information on how you can set up the App Registration, please check How to configure an Azure App Registration required for authentication.
Environment variables#
Because fetching data from on-premise and cloud SharePoint instances requires different environment variables, the variables are split into those common to both fetch processes and those specific to each instance type.
Attention
Make sure to include all the mandatory environment variables for your use case (both common and specific mandatory variables)!
Common environment variables#
Attention
For the SharePoint Fetcher, you must include either the SHAREPOINT_FETCHER_PROJECT_URL environment variable, or both the SHAREPOINT_FETCHER_PROJECT_SITE and SHAREPOINT_FETCHER_PROJECT_PATH environment variables.
- SHAREPOINT_FETCHER_PROJECT_URL#
(Optional) This variable contains the full URL of the SharePoint file or folder you want to download. The correct format of the URL should include the project site and file/folder path. For more information about the URL and how to obtain it please visit How to get the URL of a file or folder from cloud SharePoint or How to get the URL of a file or folder from on-premise SharePoint.
Note
Please note that the SHAREPOINT_FETCHER_PROJECT_URL environment variable takes precedence over the SHAREPOINT_FETCHER_PROJECT_SITE and SHAREPOINT_FETCHER_PROJECT_PATH environment variables. Therefore, it is recommended to primarily use the URL.
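The rule above (either the URL, or site and path together, with the URL winning when both are present) can be sketched as follows. This is only an illustration of the documented behaviour, not the fetcher's actual code; the function name `resolve_target` and the returned dictionary keys are hypothetical.

```python
from typing import Dict, Mapping


def resolve_target(env: Mapping[str, str]) -> Dict[str, str]:
    """Pick the download target from the three PROJECT_* variables.

    Hypothetical helper: it only mirrors the rule described in the text.
    """
    url = env.get("SHAREPOINT_FETCHER_PROJECT_URL", "").strip()
    site = env.get("SHAREPOINT_FETCHER_PROJECT_SITE", "").strip()
    path = env.get("SHAREPOINT_FETCHER_PROJECT_PATH", "").strip()

    if url:
        # The URL takes precedence, so site and path are ignored here.
        return {"project_url": url}
    if site and path:
        # Without a URL, both site and path must be present.
        return {"project_site": site, "project_path": path}
    raise ValueError(
        "Set SHAREPOINT_FETCHER_PROJECT_URL, or both "
        "SHAREPOINT_FETCHER_PROJECT_SITE and SHAREPOINT_FETCHER_PROJECT_PATH"
    )
```

Calling `resolve_target(os.environ)` before starting a run is one way to catch a missing variable early.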
- SHAREPOINT_FETCHER_PROJECT_SITE#
(Optional) This variable contains the URL of the SharePoint site from which you want to download files. A SharePoint URL consists of two parts: the site URL, followed by the path to the actual file or folder. The site part is stored in this variable, whereas the path part is stored in the SHAREPOINT_FETCHER_PROJECT_PATH variable. The site URL usually follows one of these patterns:
for on-premise instances: https://{hostname}/sites/{site}/, for example https://sites.inside-share2.org.com/sites/1234567/
for cloud SharePoint instances: https://mycompany.sharepoint.com/sites/{site}/, for example https://mycompany.sharepoint.com/sites/msteams_5xxxxxx5/
The site link must be given in the correct format as shown above because the internal logic depends on it.
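To illustrate the two-part structure of a SharePoint URL, the following sketch splits a full URL into its site part and its path part. It assumes the `https://{hostname}/sites/{site}/` pattern shown above and ignores URL encoding and query strings; the helper name `split_sharepoint_url` is hypothetical and not part of the fetcher.

```python
import re
from typing import Tuple

# Assumed pattern, taken from the examples above: https://{hostname}/sites/{site}/
_SITE_RE = re.compile(r"^(https://[^/]+/sites/[^/]+/)(.*)$")


def split_sharepoint_url(url: str) -> Tuple[str, str]:
    """Split a full URL into (site URL, path below the site)."""
    match = _SITE_RE.match(url)
    if match is None:
        raise ValueError(f"URL does not follow the expected site pattern: {url}")
    # Group 1 is the site URL including the trailing slash, group 2 the rest.
    return match.group(1), match.group(2)
```

The first element corresponds to what SHAREPOINT_FETCHER_PROJECT_SITE expects, the second to SHAREPOINT_FETCHER_PROJECT_PATH.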
- SHAREPOINT_FETCHER_PROJECT_PATH#
(Optional) This variable contains the path to the folder inside SharePoint which contains the files you want to fetch, e.g., Documents/fossid-tools-report-ok/.
The first part of this path is the name of the root folder of your SharePoint site. Usually, this is Documents for on-premise and Shared Documents for cloud instances, but check the URL to your file if you are unsure.
If you want to download only a single file, you can specify the file path instead of its parent directory.
If the given path ends with a slash (/), it is assumed to point to a directory. If it doesn't end with a slash, it is assumed to point to a single file.
- SHAREPOINT_FETCHER_IS_CLOUD#
(Optional) If this variable is set to True, true or 1, the SharePoint fetcher connects to a cloud SharePoint instance. If it is set to False, false or 0, the SharePoint fetcher works with an on-premise instance.
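Two of the conventions described so far, the accepted spellings for SHAREPOINT_FETCHER_IS_CLOUD and the trailing-slash rule for SHAREPOINT_FETCHER_PROJECT_PATH, are easy to check before a run. The sketch below is a hedged illustration: both function names are hypothetical, and rejecting any other spelling with an error is an assumption, since the documentation does not say how the fetcher treats unlisted values.

```python
from typing import Optional

_TRUE_VALUES = {"True", "true", "1"}
_FALSE_VALUES = {"False", "false", "0"}


def parse_is_cloud(raw: Optional[str]) -> bool:
    """Interpret SHAREPOINT_FETCHER_IS_CLOUD; unset means on-premise."""
    if raw is None:
        return False
    if raw in _TRUE_VALUES:
        return True
    if raw in _FALSE_VALUES:
        return False
    # Assumption: any other spelling is treated as a configuration error.
    raise ValueError(f"Unsupported value for SHAREPOINT_FETCHER_IS_CLOUD: {raw!r}")


def points_to_directory(project_path: str) -> bool:
    """A trailing slash marks a directory, anything else a single file."""
    return project_path.endswith("/")
```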
- SHAREPOINT_FETCHER_DESTINATION_PATH / SHAREPOINT_FETCHER_OUTPUT_DIR#
(Optional) Both names can be used interchangeably and have the same effect. This variable specifies the path to the destination folder where the fetcher should save the downloaded content. By default, the value is set to the current working directory.
- SHAREPOINT_FETCHER_CONFIG_FILE#
(Optional) This variable contains the path to the config file of the fetcher. This config file is optional. You can find more information below in the section The fetcher’s config file.
- SHAREPOINT_FETCHER_FILTER_CONFIG_FILE#
(Optional) This variable contains the path to the filter config file of the fetcher. Unlike in most other cases, this config file is optional. You can find more information below in the section The fetcher’s filter config file.
- SHAREPOINT_FETCHER_DOWNLOAD_PROPERTIES_ONLY#
(Optional) If you are not interested in the files' contents but only in their properties, you can save resources and bandwidth by downloading only the file properties and not the files themselves. However, be aware that you cannot evaluate files that you haven't downloaded. The properties file on its own is not sufficient for that.
Simply set this variable to “1” or “true” to disable file downloading.
- SHAREPOINT_FETCHER_FORCE_IP#
(Optional) If the name resolution of the SharePoint site is faulty, you can override the DNS name resolution by providing a custom IP address, which is then used instead of the DNS-resolved IP address for the given hostname.
For example, if the SharePoint site is at https://my.sharepoint.site/sites/ (with IP address 1.2.3.4), but the DNS server reports a faulty 9.9.9.9, you can set SHAREPOINT_FETCHER_FORCE_IP=1.2.3.4, which causes the SharePoint fetcher to get data from https://1.2.3.4/sites/... instead of https://9.9.9.9/sites/....
Note
Sometimes it is also required to define no_proxy=1.2.3.4 so that the connection to 1.2.3.4 is not routed through the default proxy server, which might then route the request over some external network.
In Downloading and checking files from on-premise instances, you can find more information on how to find the IP addresses of different SharePoint sites.
On-premise SharePoint environment variables#
- SHAREPOINT_FETCHER_USERNAME#
This variable contains the username of the account that is used to access the SharePoint server.
- SHAREPOINT_FETCHER_PASSWORD#
This variable must contain the password of the user given in SHAREPOINT_FETCHER_USERNAME.
Note
The above information is required for Kerberos authentication to SharePoint on-premise servers.
- SHAREPOINT_FETCHER_CUSTOM_PROPERTIES#
(Optional) SharePoint supports custom properties for files and folders. Examples are things like Confidentiality Class or Workflow Status. For these properties, an enum of values exists, stored in a SharePoint List.
For example, a list with title RevisionStatus for a custom property Revision Status could have values like Valid or Draft. The API URL for accessing this list and its items would be similar to this one: https://some.sharepoint.server/sites/144287/_api/web/Lists/GetByTitle('RevisionStatus')/items
If you want to retrieve the custom properties with human-readable titles instead of opaque integer IDs, you need to provide a mapping from the list title to the title property of the list items, which contains the human-readable names.
You need to provide three names:
The name of the file/folder property, e.g. WorkOnStatusId.
The name of the SharePoint list belonging to this custom property, e.g. WorkOn Status.
The name of the property which contains the list item title, e.g. WorkOnStatus.
These three names are then bundled into one mapping, e.g. "WorkOnStatusId=>WorkOn Status=>WorkOnStatus". You can combine multiple mappings by separating them with the | character.
If you want to find out how you can get those names, check Setting up custom property mappings.
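The mapping syntax described above (three names joined by `=>`, several mappings joined by `|`) can be checked with a small parser before it is handed to the fetcher. This is a sketch under assumptions: the names `PropertyMapping` and `parse_custom_properties` are hypothetical, and trimming whitespace around each name is a choice made here, not documented fetcher behaviour.

```python
from typing import List, NamedTuple


class PropertyMapping(NamedTuple):
    property_name: str   # name of the file/folder property
    list_title: str      # title of the SharePoint list
    title_property: str  # list item property holding the readable title


def parse_custom_properties(raw: str) -> List[PropertyMapping]:
    """Parse 'A=>B=>C|D=>E=>F' into a list of three-part mappings."""
    mappings = []
    for chunk in raw.split("|"):
        chunk = chunk.strip()
        if not chunk:
            continue  # tolerate a trailing or doubled separator
        parts = [part.strip() for part in chunk.split("=>")]
        if len(parts) != 3 or not all(parts):
            raise ValueError(f"Expected three names separated by '=>': {chunk!r}")
        mappings.append(PropertyMapping(*parts))
    return mappings
```

For the example value from the text, `parse_custom_properties("WorkOnStatusId=>WorkOn Status=>WorkOnStatus")` yields a single mapping with those three names.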
Cloud SharePoint environment variables#
- SHAREPOINT_FETCHER_TENANT_ID#
This variable contains the Directory (tenant) ID of the Azure Active Directory Registration App.
- SHAREPOINT_FETCHER_CLIENT_ID#
This variable contains the Application (client) ID of the Azure Active Directory Registration App.
- SHAREPOINT_FETCHER_CLIENT_SECRET#
This variable contains the value for the client secret created in the Azure Active Directory Registration App.
Note
The above information is required for OAuth access to protected cloud SharePoint sites.
Attention
Do not forget to set the SHAREPOINT_FETCHER_IS_CLOUD variable to "True" in order to fetch data from cloud SharePoint instances. Otherwise, the fetcher will try to fetch data from an on-premise instance and won't work properly!
The fetcher’s config file#
The purpose of a configuration file is to gather all the settings needed for the fetcher to operate correctly in a single file, rather than using multiple environment variables. The configuration file must follow YAML syntax and consist of variable: value pairs.
The content of the file uses a naming convention derived from the corresponding environment variables. For example, the environment variable SHAREPOINT_FETCHER_IS_CLOUD would be written as is_cloud in the configuration file.
Regarding precedence, environment variables take priority over configuration variables from the file. For instance, if the configuration file has the variable project_site with the value https://mycompany.sharepoint.com/sites/123, but the environment variable SHAREPOINT_FETCHER_PROJECT_SITE is set to https://mycompany.sharepoint.com/sites/456, the autopilot will fetch data from the site specified by the environment variable.
The configuration file looks like the following:
destination_path: <SomePath>
is_cloud: True / False
project_path: <SomeProjectPath>
project_site: <SomeProjectSite>
project_url: <SomeProjectUrl>
username: <username>
password: <password>
tenant_id: <tenantId>
client_id: <clientId>
client_secret: <clientSecret>
force_ip: <forceIp>
custom_properties: <customProperties>
download_properties_only: True / False
sharepoint_file: <sharepointFile>
filter_config_file: <filterConfigFilePath>
Attention
The example above contains all the possible variables. You do not need to include all of them in your configuration file, only those that are relevant to your use case. Make sure to include all mandatory variables when fetching from an on-premise or cloud SharePoint instance!
Note
For more information about the logic behind each file variable, refer to the corresponding environment variable description.
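The naming convention and the precedence rule described in this section can be sketched in a few lines. The sketch works on plain dictionaries so that it needs no YAML library; the function names are hypothetical, and special cases such as the SHAREPOINT_FETCHER_OUTPUT_DIR alias are deliberately not handled.

```python
from typing import Any, Dict, Mapping

ENV_PREFIX = "SHAREPOINT_FETCHER_"


def env_name_to_config_key(name: str) -> str:
    """SHAREPOINT_FETCHER_IS_CLOUD -> is_cloud (prefix dropped, lowercased)."""
    if not name.startswith(ENV_PREFIX):
        raise ValueError(f"Not a fetcher variable: {name}")
    return name[len(ENV_PREFIX):].lower()


def merge_settings(file_config: Mapping[str, Any],
                   env: Mapping[str, str]) -> Dict[str, Any]:
    """Combine both sources; environment variables override file values."""
    merged = dict(file_config)
    for name, value in env.items():
        if name.startswith(ENV_PREFIX):
            # Assumption: every prefixed variable maps to a config key.
            merged[env_name_to_config_key(name)] = value
    return merged
```

With `project_site` set to the `/sites/123` URL in the file and SHAREPOINT_FETCHER_PROJECT_SITE set to the `/sites/456` URL in the environment, `merge_settings` returns the `/sites/456` value, matching the example above.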
The fetcher’s filter config file#
By default, all files in the directory specified by SHAREPOINT_FETCHER_PROJECT_PATH are downloaded. However, it is also possible to select only certain files, based on filename patterns or SharePoint properties.
The content of the filter config file is similar to what the SharePoint Evaluator expects.
It looks like the following:
- files: 'OneDriving_Q-Activity-List.xlsx(1)/*' # ← first comes a `files` section with a wildcard
  title: 'File link title' # ← needed if you want a message with the link of the selected file
  select: # ← list of property checks, similar to SharePoint Evaluator
    - property: 'CSC'
      equals: 2
  onlyLastModified: true # ← (optional) if true, download only the most recent file
- files: 'PlainFile.pdf' # ← also possible to just specify a single filename
- files: 'Folder/*.pdf' # ← you can also use wildcards without selectors
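To give a rough idea of how such rules select files, the following sketch applies `files` patterns and `equals` checks to a set of candidate files. It is an approximation under several assumptions: wildcard matching is done with Python's `fnmatch`, multiple `select` checks must all pass, and `onlyLastModified` and `title` are ignored. The fetcher's real matching logic may differ, and all function names are hypothetical.

```python
from fnmatch import fnmatchcase
from typing import Any, Dict, List, Mapping


def matches_rule(path: str, properties: Mapping[str, Any],
                 rule: Mapping[str, Any]) -> bool:
    """Check one file against one filter rule."""
    if not fnmatchcase(path, rule["files"]):
        return False
    # Assumption: every listed property check has to pass.
    for check in rule.get("select", []):
        if properties.get(check["property"]) != check.get("equals"):
            return False
    return True


def select_files(files: Mapping[str, Mapping[str, Any]],
                 rules: List[Dict[str, Any]]) -> List[str]:
    """Return the paths that match at least one rule, in input order."""
    return [
        path for path, properties in files.items()
        if any(matches_rule(path, properties, rule) for rule in rules)
    ]
```

Here `files` maps each candidate path to its SharePoint properties, and `rules` is the filter config after it has been loaded from YAML.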
Example config#
You can find a complete example configuration here:
for on-premise instances: Downloading and checking files from on-premise instances.
for cloud instances: Downloading and checking files from cloud instances.