Documentation and cleanup.

8 years ago · a69a12916e
10 changed files with 228 additions and 22 deletions
--- a/.flake8
+++ b/.flake8
@ -0,0 +1,5 @@
+[flake8]
+exclude = .git
+# F401: Unused imports used conditionally (so not really unused)
+# E501: 79 character limit is impractical with a 4 space indent...
+ignore = F401,E501
--- a/README.md
+++ b/README.md
@ -1,3 +1,86 @@
 # busybody

-Neighborhood watch for your SaaS apps.
+Neighborhood watch for your SaaS apps.
+
+## Setup
+
+### Python Environment
+
+Use your system's package manager to install `python-3.6`, `pip`, and `virtualenv`.
+
+Next, in a directory that can host executable files, create a virtualenv:
+
+> $ virtualenv busybodyenv
+
+Once that is complete, move into the new `busybodyenv` (or whatever you named your virtualenv) directory and activate your new virtual environment:
+
+> $ . ./bin/activate
+
+Now that we're in the environment, please download `busybody` from your choice of source. For example:
+
+> $(busybodyenv) git clone git@github.com:bobthesecurityguy/busybodyy.git
+
+Now move into that new directory and install `busybody`'s requirements:
+
+> $(busybodyenv) pip install -r requirements.txt
+
+On certain systems, installing these dependencies from `pip` may fail. In that case, check your package manager for pre-built packages under that name and then re-run the above command until it succeeds. Those systems will generally need to instantiate the virtualenv with the `--system-site-packages` option.
+
+### Scaling
+
+`busybody` is designed to scale horizontally. The polling and analysis portions of the application can be run separately with the `--mode` flag. As such, pollers may be staggered or run on multiple machines. Additionally, the per-user model that is constructed lends itself naturally to sharding if the analysis function needs to be scaled.
+
+### Config File
+
+The `busybody` configuration file is a YAML config file that allows you to configure most settings within the script. Some settings are available at the command line (mostly runtime options like verbosity and log output file).
+
+The config file can be placed anywhere, but if a location is not given at runtime, the script will default to looking in `~/.config/busybody/config.yml` or `./config.yml` in that order for the config.
+
+The format of the config file is standard YAML formatting. Top-level dictionaries correspond to the major functions of `busybody`, and though there may be a few standard items within them, most are module specific and explained in the README.md files within the module directories.
+
+Top-level configuration items are:
+
+> pollers
+
+> persistence
+
+> analysis
+
+> notifiers
+
+As noted above, each of those should be a standard YAML dictionary, an example of which can be found in the example.yml file.
+
+There are a few lower-level configuration options of special note that will not be covered in the per-module READMEs as they apply either to the system as a whole, or to multiple types of modules. These are:
+
+> user\_domain
+
+This is a string entry within any module dictionary inside of the "analysis" top-level dictionary. This string provides a domain to append to the user names from the module to convert them into email-style strings. NOTE: This is a blunt tool that will be insufficient for many cases. It is applied prior to the below "user\_map", however, so it may be useful for an initial pass with later corrections. Generally, it is preferable to already have emails in the logs as they serve as a consistent cross-service user identifier.
+
+> user\_map
+
+This can be a dictionary within any module dictionary inside of the "analysis" top-level dictionary. It provides a mapping between a source user value (key) and a final user value (value). This can be useful, for example, if a user has signed up for a service with their personal email, so that you can continue to properly correlate those log events with their other entries from other services.
+
+> geoip
+
+This dictionary inside of the "analysis" top-level dictionary should contain two entries that point to databases provided by MaxMind. The "city\_db" entry should be a MaxMind city-resolution database. The "asn\_db" should be a MaxMind IP-\>ASN database. Both are freely available from [MaxMind's site](http://dev.maxmind.com/geoip/geoip2/geolite2/).
+
+> enabled
+
+This is a dummy dictionary entry that may be provided within any module that needs no further configuration, but that needs to be listed in a particular section. Any module that you wish to poll from **must** be listed inside the "pollers" top-level dictionary. Similarly, any module that you wish to perform analysis on **must** be listed inside of the "analysis" top-level dictionary.
+
+## Output
+
+We recommend running Busybody with the verbose flag and redirecting output to a file until you are certain that the configuration is correct. This is usually sufficient to identify issues with your configuration. Debug mode (`-vv`) logs a significant amount of information, some of which may be sensitive, and should be used with care.
+
+Note that the output of this program is the result of statistical tests run on the input logs. Depending on the nature of your incoming logs, some activity that seems suspicious may not be sufficiently distinct from the background noise for this program to alert on it. Similarly, some innocuous activities may be flagged because they deviate significantly from what the model believes the norm to be.
+
+Interpretation of the output may require an analyst to review other entries from the user that has been flagged in order to determine the cause of the flag. Please bear that in mind and only take action against a flagged user account if further investigation shows that such action is merited. Just like an actual neighborhood watch, just taking reports at face value may lead to undesirable outcomes.
+
+## Changelog
+
+* 1.0 - Initial public release
+    * Functional ML core using isolation forest and a per-user IP location, ASN, and user-agent model.
+    * GSuite module for polling
+    * Slack module for polling and notifying
+    * Flatfile module for persistence
+    * Documentation
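
As a reading aid for the README added above: it describes `user_domain` as being applied before `user_map`. The following is a minimal sketch of that ordering, not code from `busybody.py`. The helper name `normalize_user` is hypothetical, and the rule that names already containing `@` are left untouched is an assumption made for the example.

```python
def normalize_user(raw_user, module_config):
    """Hypothetical helper: apply user_domain first, then user_map."""
    user = raw_user
    domain = module_config.get("user_domain")
    # Assumption: only bare user names get the domain appended.
    if domain and "@" not in user:
        user = user + "@" + domain
    # user_map maps a source user value (key) to a final user value (value).
    user_map = module_config.get("user_map") or {}
    return user_map.get(user, user)


# Illustrative module dictionary, shaped like an "analysis" entry in example.yml.
slack_analysis = {
    "user_domain": "example.com",
    "user_map": {"example.personaluser@gmail.com": "user@example.com"},
}
print(normalize_user("alice", slack_analysis))
print(normalize_user("example.personaluser@gmail.com", slack_analysis))
```

Because the map is consulted last, it can correct any address that the blunt domain append gets wrong.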
--- a/busybody.py
+++ b/busybody.py
@ -15,7 +15,7 @@ from sklearn.preprocessing import scale
 from sklearn.feature_extraction.text import TfidfVectorizer
 from sklearn.ensemble import IsolationForest

-program_version = "0.1"
+program_version = "1.0"

 logger = logging.getLogger(__name__)

@ -42,7 +42,7 @@ def poll(config):

 def load_historical(config):
    if "persistence" in config["active_modules"] and config["active_modules"]["persistence"]:
-        persist_module = getattr(sys.modules[config["active_modules"]["persistence"]], 
+        persist_module = getattr(sys.modules[config["active_modules"]["persistence"]],
                                 config["active_modules"]["persistence"])
        get_last_func = getattr(persist_module, "get_last")
        config = get_last_func(config)
@ -67,7 +67,9 @@ def preprocess(config, data):
            if filter_field:
                if event[filter_field] in poll_mod.FILTERED_EVENTS:
                    continue
-            if not ts_field in event or not event[ts_field] or not user_field in event or not event[user_field] or not ip_field in event or not event[ip_field] or not ua_field in event or not event[ua_field]:
+            if ts_field not in event or not event[ts_field] or user_field not in event or \
+               not event[user_field] or ip_field not in event or not event[ip_field] or \
+               ua_field not in event or not event[ua_field]:
                continue
            if type(event[ts_field]) == str:
                ts = datetime.timestamp(datetime.strptime(event[ts_field], '%Y-%m-%dT%H:%M:%S.%fZ'))
@ -108,13 +110,13 @@ def preprocess(config, data):

 def analyze(config, data):
    alerts = []
-    last_anlyzed = 0
+    last_analyzed = 0
    if "persistence" in config["active_modules"] and config["active_modules"]["persistence"]:
        persist_module = getattr(sys.modules[config["active_modules"]["persistence"]],
                                 config["active_modules"]["persistence"])
        last_analyzed_func = getattr(persist_module, "get_last_analyzed")
        persist_analyzed_func = getattr(persist_module, "persist_last_analyzed")
-        last_analyzed = last_analyzed_func(config)                
+        last_analyzed = last_analyzed_func(config)
    # get unique list of users across data
    unique_users = list(set([e[2] for e in data]))
    logger.debug("Unique users: %s" % len(unique_users))
@ -154,13 +156,19 @@ def analyze(config, data):
                flagged += 1
                alerts.append(user_events[ev_no][1])
        logger.debug("Processed %s: %s of %s flagged." % (user, flagged, len(user_events)))
-    for alert in alerts:
-        logger.info(alert)
+    if "notifiers" in config["active_modules"]:
+        for module in config["active_modules"]["notifiers"]:
+            notify_mod = getattr(sys.modules[module], module)
+            notify_func = getattr(notify_mod, "notify")
+            notify_func(config, alerts)
+    else:
+        for alert in alerts:
+            logger.info(alert)
    persist_analyzed_func(config, data[-1][0])
-    
+

 def latlon_to_xyz(lat, lon):
-    phi   = (90 - lat) * (numpy.pi / 180)
+    phi = (90 - lat) * (numpy.pi / 180)
    theta = (lon + 180) * (numpy.pi / 180)

    x = 0 - (numpy.sin(phi) * numpy.cos(theta))
@ -179,7 +187,7 @@ def load_config(config_path):
        if homeconfig.is_file():
            config_file = homeconfig
        elif scriptconfig.is_file():
-            config_file = sriptconfig
+            config_file = scriptconfig
        else:
            raise RuntimeError("No configuration file found.")
    with config_file.open() as f:
@ -189,14 +197,14 @@ def load_config(config_path):

 def load_modules(config):
    config["active_modules"] = {}
-    if not "pollers" in config or not config["pollers"]:
+    if "pollers" not in config or not config["pollers"]:
        raise RuntimeError("Polllers aren't optional.")
    config["active_modules"]["pollers"] = []
    for poller in config["pollers"]:
        importlib.import_module(poller, poller)
        config["active_modules"]["pollers"].append(poller)
    if config["mode"] is None or config["mode"] == "analyze":
-        if not "notifiers" in config or not config["notifiers"]:
+        if "notifiers" not in config or not config["notifiers"]:
            raise RuntimeError("Configured to analyze, but no notifiers in config file.")
        config["active_modules"]["notifiers"] = []
        for notifier in config["notifiers"]:
@ -204,15 +212,15 @@ def load_modules(config):
            config["active_modules"]["notifiers"].append(notifier)
    if "persistence" in config and config["persistence"]:
        if "module" in config["persistence"] and config["persistence"]["module"]:
-             importlib.import_module(config["persistence"]["module"],
-                                     config["persistence"]["module"])
-             config["active_modules"]["persistence"] = config["persistence"]["module"]
+            importlib.import_module(config["persistence"]["module"],
+                                    config["persistence"]["module"])
+            config["active_modules"]["persistence"] = config["persistence"]["module"]
        else:
            raise RuntimeError("Persistence is configured, but no module specified.")
    return config


-#INIT STUFF/CONTROL LOOP
+# INIT STUFF/CONTROL LOOP
 if __name__ == '__main__':
    parser = argparse.ArgumentParser(prog="Busybody",
                                     description="Neighborhood watch for your SaaS apps.")
--- a/example.yml
+++ b/example.yml
@ -0,0 +1,22 @@
+persistence:
+  module: flatfile
+  log_directory: /var/log/busybody 
+pollers:
+  slack:
+    api_token: xoxp-1234567890-1234567890-123456789012-deadbeef0000deadbeef0000deadbeef
+  gsuite:
+    admin_email: admin@example.com
+    credential_file: /etc/busybody/gsuite.json
+analysis:
+  slack:
+    user_map:
+      example.personaluser@gmail.com: user@example.com
+  gsuite:
+    enabled: True
+  geoip:
+    city_db: /etc/busybody/GeoLite2-City.mmdb
+    asn_db: /etc/busybody/GeoLite2-ASN.mmdb
+notifiers:
+  slack:
+    api_token: xoxb-123456789012-stringofradomlookingbits
+    channel: general
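
The README states that, when no location is given at runtime, the config is looked up in `~/.config/busybody/config.yml` and then `./config.yml`. A minimal sketch of that lookup order follows. The helper name `find_config` and its `home`/`cwd` parameters are hypothetical (added so the order can be exercised in isolation), and `./` is assumed here to mean the current working directory.

```python
from pathlib import Path


def find_config(explicit_path=None, home=None, cwd=None):
    """Hypothetical helper: return the first config file that exists."""
    if explicit_path:
        # A location given at runtime always wins.
        return Path(explicit_path)
    home = Path(home) if home else Path.home()
    cwd = Path(cwd) if cwd else Path.cwd()
    candidates = [
        home / ".config" / "busybody" / "config.yml",  # checked first
        cwd / "config.yml",                            # checked second
    ]
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    raise RuntimeError("No configuration file found.")
```

A file such as `example.yml` would typically be copied to one of those two locations, or passed explicitly.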
--- a/flatfile/README.md
+++ b/flatfile/README.md
@ -0,0 +1,11 @@
+# Setup
+
+The `flatfile` module requires no setup in order to operate, except that your calling user should have read and write permissions on the location where you wish to store logs.
+
+## Configuration File
+
+Flatfile only exists within the "persistence" top-level dictionary in the configuration file.
+
+The "persistence" dictionary may only contain one backend at a time. This is enforced by design: "flatfile" is not a sub-dictionary; instead, the "persistence" dictionary has a "module" key that may take "flatfile" as a value. Other values used by `flatfile` are:
+
+> log\_directory          - The directory in which you would like to store log files.
--- a/flatfile/flatfile.py
+++ b/flatfile/flatfile.py
@ -5,6 +5,7 @@ from pathlib import Path

 logger = logging.getLogger(__name__)

+
 def get_last(config):
    if "log_directory" not in config["persistence"] or not config["persistence"]["log_directory"]:
        raise RuntimeError("Flat file persistence requested, but no log_directory specified.")
@ -50,7 +51,7 @@ def get_historical_data(config):
        raise RuntimeError("Flat file persistence requested, but no log_directory specified.")
    log_dir = Path(config["persistence"]["log_directory"])
    log_dir.mkdir(mode=0o775, parents=True, exist_ok=True)
-    if not "pollers" in config:
+    if "pollers" not in config:
        return data
    for module in config["active_modules"]["pollers"]:
        data[module] = []
@ -69,6 +70,7 @@ def get_historical_data(config):
                    data[module].append(event)
    return data

+
 def get_last_analyzed(config):
    if "log_directory" not in config["persistence"] or not config["persistence"]["log_directory"]:
        raise RuntimeError("Flat file persistence requested, but no log_directory specified.")
--- a/gsuite/README.md
+++ b/gsuite/README.md
@ -0,0 +1,19 @@
+# Setup
+
+The `gsuite` module requires service user credentials in order to operate.
+
+## Service User Generation
+
+To create the required service user, follow the instructions from [Google](https://developers.google.com/admin-sdk/reports/v1/guides/delegation) about setting up a service user and performing account-wide delegation. Make note of where you store the downloaded credential file.
+
+## Configuration File
+
+GSuite may exist under the "pollers" and/or "analysis" top-level dictionaries in the configuration file.
+
+The "gsuite" dictionary within the "pollers" dictionary may contain:
+
+> credential\_file         - The location of the JSON credentials file for the service user.
+
+> admin\_email             - The admin user to impersonate during polling.
+
+The "gsuite" dictionary within the "analysis" dictionary has no special options.
--- a/gsuite/gsuite.py
+++ b/gsuite/gsuite.py
@ -13,6 +13,7 @@ USER_AGENT_FIELD = "events.0.login_type"
 FILTER_FIELD = "events.0.name"
 FILTERED_EVENTS = ["login_failure"]

+
 def poll(config):
    data = []
    scopes = ['https://www.googleapis.com/auth/admin.reports.audit.readonly']
@ -51,7 +52,7 @@ def flatten(event, prefix=''):
    flattened = {}
    for field_no, field in enumerate(event):
        if 'keys' in dir(event):
-            #Special case "parameters" values. We should to treat those as dicts.
+            # Special case "parameters" values. We should treat those as dicts.
            if field == "parameters":
                for param in event[field]:
                    if isinstance(param["value"], Iterable) and not isinstance(param["value"], str):
@ -59,7 +60,7 @@ def flatten(event, prefix=''):
                    else:
                        flattened[prefix + param["name"]] = param["value"]
                continue
-            else:        
+            else:
                nextLevel = event[field]
                currEntry = prefix + str(field)
        else:
--- a/slack/README.md
+++ b/slack/README.md
@ -0,0 +1,39 @@
+# Setup
+
+The `slack` module requires a set of API tokens in order to operate. The polling function requires a standard API token, while the notifier can operate with a "bot" token.
+
+## API Token Generation
+
+To create the required API tokens, direct your browser to the [Slack API Apps list](https://api.slack.com/apps) and (once signed-in to the appropriate team) click on the "Create New App" button. This should present you with a menu that allows you to set a name for the app (we recommend "Busybody") and select the team to enable it on.
+
+Once you enter that information, you will be taken to the detailed settings for your new app. Feel free to set the "Display Information" in a way that makes sense to you and ignore the "App Credentials" presented. Move down to "Bot Users", enable a bot user with the name of your choice, and then move back up to "OAuth & Permissions".
+
+In the OAuth section, select the following permissions. To make granting them more comfortable, each is listed with a description of what we use it for:
+
+> admin            - Used to access user logs.
+
+> bot              - Used to act as the bot user.
+
+> chat:write:bot   - Used to send messages as our bot.
+
+> users:read       - Used to access the user list.
+
+> users:read.email - Used to access the emails of users for correlation with other applications.
+
+Once you have granted those permissions, install the app and take note of the OAuth tokens that have been generated for you.
+
+## Configuration File
+
+Slack may exist under the "pollers", "analysis", and/or "notifiers" top-level dictionaries in the configuration file.
+
+The "slack" dictionary within the "pollers" dictionary may contain:
+
+> api\_token        - Defines the API token used to poll for user logs and to add the email to logs.
+
+The "slack" dictionary within the "analysis" dictionary has no special options.
+
+The "slack" dictionary within the "notifiers" dictionary may contain:
+
+> api\_token        - Defines the API token ("bot" token) used to send notifications about alerts.
+
+> channel          - Defines the channel or user to send the notification to.
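
For orientation, the notifier settings above feed one Slack message per alert. The sketch below builds the attachment payload with the same fields that `notify` in `slack/slack.py` uses in this commit. The helper name `build_attachment` is hypothetical and no Slack API call is made here.

```python
def build_attachment(alert, heading="Busybody has noted a suspicious event!"):
    """Hypothetical helper: build the attachment list for one alert string."""
    return [{
        "fallback": heading + "\n" + alert,  # plain-text fallback for clients without attachments
        "title": heading,
        "text": alert,
        "color": "#ffe600",
    }]


attachment = build_attachment("example alert text")
print(attachment[0]["title"])
```

Keeping the payload construction separate from the API call makes it easy to inspect what would be sent to the configured `channel`.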
--- a/slack/slack.py
+++ b/slack/slack.py
@ -37,15 +37,31 @@ def poll(config):


 def notify(config, alerts):
+    if "channel" not in config["notifiers"]["slack"]:
+        raise RuntimeError("Slack configured to notify, but no channel specified.")
+    slack_api = SlackClient(config["notifiers"]["slack"]["api_token"])
+    heading = "Busybody has noted a suspicious event!"
+    for alert in alerts:
+        attachment = [{
+            "fallback": heading + "\n" + alert,
+            "title": heading,
+            "text": alert,
+            "color": "#ffe600"
+        }]
+        result = slack_api.api_call("chat.postMessage",
+                                    channel=config["notifiers"]["slack"]["channel"],
+                                    attachments=attachment, as_user=True)
+        check_api(result)
    return


 def check_api(data):
    if not data["ok"]:
-        raise RuntimeError("Slack API returned an error: "+str(data))
+        raise RuntimeError("Slack API returned an error: " + str(data))
    else:
        return

+
 def enrich(config, data):
    unique_users = list(set([e["user_id"] for e in data]))
    slack_api = SlackClient(config["pollers"]["slack"]["api_token"])
@ -62,7 +78,7 @@ def enrich(config, data):
        elif "profile" in user_info["user"] and "email" in user_info["user"]["profile"]:
            user_map[user] = user_info["user"]["profile"]["email"]
            logger.debug("Mapping user %s to %s." % (user, user_map[user]))
-    new_data = []    
+    new_data = []
    for entry in data:
        if entry["user_id"] in user_map:
            entry["email"] = user_map[entry["user_id"]]