Parse multiple sites' IIS logs with Python

You know the struggle – there are multiple websites on the same IIS server and no easy way to analyze all of their logs at once. From time to time I notice one particular IP filling the bandwidth of our servers, and I want to check which sites this IP uses, which URLs it calls, and so on. To perform such checks quickly and easily, I decided to use Python.

The solution

import os
import glob
import pandas as pd

# modify the directory to point to the proper log dir
# (a raw string, so the backslashes are not treated as escape sequences)
sourceDir = r"T:\IISLogFiles"

# modify to handle all log fields present in your logs
log_field_names = ['date', 'time', 's-ip', 'cs-method', 'cs-uri-stem', 'cs-uri-query', 's-port', 'cs-username', 'c-ip',
                   'cs(User-Agent)', 'cs(Referer)', 'sc-status', 'sc-substatus', 'sc-win32-status', 'time-taken']

# collect the per-website dataframes here; they are concatenated once after the loop
frames = []

for dirname, dirnames, filenames in os.walk(sourceDir):

    # traverse through directories
    for subdirname in dirnames:

        # skip FTP log directories - we don't want to parse FTP logs
        if not subdirname.startswith("FTP"):
            list_of_files = glob.glob(os.path.join(dirname, subdirname, "*.log"))

            # perform data gathering only if there are log files in the subdirectory
            if len(list_of_files) > 0:

                # find the newest log file
                latest_file = max(list_of_files, key=os.path.getctime)

                # import the log data into a temporary dataframe
                df_temp = pd.read_csv(latest_file, sep=' ', comment='#', engine='python', names=log_field_names)

                # fill website identifier for further use
                df_temp["website"] = subdirname

                # collect the imported data for the final concatenation
                frames.append(df_temp)

# combine the partial imports into the main dataframe
# (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
df = pd.concat(frames, ignore_index=True)


print("Imported data breakdown: ")
print(df.info())

print("The most frequent IPs")
print(df["c-ip"].value_counts())
print("The most frequent User Agents")
print(df["cs(User-Agent)"].value_counts())
print("The most frequent websites in the logs")
print(df["website"].value_counts())
print("The most frequent response code")
print(df["sc-status"].value_counts())

The script above takes care of a few things that were important to me. I decided to load the logs of all websites on my IIS server, which means traversing the directories that store the log files. As you can see in the code, I skip the FTP logs: they use a different format and store different information, so I parse them separately.

Once inside a directory, I retrieve the list of log files and choose only the newest one, because in most log reviews that is the most useful file. Sometimes we have to review a longer period; in that case, I import all the files or apply different selection rules.
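For the longer-period case, a minimal sketch could swap the latest-file block inside the loop for a modification-time filter. This is only an illustration under that assumption; days_back is a hypothetical knob, not part of the original script:

import time

# keep every log file modified within the last N days instead of only the newest one
days_back = 7  # hypothetical knob - adjust to the period under review

cutoff = time.time() - days_back * 24 * 3600

for log_file in list_of_files:
    if os.path.getmtime(log_file) >= cutoff:
        df_temp = pd.read_csv(log_file, sep=' ', comment='#', engine='python', names=log_field_names)
        df_temp["website"] = subdirname
        frames.append(df_temp)

Keeping the selection logic in this one place makes switching between "newest file only" and "last week" a small, local change.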

The file is then imported using the pandas library, which I find useful for dataset manipulation. Comment lines (the #-prefixed header lines IIS writes) are skipped, and the data is stored in a pandas dataframe. As the last step of each file import, I add the website identifier (taken from the subdirectory name) to the imported records. This way I can tell which website a given record came from.
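If you plan to slice the data by time later, it can also help to combine the date and time columns into a single timestamp right after the import. A minimal sketch, assuming the W3C field layout from the script above and a reasonably recent pandas:

# combine the W3C date and time fields into one sortable timestamp column
df["timestamp"] = pd.to_datetime(df["date"] + " " + df["time"], format="%Y-%m-%d %H:%M:%S")

# IIS writes '-' for empty fields; treating those as missing keeps the value counts clean
df = df.replace("-", pd.NA)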

Each partial import is collected and concatenated into the main dataframe, which I use for further data analysis. Starting with such a dataframe, the possibilities are endless. As an example, I retrieve the lists of IPs, User Agents, websites, and status codes found in the logs. They are sorted with the most frequent first, so I can quickly locate possible issues.
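Coming back to the bandwidth-hogging IP from the introduction: once the main dataframe is built, drilling into a single address takes one filter. A short sketch; the suspect_ip value is a made-up placeholder for an address spotted in the value_counts output:

# hypothetical placeholder - use an address spotted in the value_counts output
suspect_ip = "203.0.113.42"

hits = df[df["c-ip"] == suspect_ip]

print("Websites used by this IP")
print(hits["website"].value_counts())
print("URLs called by this IP")
print(hits["cs-uri-stem"].value_counts().head(20))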