Traffic microstructures and network anomaly detection
Much hope has been placed in modelling network traffic with machine learning methods to detect previously unseen attacks. Many methods rely on microscopic features such as packet sizes or interarrival times to identify recurring patterns and detect deviations from them. However, the success of these methods depends both on the quality of the corresponding training and evaluation data and on an understanding of the structures that the methods learn. Currently, the academic community is lacking both: widely used synthetic datasets face serious problems, and the disconnect between methods and data has been named the "semantic gap". This thesis provides extensive examinations of the requirements on traffic generation and microscopic traffic structures that enable the effective training and improvement of anomaly detection models. We first present and examine DetGen, a container-based traffic generation paradigm that enables precise control over, and ground truth information about, the factors that shape traffic microstructures. The goal of DetGen is to provide researchers with extensive ground truth information and to enable the generation of customisable datasets that provide realistic structural diversity. DetGen was designed according to four specific traffic requirements that dataset generation needs to fulfil to enable machine-learning models to learn accurate and generalisable traffic representations. Current network intrusion datasets fail to meet these requirements, which we believe is one of the reasons for the limited success of anomaly-based detection methods. We demonstrate the significance of these requirements experimentally by examining how model performance decreases when they are not met. We then focus on the control over, and information about, traffic microstructures that DetGen provides, and the corresponding benefits when examining and improving model failures for overall model development.
We use three metrics to demonstrate that DetGen provides more control over, and isolation of, the generated traffic. The ground truth information DetGen provides enables us to probe two state-of-the-art traffic classifiers for failures on certain traffic structures, and the corresponding fixes in the model design almost halve the number of misclassifications. Drawing on these results, we propose CBAM, an anomaly detection model that detects network access attacks through deviations from recurring flow sequence patterns. CBAM is inspired by the design of self-supervised language models, and improves on the AUC of the current state of the art by up to 140%. By understanding why several flow sequence structures present difficulties to our model, we make targeted design decisions that address these difficulties and ultimately boost the performance of our model. Lastly, we examine how the control and adversarial perturbation of traffic microstructures can be used by an attacker to evade detection. We show that in a stepping-stone attack, an attacker can evade every current detection model by mimicking the patterns observed in streaming services.