Group01 Dataset Overview
LINK TO PROJECT GROUPS:
Please Click Here -> [1]
What is a Honeypot?
In simple terms, a honeypot is a trap for network attacks that records the IP addresses the attacks come from.
As described by Amazon Web Services (AWS)[2], a honeypot is a security mechanism intended to lure and deflect an attempted attack. AWS's honeypot is a trap point that one can insert into a website to detect inbound requests from content scrapers and bad bots. If a source accesses the honeypot, its IP address is recorded.
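To make the idea concrete, here is a minimal sketch of the trap logic described above. It is not AWS's implementation: the `Honeypot` class, the `TRAP_PATH` value, and the example IP addresses are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical trap path: a URL that legitimate visitors would never follow
# (for example, a link hidden from humans and disallowed in robots.txt).
TRAP_PATH = "/do-not-follow"


@dataclass
class Honeypot:
    """Records the source IP of every request that hits the trap path."""

    trap_path: str = TRAP_PATH
    flagged_ips: set = field(default_factory=set)

    def handle(self, source_ip: str, path: str) -> bool:
        """Return True (and record the IP) if the request touched the trap."""
        if path == self.trap_path:
            self.flagged_ips.add(source_ip)
            return True
        return False


pot = Honeypot()
pot.handle("203.0.113.7", "/index.html")       # normal page, not recorded
pot.handle("198.51.100.23", "/do-not-follow")  # trap hit, IP recorded
print(sorted(pot.flagged_ips))
```

Only the source that requested the trap path ends up in `flagged_ips`; a real deployment would typically feed such a list into a block rule or a log for later analysis.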
Overview of the AWS Honeypot Cyberattack
The test dataset for the AWS Honeypot Cyberattack analysis was retrieved from Kaggle: https://www.kaggle.com/casimian2000/aws-honeypot-attack-data/data
We use Tableau Prep to get an overview of the data before any analysis.
Analysis of Data fields
Field # | Dataset Field | Comments | Details |
---|---|---|---|
1 | datetime | Packet Arrival Date | From 03/03/2013 09:53:00 PM to 07/24/2013 07:47:00 AM; 185k entries |
2 | host | Honeypot Server | 9 categories |
3 | src | Packet Source | 70k different sources |
4 | proto | Packet Protocol Type | ICMP, TCP, UDP |
5 | type | Packet Type | 8 different types |
6 | spt | Source Port | 46k ports |
7 | dpt | Destination Port | 4k ports |
8 | srcstr | Source IP Address | 70k addresses |
9 | cc | Source Country Code | 178 countries |
10 | country | Source Country | 178 countries |
11 | locale | Source Location | 1k locations |
12 | localeabbr | Location Abbreviation | 614k entries. Note: the grouping is not effective, making this field redundant for reference |
13 | postalcode | Postal Code | 3k entries |
14 | Latitude | Source Latitude | 5k entries |
15 | Longitude | Source Longitude | 5k entries |
16 | F16 | A dummy field for those with longitude > 20,000 | There are 6 types in this category |
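The distinct-value counts in the Details column came from Tableau Prep. As a rough cross-check, a similar profile can be sketched with the Python standard library. The helper name `distinct_counts` and the sample rows below are made up for illustration; only the column names `host`, `src`, `proto` and `country` come from the table.

```python
import csv
import io


def distinct_counts(rows):
    """Return {column name: number of distinct non-empty values}."""
    seen = {}
    for row in rows:
        for column, value in row.items():
            if value not in ("", None):
                seen.setdefault(column, set()).add(value)
    return {column: len(values) for column, values in seen.items()}


# Tiny made-up sample standing in for the real CSV export.
SAMPLE = """host,src,proto,country
host-a,1001,TCP,China
host-b,1002,UDP,United States
host-a,1001,TCP,China
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
counts = distinct_counts(rows)
print(counts)
```

To run it against the real data, replace `io.StringIO(SAMPLE)` with an open file handle on the downloaded CSV.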
Acknowledgement: the interpretation of the dataset was assisted by the references below.
https://emreovunc.com/projects/honeypots_data_analysis.pdf
https://www.kaggle.com/jonathanbouchet/aws-honeypot/notebook
What to analyse?
The dataset will certainly require cleaning as the work proceeds. However, based on the analysis of the fields, it is clear that:
- The targets/destinations are 9 different servers (host)
- The attackers are from various sources around the world, identified by
  - IP addresses
  - Countries + cities
  - Postcode + geographic data
- Time log is available
We can run a few analyses:
- Basic statistics of the data
- Advanced visualisation of the data
- Animation of attacks showing "Origin vs Destination" over the time log
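The first and last of these analyses can be sketched in a few lines of standard-library Python. This is only an illustration under assumptions: the helper names, the `time` key, the `TIME_FORMAT` layout and the sample rows are hypothetical and would need adjusting to the real export; `host` and `country` are the field names from the table.

```python
from collections import Counter
from datetime import datetime

# Assumption: illustrative timestamp layout; adjust to match the real export.
TIME_FORMAT = "%m/%d/%Y %H:%M"


def top_values(rows, column, n=3):
    """Basic statistic: the n most frequent values of one column."""
    return Counter(row[column] for row in rows).most_common(n)


def flows_per_day(rows, time_column="time"):
    """Count attacks per (day, origin country, destination host)."""
    flows = Counter()
    for row in rows:
        day = datetime.strptime(row[time_column], TIME_FORMAT).date()
        flows[(day.isoformat(), row["country"], row["host"])] += 1
    return flows


# Made-up rows standing in for the cleaned dataset.
rows = [
    {"time": "03/03/2013 21:53", "host": "host-a", "country": "China"},
    {"time": "03/03/2013 22:10", "host": "host-a", "country": "China"},
    {"time": "03/04/2013 01:05", "host": "host-b", "country": "Brazil"},
]

print(top_values(rows, "country"))
flows = flows_per_day(rows)
for key in sorted(flows):
    print(key, flows[key])
```

Grouping by day gives one set of origin-to-destination counts per day, which could serve as the frames of the animation; a finer or coarser time bucket is an equally valid choice.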