Processing Apache and Nginx Access Logs
Tools such as AWStats and Logstalgia are great, but sometimes they can be overkill for the problem we are trying to solve. It turns out that with a couple of simple Unix commands we can gather a lot of useful information from the data stored in our access logs. Both Apache and Nginx use the combined log format in most stock configurations, and this post's examples are based on it. Below are two different methods of accessing either a single uncompressed file or multiple compressed files (following the supplied wildcard pattern).
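The two access methods typically boil down to `cat` for the current log and `zcat` (or the equivalent `gzip -dc`) for rotated, compressed logs. As a minimal sketch, the hypothetical helper below wraps both so the output can be piped into any of the later commands; the function name and the example paths are assumptions, not part of any standard tool.

```shell
# Hypothetical helper: print every log file given as an argument,
# decompressing the ones that end in .gz on the fly.
read_logs() {
  for f in "$@"; do
    case "$f" in
      *.gz) gzip -dc "$f" ;;  # rotated, compressed log (same as zcat)
      *)    cat "$f" ;;       # current, uncompressed log
    esac
  done
}
```

For example, `read_logs /var/log/nginx/access.log` reads a single file, while `read_logs /var/log/nginx/access.log.*.gz` lets the shell expand the wildcard and reads every matching compressed file in turn.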
Log Data to Information
404 Request Responses
Using ‘awk’ we are able to filter the access log entries down to only the ones that carry the 404 status code. In this case we then display the most requested pages of this type, in descending order.
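One way to write such a filter is sketched here, assuming the combined log format, where the whitespace-separated field 9 is the status code and field 7 is the requested path. The function name `top_404` is hypothetical, and it reads log lines from standard input.

```shell
# Hypothetical filter: list the paths that returned 404, most requested first.
top_404() {
  awk '$9 == 404 { print $7 }' |  # keep only 404 entries, print the path
    sort |                        # group identical paths together
    uniq -c |                     # prefix each path with its count
    sort -rn                      # highest count first
}
```

It can be fed from a plain file with `top_404 < access.log` or from compressed files with `zcat access.log.*.gz | top_404`.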
Request User Agents
The command above displays the top 25 user agents (browser, operating system) that have requested a resource from the web server. The ‘awk’ command in this instance uses the double quote character (") as its field separator to successfully parse the user agent string.
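A minimal sketch of such a command, assuming the combined log format: splitting each line on the double quote makes the user agent the sixth field, after the request line (field 2) and the referer (field 4). The function name `top_user_agents` is hypothetical.

```shell
# Hypothetical filter: top 25 user agents by number of requests.
top_user_agents() {
  awk -F'"' '{ print $6 }' |  # field 6 is the quoted user agent string
    sort |
    uniq -c |
    sort -rn |
    head -n 25
}
```

Usage follows the same pattern as before, for example `top_user_agents < access.log`.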
Request IP Addresses
Above are two examples of displaying the top 25 IP addresses based on total requests. The second command uses the GeoIP package to include the country the IP address originates from. This package can be installed on a CentOS setup using the EPEL repository.
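Both variants can be sketched as follows, assuming the client IP is the first field of each line and that the `geoiplookup` command from the GeoIP package is installed. The function names are hypothetical, and the parsing assumes `geoiplookup` prints a line of the form `GeoIP Country Edition: US, United States`.

```shell
# Hypothetical filter: top 25 client IP addresses by number of requests.
top_ips() {
  awk '{ print $1 }' | sort | uniq -c | sort -rn | head -n 25
}

# Same list, with the country reported by geoiplookup appended to each row.
top_ips_geo() {
  top_ips | while read -r count ip; do
    # Keep everything after the first colon of the first output line.
    country=$(geoiplookup "$ip" | head -n 1 | cut -d: -f2-)
    printf '%s %s%s\n' "$count" "$ip" "$country"
  done
}
```

Either function reads log lines from standard input, for example `top_ips_geo < access.log`.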
Count Unique Visits
The commands above will provide you with a total count of unique visits based on IP address. You can also run the log file through the ‘grep’ command before processing, so that only entries from today or from this month are included.
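A rough sketch of the counting step and the two date filters, assuming timestamps in the combined log format such as `[10/Oct/2023:13:55:36 +0000]`. All three function names are hypothetical; `LC_ALL=C` is set so that `date` emits the English month abbreviations used in the logs.

```shell
# Hypothetical filter: number of distinct client IP addresses.
count_unique_ips() {
  awk '{ print $1 }' | sort -u | wc -l
}

# Keep only entries stamped with today's date, e.g. 10/Oct/2023.
today_only() {
  grep "$(LC_ALL=C date '+%d/%b/%Y')"
}

# Keep only entries from the current month, e.g. /Oct/2023:
this_month_only() {
  grep "$(LC_ALL=C date '+/%b/%Y:')"
}
```

The filters chain together, for example `today_only < access.log | count_unique_ips`.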
Ranked by Response Codes
This simple command gives a quick tally of requests for each returned response code, ranked from most to least frequent.
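A minimal sketch of this tally, again assuming the status code is the ninth whitespace-separated field; the function name `count_status_codes` is hypothetical.

```shell
# Hypothetical filter: request totals per HTTP status code, highest first.
count_status_codes() {
  awk '{ print $9 }' | sort | uniq -c | sort -rn
}
```

As with the other filters, it reads from standard input: `count_status_codes < access.log`.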
Most Popular URLs
This is a trivial replacement for some Google Analytics statistics: it reports how many hits each of the top 25 resources has tallied.
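Such a report might look like the sketch below, which assumes the requested path is the seventh whitespace-separated field and counts every request regardless of status code. The function name `top_urls` is hypothetical.

```shell
# Hypothetical filter: the 25 most requested paths with their hit counts.
top_urls() {
  awk '{ print $7 }' | sort | uniq -c | sort -rn | head -n 25
}
```

For example, `zcat access.log.*.gz | top_urls` covers all rotated logs at once.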
Real-time IP-Page Requests
The final two commands are probably my favorite, as they provide me with real-time access information. They report the IP address, request and response of each hit as it occurs on the server. Using tailf instead of a typical ‘tail -f’ has the benefit of not accessing the file when it is not growing.
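A rough sketch of a real-time view, assuming the combined log format. It uses `tail -f` because tailf may not be available on newer distributions (it is no longer shipped with recent util-linux releases). The function names `format_requests` and `watch_requests` are hypothetical, and the column layout is just one possible choice.

```shell
# Hypothetical formatter: print IP, method, status code and path per request.
format_requests() {
  # substr($6, 2) drops the leading double quote from the method, e.g. "GET.
  awk '{ printf "%-15s %-7s %s %s\n", $1, substr($6, 2), $9, $7 }'
}

# Follow a log file and print each new request as it arrives.
watch_requests() {
  tail -f "$1" | format_requests
}
```

Running `watch_requests /var/log/nginx/access.log` (the path is an assumption) keeps printing new entries until interrupted with Ctrl-C.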