Epoll tutorial and understanding the overlooked epoll_data_t

Epoll is a Linux kernel mechanism, exposed through the C library, for monitoring multiple file descriptors for I/O. Let's say you have a server program that holds its open connections as file descriptors. Epoll offers a more performant way of watching all these file descriptors for activity than some other options, such as select(). Several epoll tutorials exist on the internet, but many fail to explain WHY things happen the way they do. So allow me to walk you through a small example program that uses epoll.

The example program

The example program is available as a GitHub gist:

https://gist.github.com/usvi/556906ef7764a6d8b2bda5853e3a06e2

Please note that the program is only a crude demonstration meant to get the basic concepts across. It lacks the proper accounting, error handling and deinitialization that should be present in actual programs. We will mostly be looking at two epoll functions, namely epoll_ctl() and epoll_wait().

The test program is also presented below for easier access:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <semaphore.h>
#include <pthread.h>
#include <sys/epoll.h>

// gcc -Wall epoll_test.c -o epoll_test -lpthread
// ./epoll_test

#define PIPE_READ (0)
#define PIPE_WRITE (1)
#define MAP_TO_FD (123)

sem_t gx_main_sem;
int gai_pipe[2];


static void* pvThreadBody(void* pv_arg_data)
{
  int i_epoll_fd = -1;
  int i_wait_ret = -1;
  struct epoll_event x_epoll_temp_event;

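  // Create the epoll instance
  i_epoll_fd = epoll_create1(0);
  memset(&x_epoll_temp_event, 0, sizeof(struct epoll_event));
  // Ask for readability notifications and attach arbitrary user data to them
  x_epoll_temp_event.events = EPOLLIN;
  x_epoll_temp_event.data.fd = MAP_TO_FD;
  printf("Doing epoll_ctl with arg fd=%d and event data fd=%d\n",
    gai_pipe[PIPE_READ], x_epoll_temp_event.data.fd);
  // The third argument is the descriptor that actually gets monitored
  epoll_ctl(i_epoll_fd, EPOLL_CTL_ADD, gai_pipe[PIPE_READ], &x_epoll_temp_event);

  // Clear the struct so that anything printed later was filled in by epoll_wait
  memset(&x_epoll_temp_event, 0, sizeof(struct epoll_event));
  // Let main() know that the pipe is now being monitored
  sem_post(&gx_main_sem);
  // Block until one event arrives (timeout -1 means wait indefinitely)
  i_wait_ret = epoll_wait(i_epoll_fd, &x_epoll_temp_event, 1, -1);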

  printf("Epoll wait result %d, event data fd=%d\n",
    i_wait_ret, x_epoll_temp_event.data.fd);

  return NULL;
}


int main()
{
  pthread_t x_evl_thread;
  sem_init(&gx_main_sem, 0, 0);
  pipe(gai_pipe);
  pthread_create(&x_evl_thread, NULL, pvThreadBody, NULL);
  sem_wait(&gx_main_sem);
  printf("Writing to pipe, write fd=%d, read fd=%d\n",
    gai_pipe[PIPE_WRITE], gai_pipe[PIPE_READ]);
  write(gai_pipe[PIPE_WRITE], "Hello\n", strlen("Hello\n"));

  pthread_join(x_evl_thread, NULL);

  return 0;
}

This is essentially what the program does:

  1. Creates a pipe with one readable end and one writable end
  2. Creates a thread to receive pipe data
  3. Sets up the epoll mechanism in the thread and waits for data
  4. Writes to the pipe, whose read end is monitored by epoll in the thread

The main function is not that interesting, but it is nonetheless described in the following picture.

The thread body function is more interesting. It is described below.

Running the test program yields the following output:

Doing epoll_ctl with arg fd=3 and event data fd=123
Writing to pipe, write fd=4, read fd=3
Epoll wait result 1, event data fd=123

But wait! The pipe file descriptor is 3, so why are we seeing 123 in the result? Because we told epoll to do so. When we added the file descriptor gai_pipe[PIPE_READ], we also passed in a struct epoll_event as a parameter. Its definition, together with the epoll_data_t union it contains, is:

typedef union epoll_data {
  void *ptr;
  int fd;
  uint32_t u32;
  uint64_t u64;
} epoll_data_t;

struct epoll_event {
  uint32_t events; /* Epoll events */
  epoll_data_t data; /* User data variable */
};

So, in the uint32_t events field we set EPOLLIN, that is, we asked to be notified when the file descriptor is readable. struct epoll_event also carries the field epoll_data_t data. It is exactly in this epoll_data_t that we set the arbitrary file descriptor value to the symbol MAP_TO_FD, or 123. Epoll did just what we asked for: we asked it to return 123 in the user data file descriptor whenever it noticed activity on the monitored file descriptor 3 of the pipe.

As we can see, we can set any one of the epoll_data_t fields to a value of our liking. Because epoll_data_t is a union, its fields share the same storage, so only one of them can be used at a time. We could even set the void pointer ptr to arbitrary custom data and retrieve it later in a matching handler, provided the type is always what the handler expects. This is a very convenient mechanism in some contexts.
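To make the ptr variant concrete, here is a minimal sketch, not taken from the example program. The struct conn_ctx and both helper function names are made up for illustration. Since ptr and fd live in the same union, the real descriptor is stored inside the context struct instead of data.fd.

```c
#include <string.h>
#include <sys/epoll.h>

/* Hypothetical per-descriptor context; name and fields are illustrative only. */
struct conn_ctx
{
  int i_fd;             /* the real descriptor, kept here because data.ptr replaces data.fd */
  const char* pc_label; /* any extra state the handler wants to get back */
};

/* Registers px_ctx->i_fd for EPOLLIN with data.ptr pointing at the context.
   Returns 0 on success and -1 on error, like epoll_ctl() itself. */
int iWatchWithContext(int i_epoll_fd, struct conn_ctx* px_ctx)
{
  struct epoll_event x_event;

  memset(&x_event, 0, sizeof(x_event));
  x_event.events = EPOLLIN;
  x_event.data.ptr = px_ctx; /* union member: do not set data.fd as well */

  return epoll_ctl(i_epoll_fd, EPOLL_CTL_ADD, px_ctx->i_fd, &x_event);
}

/* Waits up to i_timeout_ms for a single event and returns the context that
   was registered with it, or NULL on timeout or error. */
struct conn_ctx* pxWaitForContext(int i_epoll_fd, int i_timeout_ms)
{
  struct epoll_event x_event;

  memset(&x_event, 0, sizeof(x_event));

  if (epoll_wait(i_epoll_fd, &x_event, 1, i_timeout_ms) != 1)
  {
    return NULL;
  }
  /* The cast is only safe if every registration in this set used a conn_ctx */
  return (struct conn_ctx*)x_event.data.ptr;
}
```

With the pipe from the example program, one would fill in a struct conn_ctx with gai_pipe[PIPE_READ], pass it to iWatchWithContext(), and get the very same pointer back from pxWaitForContext() once the pipe becomes readable. The context must stay alive for as long as the descriptor is in the set.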

Sidenote: epoll_ctl() and EPOLL_CTL_ADD vs. EPOLL_CTL_MOD

There is one additional thing that causes confusion in some epoll_ctl() usage scenarios. Let's say you added a file descriptor via epoll_ctl() with the EPOLL_CTL_ADD opcode, but did not specify any events. The result is that the epoll set contains the file descriptor, but epoll_wait() reports nothing for it apart from the error and hang-up conditions (EPOLLERR and EPOLLHUP), which epoll always reports. Now the confusing thing is that if you try to add the same file descriptor again, this time actually specifying something in the .events field of struct epoll_event, the operation will fail with errno set to EEXIST!

Why is this? The file descriptor is already in the set and cannot be added twice. You need to use the EPOLL_CTL_MOD opcode instead. The epoll API also offers no call for listing the file descriptors already in the set.
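The behaviour is easy to reproduce. The following is a small sketch with two helper functions whose names are invented for this illustration: the first adds a descriptor with an empty event mask and then tries to add it again, the second switches the existing registration to EPOLLIN with EPOLL_CTL_MOD.

```c
#include <errno.h>
#include <string.h>
#include <sys/epoll.h>

/* Adds i_fd with no events, then tries to add it a second time with EPOLLIN.
   Returns the errno of the second add, 0 if the second add unexpectedly
   succeeded, or -1 if even the first add failed. */
int iErrnoOfSecondAdd(int i_epoll_fd, int i_fd)
{
  struct epoll_event x_event;

  memset(&x_event, 0, sizeof(x_event));
  x_event.events = 0; /* registered, but no readiness events requested */
  x_event.data.fd = i_fd;

  if (epoll_ctl(i_epoll_fd, EPOLL_CTL_ADD, i_fd, &x_event) != 0)
  {
    return -1;
  }
  x_event.events = EPOLLIN;

  if (epoll_ctl(i_epoll_fd, EPOLL_CTL_ADD, i_fd, &x_event) == 0)
  {
    return 0;
  }
  return errno; /* EEXIST is the expected value here */
}

/* Changes the event mask of an already registered descriptor to EPOLLIN.
   Returns 0 on success and -1 on error. */
int iSwitchToReadEvents(int i_epoll_fd, int i_fd)
{
  struct epoll_event x_event;

  memset(&x_event, 0, sizeof(x_event));
  x_event.events = EPOLLIN;
  x_event.data.fd = i_fd;

  return epoll_ctl(i_epoll_fd, EPOLL_CTL_MOD, i_fd, &x_event);
}
```

Called with a fresh epoll instance and the read end of a pipe, iErrnoOfSecondAdd() should come back with EEXIST, after which iSwitchToReadEvents() succeeds and epoll_wait() starts reporting readability for the descriptor.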

The implicit consequence is that you need to keep your own accounting of which descriptors are in the set, so that you know when to use EPOLL_CTL_ADD and when to use EPOLL_CTL_MOD, and of course which .events epoll should watch for each file descriptor.
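One possible shape for such accounting is sketched below. It is only an illustration under simplifying assumptions: descriptor values are assumed to stay below a small fixed limit (MAX_TRACKED_FDS), the table is used from a single thread, and struct watch_table and its functions are names invented here, not part of the epoll API.

```c
#include <stdint.h>
#include <string.h>
#include <sys/epoll.h>

#define MAX_TRACKED_FDS (1024) /* assumption: descriptor values stay below this */

/* Hypothetical bookkeeping: remembers which descriptors are in the epoll set */
struct watch_table
{
  int i_epoll_fd;
  unsigned char auc_registered[MAX_TRACKED_FDS];
};

void vWatchTableInit(struct watch_table* px_table, int i_epoll_fd)
{
  memset(px_table, 0, sizeof(*px_table));
  px_table->i_epoll_fd = i_epoll_fd;
}

/* Sets the event mask for i_fd, picking ADD or MOD from the bookkeeping.
   Returns 0 on success and -1 on error. */
int iWatchTableSetEvents(struct watch_table* px_table, int i_fd, uint32_t u32_events)
{
  struct epoll_event x_event;
  int i_op;

  if (i_fd < 0 || i_fd >= MAX_TRACKED_FDS)
  {
    return -1;
  }
  memset(&x_event, 0, sizeof(x_event));
  x_event.events = u32_events;
  x_event.data.fd = i_fd;
  i_op = px_table->auc_registered[i_fd] ? EPOLL_CTL_MOD : EPOLL_CTL_ADD;

  if (epoll_ctl(px_table->i_epoll_fd, i_op, i_fd, &x_event) != 0)
  {
    return -1;
  }
  px_table->auc_registered[i_fd] = 1;

  return 0;
}

/* Removes i_fd from the set and from the bookkeeping.
   Returns 0 on success and -1 if it was not tracked or removal failed. */
int iWatchTableRemove(struct watch_table* px_table, int i_fd)
{
  if (i_fd < 0 || i_fd >= MAX_TRACKED_FDS || !px_table->auc_registered[i_fd])
  {
    return -1;
  }
  if (epoll_ctl(px_table->i_epoll_fd, EPOLL_CTL_DEL, i_fd, NULL) != 0)
  {
    return -1;
  }
  px_table->auc_registered[i_fd] = 0;

  return 0;
}
```

A caller would go through iWatchTableSetEvents() for every change of interest and never call epoll_ctl() directly, so the table and the kernel's view stay in step. Remember to call iWatchTableRemove() before closing a descriptor, since descriptor numbers get reused.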

Closing words

That's all. I hope you got a grip on what epoll is and how it is used. And be sure to use proper error handling, accounting and teardown in your own programs that use epoll.

Janne Paalijarvi
2023-08-13 Korsholm
