The (New) Linux Kernel Driver Model
Version 0.04
Patrick Mochel <mochel@osdl.org>
03 December 2001
Overview
~~~~~~~~
This driver model is a unification of all the disparate driver models
currently in the kernel. It is intended to augment the
bus-specific drivers for bridges and devices by consolidating a set of data
and operations into globally accessible data structures.
Current driver models implement some sort of tree-like structure (sometimes
just a list) for the devices they control. But, there is no linkage between
the different bus types.
A common data structure can provide this linkage with little overhead: when a
bus driver discovers a particular device, it can insert it into the global
tree as well as its local tree. In fact, the local tree becomes just a subset
of the global tree.
Common data fields can also be moved out of the local bus models into the
global model. Some of the manipulation of these fields can also be
consolidated. Most likely, manipulation functions will become a set
of helper functions, which the bus drivers wrap around to include any
bus-specific items.
The common device and bridge interface currently reflects the goals of the
modern PC: namely the ability to do seamless Plug and Play, power management,
and hot plug. (The model dictated by Intel and Microsoft (read: ACPI) assures
us that any device in the system may fit any of these criteria.)
In reality, not every bus will be able to support such operations. But, most
buses will support a majority of those operations, and all future buses will.
In other words, a bus that doesn't support an operation is the exception,
instead of the other way around.
Drivers
~~~~~~~
The callbacks for bridges and devices are intended to be singular for a
particular type of bus. For each type of bus that has support compiled in the
kernel, there should be one statically allocated structure with the
appropriate callbacks that each device (or bridge) of that type shares.
Each bus layer should implement the callbacks for these drivers. It then
forwards the calls on to the device-specific callbacks. This means that
device-specific drivers must still implement callbacks for each operation.
But, they are not called from the top level driver layer.
This does add another layer of indirection for calling one of these functions,
but there are benefits that are believed to outweigh this slowdown.
First, it prevents device-specific drivers from having to know about the
global device layer. This speeds up integration time incredibly. It also
allows drivers to be more portable across kernel versions. Note that the
former was intentional; the latter is an added bonus.
Second, this added indirection allows the bus to perform any additional logic
necessary for its child devices. A bus layer may add additional information to
the call, or translate it into something meaningful for its children.
This could be done in the driver, but if it happens for every object of a
particular type, it is best done at a higher level.
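The indirection described above can be sketched in plain userspace C. This is
only an illustration: the "toybus" bus, the "mycard" driver and every
structure layout below are made up for the example, and the real callbacks
take more arguments than shown here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; not the kernel's definitions. */
struct device;

struct device_driver {
        int (*suspend)(struct device *dev, unsigned int state, unsigned int level);
};

struct device {
        struct device_driver *driver;   /* shared by every device on the bus */
};

/* Device-specific driver interface, known only to the (toy) bus layer. */
struct toybus_dev;

struct toybus_driver {
        int (*suspend)(struct toybus_dev *tdev, unsigned int state);
};

struct toybus_dev {
        struct toybus_driver *drv;
        unsigned int slot;              /* bus-specific detail */
        struct device device;           /* embedded generic device */
};

/* Bus-level callback: translate the generic call, then forward it. */
int toybus_suspend(struct device *dev, unsigned int state, unsigned int level)
{
        struct toybus_dev *tdev;

        assert(dev != NULL);
        tdev = (struct toybus_dev *)
                ((char *)dev - offsetof(struct toybus_dev, device));

        (void)level;                    /* stage handling omitted in this sketch */
        if (!tdev->drv || !tdev->drv->suspend)
                return 0;               /* nothing to forward to */
        return tdev->drv->suspend(tdev, state);
}

/* One statically allocated structure per bus type. */
struct device_driver toybus_device_driver = {
        .suspend = toybus_suspend,
};

/* Example device-specific driver: records the state it was asked to enter. */
unsigned int last_state_seen;

int mycard_suspend(struct toybus_dev *tdev, unsigned int state)
{
        (void)tdev;
        last_state_seen = state;
        return 0;
}

struct toybus_driver mycard_driver = { .suspend = mycard_suspend };
```

The global layer only ever sees toybus_device_driver. The device-specific
mycard_suspend() never sees a struct device, which is the point of the
first benefit listed above.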
Recap
~~~~~
Instances of devices and bridges are allocated dynamically as the system
discovers their existence. Their fields describe the individual object.
Drivers - in the global sense - are statically allocated and singular for a
particular type of bus. They describe a set of operations that every type of
bus could implement, the implementation following the bus's semantics.
Downstream Access
~~~~~~~~~~~~~~~~~
Common data fields have been moved out of individual bus layers into a common
data structure. But, these fields must still be accessed by the bus layers,
and sometimes by the device-specific drivers.
Other bus layers are encouraged to do what has been done for the PCI layer.
struct pci_dev now looks like this:
struct pci_dev {
        ...
        struct device device;
};
Note first that it is statically allocated, that is, embedded in struct
pci_dev rather than referenced through a pointer. This means only one
allocation on device discovery. Note also that it is at the _end_ of struct
pci_dev. This is to make people think about what they're doing when switching
between the bus driver and the global driver, and to prevent mindless casts
between the two.
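Because the embedded structure is not the first member, a plain cast from
struct device * to struct pci_dev * gives the wrong address. The conversion
has to subtract the member offset, in the style of the kernel's list_entry().
A minimal userspace sketch follows. The struct layouts are cut down to one
field each, the macro is defined locally, and to_pci_dev() is a name chosen
for this example only.

```c
#include <assert.h>
#include <stddef.h>

/* Cut-down stand-ins for the real structures. */
struct device {
        char bus_id[16];
};

struct pci_dev {
        unsigned int devfn;             /* stand-in for the bus-specific fields */
        struct device device;           /* embedded, and deliberately last */
};

/* Recover the enclosing structure from a pointer to one of its members. */
#define container_of(ptr, type, member) \
        ((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical helper name, used here for illustration. */
struct pci_dev *to_pci_dev(struct device *dev)
{
        assert(dev != NULL);
        return container_of(dev, struct pci_dev, device);
}
```

A bus layer that funnels every conversion through one such helper has a
single place to fix if the layout of its structure ever changes.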
The PCI bus layer freely accesses the fields of struct device. It knows about
the structure of struct pci_dev, and it should know the structure of struct
device. PCI devices that have been converted generally do not touch the fields
of struct device. More precisely, device-specific drivers should not touch
fields of struct device unless there is a compelling reason to do so.
This abstraction prevents unnecessary pain during transitional phases.
If a field is renamed or removed, then every downstream driver that touches it
will break. On the other hand, if only the bus layer (and not the device
layer) accesses struct device, only the bus layer needs to change.
User Interface
~~~~~~~~~~~~~~
By virtue of having a complete hierarchical view of all the devices in the
system, exporting a complete hierarchical view to userspace becomes relatively
easy. This has been accomplished by implementing a special purpose virtual
file system named driverfs. It is hence possible for the user to mount the
whole driverfs on a particular mount point in the unified UNIX file hierarchy.
This can be done permanently by adding the following entry to
/etc/fstab (provided that the mount point exists, of course):
none /devices driverfs defaults 0 0
Or by hand on the command line:
~: mount -t driverfs none /devices
Whenever a device is inserted into the tree, a directory is created for it.
This directory may be populated at each layer of discovery - the global layer,
the bus layer, or the device layer.
The global layer currently creates two files - 'status' and 'power'. The
former only reports the name of the device and its bus ID. The latter reports
the current power state of the device. It can also be used to set the current
power state.
The bus layer may also create files for the devices it finds while probing the
bus. For example, the PCI layer currently creates 'wake' and 'resource' files
for each PCI device.
A device-specific driver may also export files in its directory to expose
device-specific data or tunable interfaces.
These features were initially implemented using procfs. However, after one
conversation with Linus, a new filesystem - driverfs - was created to
implement these features. It is an in-memory filesystem, based heavily on
ramfs, though it uses procfs as inspiration for its callback functionality.
Each struct device has a 'struct driver_dir_entry' which encapsulates the
device's directory and the files within.
Device Structures
~~~~~~~~~~~~~~~~~
struct device {
        struct list_head bus_list;
        struct iobus *parent;
        struct iobus *subordinate;
        char name[DEVICE_NAME_SIZE];
        char bus_id[BUS_ID_SIZE];
        struct driver_dir_entry * dir;
        spinlock_t lock;
        atomic_t refcount;
        struct device_driver *driver;
        void *driver_data;
        void *platform_data;
        u32 current_state;
        unsigned char *saved_state;
};
bus_list:
List of all devices on a particular bus; i.e. the device's siblings
parent:
The parent bridge for the device.
subordinate:
If the device is a bridge itself, this points to the struct iobus that is
created for it.
name:
Human readable (descriptive) name of device. E.g. "Intel EEPro 100"
bus_id:
Parsable (yet ASCII) bus id. E.g. "00:04.00" (PCI Bus 0, Device 4, Function
0). It is necessary to have a searchable bus id for each device; making it
ASCII allows us to use it for its directory name without translating it.
dir:
Device's driverfs directory.
lock:
Device-specific lock.
refcount:
Device's usage count.
When this goes to 0, the device is assumed to be removed. It will be removed
from its parent's list of children. Its remove() callback will be called to
inform the driver to clean up after itself.
driver:
Pointer to a struct device_driver, the common operations for each device. See
next section.
driver_data:
Private data for the driver.
Much like the PCI implementation of this field, this allows device-specific
drivers to keep a pointer to device-specific data.
platform_data:
Data that the platform (firmware) provides about the device.
For example, the ACPI BIOS or EFI may have additional information about the
device that is not directly mappable to any existing kernel data structure.
It also allows the platform driver (e.g. ACPI) to pass information to a driver
without the driver having to have explicit knowledge of (atrocities like) ACPI.
current_state:
Current power state of the device. For PCI and other modern devices, this is
0-3, though it's not necessarily limited to those values.
saved_state:
Pointer to driver-specific set of saved state.
Having it here allows modules to be unloaded on system suspend and reloaded
on resume and maintain state across transitions.
It also allows generic drivers to maintain state across system state
transitions.
(I've implemented a generic PCI driver for devices that don't have a
device-specific driver. Instead of managing some vector of saved state
for each device the generic driver supports, it can simply store it here.)
struct device_driver {
        int (*probe) (struct device *dev);
        int (*remove) (struct device *dev);
        int (*suspend) (struct device *dev, u32 state, u32 level);
        int (*resume) (struct device *dev, u32 level);
};
probe:
Check for device existence and associate driver with it.
remove:
Dissociate driver from device. Releases device so that it could be used by
another driver. Also, if it is a hotplug device (hotplug PCI, Cardbus), an
ejection event could take place here.
suspend:
Perform one step of the device suspend process.
resume:
Perform one step of the device resume process.
The probe() and remove() callbacks are intended to be much simpler than the
current PCI counterparts.
probe() should do the following only:
- Check if hardware is present
- Register device interface
- Disable DMA/interrupts, etc, just in case.
Previously, some device initialisation was done in probe(). This should no
longer be the case. All initialisation should take place in the open() call
for the device.
Breaking initialisation code out must also be done for the resume() callback,
as most devices will have to be completely reinitialised when coming back from
a suspend state.
remove() should simply unregister the device interface.
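The split between probe(), open() and resume can be sketched as follows. This
is a userspace toy under loud assumptions: struct toy_hw and all the toy_*
functions are invented for the example, and "registering the interface" is
reduced to setting a flag.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy hardware state, standing in for real registers. */
struct toy_hw {
        bool present;
        bool irq_enabled;
        bool initialised;
        bool registered;
};

/* Shared by open() and the resume path, so it lives in its own function. */
void toy_hw_init(struct toy_hw *hw)
{
        hw->initialised = true;
        hw->irq_enabled = true;
}

/* probe(): presence check, register interface, quiesce. Nothing else. */
int toy_probe(struct toy_hw *hw)
{
        if (!hw->present)
                return -1;              /* no hardware: decline the device */
        hw->registered = true;          /* "register device interface" */
        hw->irq_enabled = false;        /* disable interrupts, just in case */
        return 0;
}

/* open(): the only place where full initialisation happens. */
int toy_open(struct toy_hw *hw)
{
        if (!hw->registered)
                return -1;
        toy_hw_init(hw);
        return 0;
}

/* Resume re-runs the same initialisation after power is restored. */
int toy_resume_power_on(struct toy_hw *hw)
{
        toy_hw_init(hw);
        return 0;
}

/* remove(): simply unregister the device interface. */
int toy_remove(struct toy_hw *hw)
{
        hw->registered = false;
        return 0;
}
```

Keeping toy_hw_init() separate is what lets the resume path reuse it
without going through open().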
Device power management can be quite complicated, depending on exactly what is
desired. Four scenarios sum up most of it:
- OS directed power management.
The OS takes care of notifying all drivers that a suspend is requested,
saving device state, and powering devices down.
- Firmware controlled power management.
The OS only wants to notify devices that a suspend is requested.
- Device power management.
A user wants to place only one device in a low power state, and maybe save
state.
- System reboot.
The system wants to place devices in a quiescent state before the system is
reset.
In an attempt to satisfy all of these scenarios, the power management
transition for any device is broken up into several stages - notify, save
state, and power down. The disable stage, which should happen after notify and
before save state, has been considered and may be implemented in the future.
Depending on what the system-wide policy is (usually dictated by the power
management scheme present), each driver's suspend callback may be called
multiple times, each with a different stage.
On all power management transitions, the stages should be called sequentially
(notify before save state; save state before power down). However, drivers
should not assume that any stage was called beforehand. (If a driver gets a
power down call, it shouldn't assume notify or save state was called first.)
This allows the framework to be used seamlessly by all power management
actions. Hopefully.
Resume transitions happen in a similar manner. They are broken up into two
stages currently (power on and restore state), though a third stage (enable)
may be added later.
For suspend and resume transitions, the following values are defined to denote
the stage:
enum {
        SUSPEND_NOTIFY,
        SUSPEND_SAVE_STATE,
        SUSPEND_POWER_DOWN,
};
enum {
        RESUME_POWER_ON,
        RESUME_RESTORE_STATE,
};
During a system power transition, the device tree must be walked in order,
calling the suspend() or resume() callback for each node. This may happen
several times.
Initially, this was done in kernel space. However, it has occurred to me that
doing recursion to a non-bounded depth is dangerous, and that there are a lot
of inherent race conditions in such an operation.
Non-recursive walking of the device tree is possible. However, this makes for
convoluted code.
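For reference, a non-recursive walk can be written with parent pointers
alone. The sketch below is a userspace simplification: struct node uses plain
first-child and next-sibling pointers rather than the list_head fields of the
real structures, all names are made up for the example, and it visits parents
before children (a real suspend path would have to decide which order it
needs).

```c
#include <assert.h>
#include <stddef.h>

/* Simplified node: the document's structures use list_head instead. */
struct node {
        struct node *parent;
        struct node *first_child;
        struct node *next_sibling;
        int id;
};

/* Link child as the last child of parent. */
void node_add(struct node *parent, struct node *child)
{
        struct node **slot = &parent->first_child;

        while (*slot)
                slot = &(*slot)->next_sibling;
        *slot = child;
        child->parent = parent;
}

/* Return the node that follows n in a depth-first, parents-first walk. */
struct node *node_next(struct node *root, struct node *n)
{
        if (n->first_child)
                return n->first_child;
        while (n != root) {
                if (n->next_sibling)
                        return n->next_sibling;
                n = n->parent;          /* climb until a sibling exists */
        }
        return NULL;                    /* back at the root: walk is done */
}

/* Visit every node without recursion; returns the number visited. */
int walk(struct node *root, void (*visit)(struct node *))
{
        int count = 0;
        struct node *n;

        assert(root != NULL);
        for (n = root; n; n = node_next(root, n)) {
                visit(n);
                count++;
        }
        return count;
}

/* Example visitor: records the order in which nodes were seen. */
int visit_order[8];
int visit_count;

void record_visit(struct node *n)
{
        if (visit_count < 8)
                visit_order[visit_count++] = n->id;
}
```

The stack depth is constant regardless of how deep the tree is, but the
walk is only safe while nothing else modifies the tree, so the race
conditions mentioned above remain.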
No matter what, if the transition happens in kernel space, it is difficult to
gracefully recover from errors or to implement a policy that prevents one from
shutting down the device(s) you want to save state to.
Instead, the walking of the device tree has been moved to userspace. When a
user requests the system to suspend, a userspace utility walks the device tree,
as exported via driverfs, and tells each device to go to sleep. It does this
multiple times, depending on the system policy.
[ FIXME: URL pointer to the corresponding utility is missing here! ]
Device resume should happen in the same manner when the system awakens.
Each suspend stage is described below:
SUSPEND_NOTIFY:
This level notifies the driver that it is going to sleep. If it knows that it
cannot resume the hardware from the requested level, or it feels that it is
too important to be put to sleep, it should return an error from this function.
It does not have to stop I/O requests or actually save state at this point.
SUSPEND_DISABLE:
The driver should stop taking I/O requests at this stage. Because the save
state stage happens afterwards, the driver may not want to physically disable
the device; only mark itself unavailable if possible.
SUSPEND_SAVE_STATE:
The driver should allocate memory and save any device state that is relevant
for the state it is going to enter.
SUSPEND_POWER_DOWN:
The driver should place the device in the power state requested.
For resume, the stages are defined as follows:
RESUME_POWER_ON:
Devices should be powered on and reinitialised to some known working state.
RESUME_RESTORE_STATE:
The driver should restore device state to its pre-suspend state and free any
memory allocated for its saved state.
RESUME_ENABLE:
The device should start taking I/O requests again.
A driver does not have to implement every stage. But if it does
implement a stage, it should do what is described above. It should not assume
that it performed any stage previously, or that it will perform any stage
later.
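A pair of callbacks written to these rules might look roughly like this. It
is a userspace sketch, not driver code: struct toy_dev, its four fake
registers and the busy flag are assumptions made for the example, and the
callbacks take the toy structure directly instead of a struct device.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

enum { SUSPEND_NOTIFY, SUSPEND_SAVE_STATE, SUSPEND_POWER_DOWN };
enum { RESUME_POWER_ON, RESUME_RESTORE_STATE };

/* Hypothetical toy device: four "registers" and a power state. */
struct toy_dev {
        unsigned int regs[4];
        unsigned int current_state;
        unsigned char *saved_state;
        int busy;                       /* nonzero: refuses to sleep */
};

int toy_suspend(struct toy_dev *dev, unsigned int state, unsigned int level)
{
        switch (level) {
        case SUSPEND_NOTIFY:
                return dev->busy ? -1 : 0;      /* veto here, not later */
        case SUSPEND_SAVE_STATE:
                if (!dev->saved_state)
                        dev->saved_state = malloc(sizeof(dev->regs));
                if (!dev->saved_state)
                        return -1;
                memcpy(dev->saved_state, dev->regs, sizeof(dev->regs));
                return 0;
        case SUSPEND_POWER_DOWN:
                /* Obey even if no earlier stage ran (e.g. on reboot). */
                dev->current_state = state;
                return 0;
        default:
                return 0;               /* unimplemented stage: ignore */
        }
}

int toy_resume(struct toy_dev *dev, unsigned int level)
{
        switch (level) {
        case RESUME_POWER_ON:
                dev->current_state = 0;
                memset(dev->regs, 0, sizeof(dev->regs)); /* known state */
                return 0;
        case RESUME_RESTORE_STATE:
                if (!dev->saved_state)
                        return 0;       /* save state may never have run */
                memcpy(dev->regs, dev->saved_state, sizeof(dev->regs));
                free(dev->saved_state);
                dev->saved_state = NULL;
                return 0;
        default:
                return 0;
        }
}
```

Each case checks what it needs for itself (for instance whether saved_state
exists) instead of relying on an earlier stage having set it up.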
It is quite possible that a driver can fail during the suspend process, for
whatever reason. In this event, the calling process must gracefully recover
and restore everything to its state before the suspend transition began.
If a driver knows that it cannot suspend or resume properly, it should fail
during the notify stage. Properly implemented power management schemes should
make sure that this is the first stage that is called.
If a driver gets a power down request, it should obey it, as it may very
likely be during a reboot.
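The caller's side of this contract (stages in order, notify first, undo on
failure) can be modelled in a few lines. The sketch below is an in-memory
toy, not the userspace utility: the devices live in an array instead of
driverfs, SUSPEND_NSTAGES is a counter added for the loop, and toy_undo()
stands in for whatever resume sequence a real policy would run.

```c
#include <assert.h>
#include <stddef.h>

/* SUSPEND_NSTAGES is an addition for this sketch, not a defined stage. */
enum { SUSPEND_NOTIFY, SUSPEND_SAVE_STATE, SUSPEND_POWER_DOWN, SUSPEND_NSTAGES };

/* Hypothetical toy device for the sketch. */
struct toy_dev {
        int fail_at;            /* stage at which suspend fails, or -1 */
        int stages_done;        /* how many suspend stages succeeded */
};

int toy_suspend(struct toy_dev *dev, int level)
{
        if (level == dev->fail_at)
                return -1;
        dev->stages_done++;
        return 0;
}

/* Stand-in for the resume path: put the device back as it was. */
void toy_undo(struct toy_dev *dev)
{
        dev->stages_done = 0;
}

/*
 * Run every stage over every device, notify first. On the first failure,
 * undo all devices and report the error.
 */
int suspend_all(struct toy_dev *devs, size_t n)
{
        int level;
        size_t i, j;

        assert(devs != NULL);
        for (level = SUSPEND_NOTIFY; level < SUSPEND_NSTAGES; level++) {
                for (i = 0; i < n; i++) {
                        if (toy_suspend(&devs[i], level) == 0)
                                continue;
                        for (j = 0; j < n; j++)
                                toy_undo(&devs[j]);
                        return -1;
                }
        }
        return 0;
}
```

Running notify over every device before any device saves state is what
makes a veto cheap: nothing has been changed yet when it arrives.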
Bus Structures
~~~~~~~~~~~~~~
struct iobus {
        struct list_head node;
        struct iobus *parent;
        struct list_head children;
        struct list_head devices;
        struct list_head bus_list;
        spinlock_t lock;
        atomic_t refcount;
        struct device *self;
        struct driver_dir_entry * dir;
        char name[DEVICE_NAME_SIZE];
        char bus_id[BUS_ID_SIZE];
        struct iobus_driver *driver;
};
node:
Bus's node in sibling list (its parent's list of child buses).
parent:
Pointer to parent bridge.
children:
List of subordinate buses.
In the children, this correlates to their 'node' field.
devices:
List of devices on the bus this bridge controls.
This field corresponds to the 'bus_list' field in each child device.
bus_list:
Each type of bus keeps a list of all bridges that it finds. This is the
bridge's entry in that list.
self:
Pointer to the struct device for this bridge.
lock:
Lock for the bus.
refcount:
Usage count for the bus.
dir:
Driverfs directory.
name:
Human readable ASCII name of bus.
bus_id:
Machine readable (though ASCII) description of position on parent bus.
driver:
Pointer to operations for bus.
struct iobus_driver {
        char name[16];
        struct list_head node;
        int (*scan) (struct iobus*);
        int (*add_device) (struct iobus*, char*);
};
name:
ASCII name of bus.
node:
List of buses of this type in system.
scan:
Search the bus for new devices. This may happen either at boot - where every
device discovered will be new - or later on - in which there may only be a few
(or no) new devices.
add_device:
Trigger a device insertion at a particular location.
The API
~~~~~~~
There are several functions exported by the global device layer, including
several optional helper functions, written solely to try and make your life
easier.
void device_init_dev(struct device * dev);
Initialise a device structure. It first zeros the device, then initialises all
of the lists. (Note that this would have been called device_init(), but that
name was already taken. :/)
struct device * device_alloc(void);
Allocate memory for a device structure and initialise it.
First, allocates memory, then calls device_init_dev() with the new pointer.
int device_register(struct device * dev);
Register a device with the global device layer.
The bus layer should call this function upon device discovery, e.g. when
probing the bus.
dev should be fully initialised when this is called.
If dev->parent is not set, it sets its parent to be the device root.
It then does the following:
- inserts it into its parent's list of children
- creates a driverfs directory for it
- creates a set of default files for the device in its directory
- calls platform_notify() to notify the firmware driver of its existence.
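The registration steps listed above can be modelled in userspace as follows.
Everything here is a simplification made for illustration: the toy_* names
are not the real functions, the lists are singly linked, "creating a
directory" only computes a path under the /devices mount point used earlier,
and the default files are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define BUS_ID_SIZE 16
#define PATH_SIZE   128

/* Simplified stand-ins: singly linked lists instead of list_head. */
struct toy_iobus;

struct toy_device {
        struct toy_device *next;        /* sibling link (bus_list) */
        struct toy_iobus *parent;
        char bus_id[BUS_ID_SIZE];
        char dir[PATH_SIZE];            /* stand-in for driver_dir_entry */
};

struct toy_iobus {
        struct toy_device *devices;
        char dir[PATH_SIZE];
};

struct toy_iobus toy_device_root = { NULL, "/devices" };
int toy_platform_notified;              /* counts notifications */

void toy_platform_notify(struct toy_device *dev)
{
        (void)dev;
        toy_platform_notified++;
}

int toy_device_register(struct toy_device *dev)
{
        if (!dev || strlen(dev->bus_id) == 0)
                return -1;              /* must be initialised by the bus */
        if (!dev->parent)
                dev->parent = &toy_device_root;

        /* Insert into the parent's list of children. */
        dev->next = dev->parent->devices;
        dev->parent->devices = dev;

        /* "Create" the directory: here we only compute its path. */
        snprintf(dev->dir, sizeof(dev->dir), "%s/%s",
                 dev->parent->dir, dev->bus_id);

        toy_platform_notify(dev);
        return 0;
}
```

Using the ASCII bus_id directly as the directory name is what the bus_id
description above refers to: no translation step is needed.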
void get_device(struct device * dev);
Increment the refcount for a device.
int valid_device(struct device * dev);
Check if reference count is positive for a device (it's not waiting to be
freed). If it is positive, it increments the reference count for the device.
It returns whether or not the device is usable.
void put_device(struct device * dev);
Decrement the reference count for the device. If it hits 0, it removes the
device from its parent's list of children and calls the remove() callback for
the device.
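The reference counting rules for get, valid and put can be sketched with C11
atomics. This is a userspace approximation only: the kernel uses atomic_t
rather than <stdatomic.h>, the toy_* names are invented here, and the removal
work is reduced to setting a flag.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical miniature of struct device: only the refcount matters here. */
struct toy_device {
        atomic_int refcount;
        int removed;                    /* set once removal has run */
};

/* A freshly initialised device starts with one reference. */
void toy_device_init(struct toy_device *dev)
{
        atomic_init(&dev->refcount, 1);
        dev->removed = 0;
}

void toy_get_device(struct toy_device *dev)
{
        atomic_fetch_add(&dev->refcount, 1);
}

/* Take a reference only if the device is still alive; returns 1 if usable. */
int toy_valid_device(struct toy_device *dev)
{
        int old = atomic_load(&dev->refcount);

        while (old > 0) {
                /* On failure, old is reloaded and the check repeats. */
                if (atomic_compare_exchange_weak(&dev->refcount, &old, old + 1))
                        return 1;
        }
        return 0;
}

void toy_put_device(struct toy_device *dev)
{
        /* atomic_fetch_sub returns the value before the decrement. */
        if (atomic_fetch_sub(&dev->refcount, 1) == 1)
                dev->removed = 1;       /* unlink and call remove() here */
}
```

The compare-and-exchange loop in toy_valid_device() is one way to make
"check that it is positive, then increment" a single step, so a device whose
count has already reached 0 is never revived.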
void lock_device(struct device * dev);
Take the spinlock for the device.
void unlock_device(struct device * dev);
Release the spinlock for the device.
void iobus_init(struct iobus * iobus);
struct iobus * iobus_alloc(void);
int iobus_register(struct iobus * iobus);
void get_iobus(struct iobus * iobus);
int valid_iobus(struct iobus * iobus);
void put_iobus(struct iobus * iobus);
void lock_iobus(struct iobus * iobus);
void unlock_iobus(struct iobus * iobus);
These functions provide the same functionality as the device_*
counterparts, only operating on a struct iobus. One important thing to note,
though, is that iobus_register() and iobus_unregister() operate recursively. It
is possible to add an entire tree in one call.
int device_driver_init(void);
Main initialisation routine.
This makes sure driverfs is up and running and initialises the device tree.
void device_driver_exit(void);
This frees up the device tree.
Credits
~~~~~~~
The following people have been extremely helpful in solidifying this document
and the driver model.
Randy Dunlap rddunlap@osdl.org
Jeff Garzik jgarzik@mandrakesoft.com
Ben Herrenschmidt benh@kernel.crashing.org