Optimizing Your Network Design with the NP6 Platform

By Glen Kemp | September 01, 2014

With the recent announcement of the new FortiGate 300D, 500D, and new blades for the 5000 series chassis, Fortinet continues our roll-out of the NP6 processor across the FortiGate portfolio. The NP6 processor is the "beating heart" of the FortiGate firewall; like something from science fiction, there is often more than one of them. While the headline performance figures for the NP6 are impressive, good design and pragmatic implementation are the keys to a successful deployment.

What Makes the NP6 Special or Important?

The demands placed upon a modern firewall are diverse. Different processors are better suited to handle different types of traffic, with some tasks such as anti-virus scanning best handled by a general-purpose processor. Threats and counter-measures are constantly evolving, requiring constant tweaks to the engine. Conversely, multicast protocols can generate large volumes of data, but computationally speaking only a limited number of operations need to be supported - this makes them ideal for optimization in hardware.

With this in mind, the FortiGate is considered an ASIC firewall - an Application Specific Integrated Circuit (ASIC) performs the heavy lifting of network throughput. However, the NP6 ASIC is only half of the picture. There are in fact two distinct ASIC designs in most FortiGate firewalls:

  • A Network Processor (NP) that accelerates IPv4, IPv6, IPsec encryption, and multicast traffic. The current generation is the NP6 and replaces the widely-deployed NP4.
  • A Content Processor (CP) that offloads a variety of CPU-intensive security services used by the IPS, bulk crypto, and authentication engines. The current generation CP8 is deployed in many Fortinet products.

In addition to the ASICs, most devices also carry a general-purpose processor with multiple, and sometimes many, cores. Where low power consumption and a smaller footprint are bigger considerations than throughput, the entry-level and mid-range appliances contain either a “System on a Chip” (SoC) or an NP4Lite (a “diet” version of the NP4 processor), usually in conjunction with a content processor (CP) and/or a general-purpose processor. However, the recently released 300D, 500D, and new 5000 series blades join the existing 1500D and 3700D firewalls with the “perfect trifecta” of the NP6, CP8, and conventional CPU.

For appliances such as the 1500D and 3700D that house multiple NP6 processors, some consideration must be given to the physical characteristics. Each NP6 has a maximum throughput of 40Gbps and a capacity of 10 million sessions; it’s easy to understand, then, how the 1500D platform achieves 80Gbps and the 3700D 160Gbps. Once a session has been established through the firewall, the CPU offloads it to the NPU, and all subsequent packets in the flow are directed into the NPU. This is the fast-path technique used, in various flavours, by many firewall vendors. Once a session is in the fast-path, the general-purpose CPU is free for tasks more complex than simple traffic forwarding; some flows will still be directed to the CPU and the content processors. It is relatively straightforward to balance sessions across those processors to ensure optimal use of the available resources. It is, however, more difficult to balance ingress traffic across the network processors, as they are directly attached to the internal switch fabric.

Consider the network interfaces on the firewall to be lanes on a motorway/freeway/autobahn approaching a major city. Each city has a fixed number of lanes with congestion management. While it’s possible to manage the traffic within the lanes up to a point, if traffic isn’t evenly distributed across the lanes and the cities, one can become congested while another idles. If an NPU becomes saturated (for example, four 2x10Gbps 802.3ad link aggregation groups (LAGs) attached to a single NPU), new sessions cannot be installed on that NPU and are instead serviced by the CPU. The firewall still functions as it should, but performance is inevitably suboptimal.
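If you suspect an NPU is running out of steam, a quick way to check whether a given flow is in the fast-path is the session list. As a sketch (the exact field names vary between FortiOS versions, and the filter value here is purely illustrative):

    diagnose sys session filter dport 443
    diagnose sys session list

In my experience, a session that has been offloaded to an NP6 shows an "npu info" line with a non-zero offload value, while a session stuck on the CPU shows a "no_ofld_reason" field explaining why it could not be installed on the NPU.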

VDOMs and LAGs

Another important operational consideration is the VDOM and LAG configuration. Many customers choose VDOMs as an easy (and free!) method of virtualizing traffic on their firewalls. A common deployment method is to use one VDOM per application or tenant. For traffic transiting the firewall, it is better to arrive and depart from the same NPU, for the same reason that in air travel you usually want to return to the same city you departed from: to avoid an additional “hop” between cities/NPUs. There are, of course, circumstances where traffic must cross between NPUs; hardware-accelerated inter-VDOM links are provided expressly for this purpose. For maximum performance, however, north-south traffic should cross a single NPU. Where LAGs are used, the interface pairs should be sequential for optimal hardware acceleration. For example, LAG1 should use port1 and port2; LAG2 should use ports 3 and 4. You get the idea.
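As a sketch, the sequential pairing described above looks something like this in the CLI (the LAG names are illustrative):

    config system interface
        edit "LAG1"
            set type aggregate
            set member "port1" "port2"
        next
        edit "LAG2"
            set type aggregate
            set member "port3" "port4"
        next
    end

Pairing adjacent ports this way keeps each LAG attached to a single NPU, which you can verify against the port-to-NPU mapping on your own hardware.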

The diagram below shows a very stylized example of a typical (well, at least to me) deployment.

VDOM Design

This particular scenario shows VDOM-Live and VDOM-Test spread across two NPUs. This mapping is not configured directly on the firewall but follows by implication: all network interfaces and LAGs associated with each VDOM are connected to the same NPU. Each LAG is associated with at least one zone (GI-DMZ, Mobile Zone, Untrust), and the client VLANs are then tiered from each zone. In this case, hardware inter-VDOM links are configured to provide an accelerated path for east-west traffic between the VDOMs and, by extension, the NPUs. The root VDOM is connected only to CPU-bound interfaces but doesn't actually perform any significant processing in this context.
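For reference, NP6 platforms expose a pair of NPU virtual link interfaces per processor that can be bound to VDOMs to carry the accelerated east-west traffic. A minimal sketch, assuming the interfaces are named npu0-vlink0 and npu0-vlink1 (the exact interface names and the addressing here are illustrative and vary by model and firmware):

    config system interface
        edit "npu0-vlink0"
            set vdom "VDOM-Live"
            set ip 10.255.0.1 255.255.255.252
        next
        edit "npu0-vlink1"
            set vdom "VDOM-Test"
            set ip 10.255.0.2 255.255.255.252
        next
    end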

How do I find out which port is connected to which NPU?

There are two methods of finding this out. Using the Command Line Interface (CLI), just run the following commands:

    config global
    diagnose npu np6 port-list

And you will be shown the interface connection table for each NPU:

FOOFIREWALL2 (global) # diag npu np6 port-list
Chip   XAUI Ports   Max   Cross-chip
                    Speed offloading
------ ---- ------- ----- ----------
np6_0  0    port26  10G   Yes
       1    port25  10G   Yes
       2    port28  10G   Yes
       3    port27  10G   Yes
       0-3  port1   40G   Yes
------ ---- ------- ----- ----------
np6_1  0    port30  10G   Yes
       1    port29  10G   Yes
       2    port32  10G   Yes
       3    port31  10G   Yes
       0-3  port3   40G   Yes
------ ---- ------- ----- ----------
np6_2  0    port5   10G   Yes
       0    port9   10G   Yes
       0    port13  10G   Yes
       1    port6   10G   Yes
       1    port10  10G   Yes
       1    port14  10G   Yes
       2    port7   10G   Yes
       2    port11  10G   Yes
       3    port8   10G   Yes
       3    port12  10G   Yes
       0-3  port2   40G   Yes
------ ---- ------- ----- ----------
np6_3  0    port15  10G   Yes
       0    port19  10G   Yes    
       0    port23  10G   Yes
       1    port16  10G   Yes
       1    port20  10G   Yes
       1    port24  10G   Yes
       2    port17  10G   Yes
       2    port21  10G   Yes
       3    port18  10G   Yes
       3    port22  10G   Yes
       0-3  port4   40G   Yes

Alternatively, do what I do: open the FortiOS Handbook: Hardware Acceleration Guide and look for the diagram with the tiny port numbers written above the colour-coded physical interfaces:

Fortinet 1500D NPU Schematic

Power Controls

It is often said that power is useless without control, and the FortiGate platform and the NP6 processor have both in spades: the control comes at the design stage, when you are mapping physical interfaces to zones, and in turn to VDOMs and NPUs. The exceptional performance of the FortiGate firewall comes from its hybrid design of conventional CPUs and ASICs. Optimal performance management is simply a case of understanding how traffic enters the firewall and how it leaves it. In the same way that one is not expected to understand the intricate mechanics of a supercar, one does need to know what the loud pedal is for and where the fuel goes.
