<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux/include/net/sch_generic.h, branch v4.4</title>
<subtitle>Mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/
</subtitle>
<id>https://git.shady.money/linux/atom?h=v4.4</id>
<link rel='self' href='https://git.shady.money/linux/atom?h=v4.4'/>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/'/>
<updated>2015-12-03T19:59:05Z</updated>
<entry>
<title>net_sched: fix qdisc_tree_decrease_qlen() races</title>
<updated>2015-12-03T19:59:05Z</updated>
<author>
<name>Eric Dumazet</name>
<email>edumazet@google.com</email>
</author>
<published>2015-12-02T04:08:51Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=4eaf3b84f2881c9c028f1d5e76c52ab575fe3a66'/>
<id>urn:sha1:4eaf3b84f2881c9c028f1d5e76c52ab575fe3a66</id>
<content type='text'>
qdisc_tree_decrease_qlen() suffers from two problems on multiqueue
devices.

One problem is that it updates sch-&gt;q.qlen and sch-&gt;qstats.drops
on the mq/mqprio root qdisc, while it should not : Daniele
reported underflows errors :
[  681.774821] PAX: sch-&gt;q.qlen: 0 n: 1
[  681.774825] PAX: size overflow detected in function qdisc_tree_decrease_qlen net/sched/sch_api.c:769 cicus.693_49 min, count: 72, decl: qlen; num: 0; context: sk_buff_head;
[  681.774954] CPU: 2 PID: 19 Comm: ksoftirqd/2 Tainted: G           O    4.2.6.201511282239-1-grsec #1
[  681.774955] Hardware name: ASUSTeK COMPUTER INC. X302LJ/X302LJ, BIOS X302LJ.202 03/05/2015
[  681.774956]  ffffffffa9a04863 0000000000000000 0000000000000000 ffffffffa990ff7c
[  681.774959]  ffffc90000d3bc38 ffffffffa95d2810 0000000000000007 ffffffffa991002b
[  681.774960]  ffffc90000d3bc68 ffffffffa91a44f4 0000000000000001 0000000000000001
[  681.774962] Call Trace:
[  681.774967]  [&lt;ffffffffa95d2810&gt;] dump_stack+0x4c/0x7f
[  681.774970]  [&lt;ffffffffa91a44f4&gt;] report_size_overflow+0x34/0x50
[  681.774972]  [&lt;ffffffffa94d17e2&gt;] qdisc_tree_decrease_qlen+0x152/0x160
[  681.774976]  [&lt;ffffffffc02694b1&gt;] fq_codel_dequeue+0x7b1/0x820 [sch_fq_codel]
[  681.774978]  [&lt;ffffffffc02680a0&gt;] ? qdisc_peek_dequeued+0xa0/0xa0 [sch_fq_codel]
[  681.774980]  [&lt;ffffffffa94cd92d&gt;] __qdisc_run+0x4d/0x1d0
[  681.774983]  [&lt;ffffffffa949b2b2&gt;] net_tx_action+0xc2/0x160
[  681.774985]  [&lt;ffffffffa90664c1&gt;] __do_softirq+0xf1/0x200
[  681.774987]  [&lt;ffffffffa90665ee&gt;] run_ksoftirqd+0x1e/0x30
[  681.774989]  [&lt;ffffffffa90896b0&gt;] smpboot_thread_fn+0x150/0x260
[  681.774991]  [&lt;ffffffffa9089560&gt;] ? sort_range+0x40/0x40
[  681.774992]  [&lt;ffffffffa9085fe4&gt;] kthread+0xe4/0x100
[  681.774994]  [&lt;ffffffffa9085f00&gt;] ? kthread_worker_fn+0x170/0x170
[  681.774995]  [&lt;ffffffffa95d8d1e&gt;] ret_from_fork+0x3e/0x70

mq/mqprio have their own ways to report qlen/drops by folding stats on
all their queues, with appropriate locking.

A second problem is that qdisc_tree_decrease_qlen() calls qdisc_lookup()
without proper locking : concurrent qdisc updates could corrupt the list
that qdisc_match_from_root() parses to find a qdisc given its handle.

Fix first problem adding a TCQ_F_NOPARENT qdisc flag that
qdisc_tree_decrease_qlen() can use to abort its tree traversal,
as soon as it meets a mq/mqprio qdisc children.

Second problem can be fixed by RCU protection.
Qdisc are already freed after RCU grace period, so qdisc_list_add() and
qdisc_list_del() simply have to use appropriate rcu list variants.

A future patch will add a per struct netdev_queue list anchor, so that
qdisc_tree_decrease_qlen() can have more efficient lookups.

Reported-by: Daniele Fucini &lt;dfucini@gmail.com&gt;
Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Cc: Cong Wang &lt;cwang@twopensource.com&gt;
Cc: Jamal Hadi Salim &lt;jhs@mojatatu.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>bpf: add bpf_redirect() helper</title>
<updated>2015-09-18T04:09:07Z</updated>
<author>
<name>Alexei Starovoitov</name>
<email>ast@plumgrid.com</email>
</author>
<published>2015-09-16T06:05:43Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=27b29f63058d26c6c1742f1993338280d5a41dc6'/>
<id>urn:sha1:27b29f63058d26c6c1742f1993338280d5a41dc6</id>
<content type='text'>
Existing bpf_clone_redirect() helper clones skb before redirecting
it to RX or TX of destination netdev.
Introduce bpf_redirect() helper that does that without cloning.

Benchmarked with two hosts using 10G ixgbe NICs.
One host is doing line rate pktgen.
Another host is configured as:
$ tc qdisc add dev $dev ingress
$ tc filter add dev $dev root pref 10 u32 match u32 0 0 flowid 1:2 \
   action bpf run object-file tcbpf1_kern.o section clone_redirect_xmit drop
so it receives the packet on $dev and immediately xmits it on $dev + 1
The section 'clone_redirect_xmit' in tcbpf1_kern.o file has the program
that does bpf_clone_redirect() and performance is 2.0 Mpps

$ tc filter add dev $dev root pref 10 u32 match u32 0 0 flowid 1:2 \
   action bpf run object-file tcbpf1_kern.o section redirect_xmit drop
which is using bpf_redirect() - 2.4 Mpps

and using cls_bpf with integrated actions as:
$ tc filter add dev $dev root pref 10 \
  bpf run object-file tcbpf1_kern.o section redirect_xmit integ_act classid 1
performance is 2.5 Mpps

To summarize:
u32+act_bpf using clone_redirect - 2.0 Mpps
u32+act_bpf using redirect - 2.4 Mpps
cls_bpf using redirect - 2.5 Mpps

For comparison linux bridge in this setup is doing 2.1 Mpps
and ixgbe rx + drop in ip_rcv - 7.8 Mpps

Signed-off-by: Alexei Starovoitov &lt;ast@plumgrid.com&gt;
Acked-by: Daniel Borkmann &lt;daniel@iogearbox.net&gt;
Acked-by: John Fastabend &lt;john.r.fastabend@intel.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>cls_bpf: introduce integrated actions</title>
<updated>2015-09-18T04:09:06Z</updated>
<author>
<name>Daniel Borkmann</name>
<email>daniel@iogearbox.net</email>
</author>
<published>2015-09-16T06:05:42Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=045efa82ff563cd4e656ca1c2e354fa5bf6bbda4'/>
<id>urn:sha1:045efa82ff563cd4e656ca1c2e354fa5bf6bbda4</id>
<content type='text'>
Often cls_bpf classifier is used with single action drop attached.
Optimize this use case and let cls_bpf return both classid and action.
For backwards compatibility reasons enable this feature under
TCA_BPF_FLAG_ACT_DIRECT flag.

Then more interesting programs like the following are easier to write:
int cls_bpf_prog(struct __sk_buff *skb)
{
  /* classify arp, ip, ipv6 into different traffic classes
   * and drop all other packets
   */
  switch (skb-&gt;protocol) {
  case htons(ETH_P_ARP):
    skb-&gt;tc_classid = 1;
    break;
  case htons(ETH_P_IP):
    skb-&gt;tc_classid = 2;
    break;
  case htons(ETH_P_IPV6):
    skb-&gt;tc_classid = 3;
    break;
  default:
    return TC_ACT_SHOT;
  }

  return TC_ACT_OK;
}

Joint work with Daniel Borkmann.

Signed-off-by: Daniel Borkmann &lt;daniel@iogearbox.net&gt;
Signed-off-by: Alexei Starovoitov &lt;ast@plumgrid.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>net: sched: register noqueue qdisc</title>
<updated>2015-08-28T00:14:30Z</updated>
<author>
<name>Phil Sutter</name>
<email>phil@nwl.cc</email>
</author>
<published>2015-08-27T19:21:38Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=d66d6c3152e8d5a6db42a56bf7ae1c6cae87ba48'/>
<id>urn:sha1:d66d6c3152e8d5a6db42a56bf7ae1c6cae87ba48</id>
<content type='text'>
This way users can attach noqueue just like any other qdisc using tc
without having to mess with tx_queue_len first.

Signed-off-by: Phil Sutter &lt;phil@nwl.cc&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>net: sched: extend percpu stats helpers</title>
<updated>2015-07-08T20:50:41Z</updated>
<author>
<name>Eric Dumazet</name>
<email>edumazet@google.com</email>
</author>
<published>2015-07-06T12:18:03Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=24ea591d2201c3257d666466e8fac50a6cf3c52f'/>
<id>urn:sha1:24ea591d2201c3257d666466e8fac50a6cf3c52f</id>
<content type='text'>
qdisc_bstats_update_cpu() and other helpers were added to support
percpu stats for qdisc.

We want to add percpu stats for tc action, so this patch add common
helpers.

qdisc_bstats_update_cpu() is renamed to qdisc_bstats_cpu_update()
qdisc_qstats_drop_cpu() is renamed to qdisc_qstats_cpu_drop()

Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Cc: Alexei Starovoitov &lt;ast@plumgrid.com&gt;
Acked-by: Jamal Hadi Salim &lt;jhs@mojatatu.com&gt;
Acked-by: John Fastabend &lt;john.fastabend@gmail.com&gt;
Acked-by: Alexei Starovoitov &lt;ast@plumgrid.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>net: sched: use counter to break reclassify loops</title>
<updated>2015-05-13T19:08:14Z</updated>
<author>
<name>Florian Westphal</name>
<email>fw@strlen.de</email>
</author>
<published>2015-05-11T17:50:41Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=e578d9c02587d57bfa7b560767c698a668a468c6'/>
<id>urn:sha1:e578d9c02587d57bfa7b560767c698a668a468c6</id>
<content type='text'>
Seems all we want here is to avoid endless 'goto reclassify' loop.
tc_classify_compat even resets this counter when something other
than TC_ACT_RECLASSIFY is returned, so this skb-counter doesn't
break hypothetical loops induced by something other than perpetual
TC_ACT_RECLASSIFY return values.

skb_act_clone is now identical to skb_clone, so just use that.

Tested with following (bogus) filter:
tc filter add dev eth0 parent ffff: \
 protocol ip u32 match u32 0 0 police rate 10Kbit burst \
 64000 mtu 1500 action reclassify

Acked-by: Daniel Borkmann &lt;daniel@iogearbox.net&gt;
Signed-off-by: Florian Westphal &lt;fw@strlen.de&gt;
Acked-by: Alexei Starovoitov &lt;ast@plumgrid.com&gt;
Acked-by: Jamal Hadi Salim &lt;jhs@mojatatu.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>net: sched: deprecate enqueue_root()</title>
<updated>2015-05-11T18:17:32Z</updated>
<author>
<name>Eric Dumazet</name>
<email>edumazet@google.com</email>
</author>
<published>2015-05-11T16:06:56Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=b396cca6fafccf16206a5d041d59c9e6b65b6f5a'/>
<id>urn:sha1:b396cca6fafccf16206a5d041d59c9e6b65b6f5a</id>
<content type='text'>
Only left enqueue_root() user is netem, and it looks not necessary :

qdisc_skb_cb(skb)-&gt;pkt_len is preserved after one skb_clone()

Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>net: sched: remove TC_MUNGED bits</title>
<updated>2015-05-03T02:25:17Z</updated>
<author>
<name>Florian Westphal</name>
<email>fw@strlen.de</email>
</author>
<published>2015-04-30T10:12:00Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=4749c3ef854e3a5d3dd3cc0ccd2dcb7e05d583bd'/>
<id>urn:sha1:4749c3ef854e3a5d3dd3cc0ccd2dcb7e05d583bd</id>
<content type='text'>
Not used.

pedit sets TC_MUNGED when packet content was altered, but all the core
does is unset MUNGED again and then set OK2MUNGE.

And the latter isn't tested anywhere. So lets remove both
TC_MUNGED and TC_OK2MUNGE.

Signed-off-by: Florian Westphal &lt;fw@strlen.de&gt;
Acked-by: Alexei Starovoitov &lt;ast@plumgrid.com&gt;
Acked-by: Daniel Borkmann &lt;daniel@iogearbox.net&gt;
Acked-by: Jamal Hadi Salim &lt;jhs@mojatatu.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>net_sched: destroy proto tp when all filters are gone</title>
<updated>2015-03-09T19:35:55Z</updated>
<author>
<name>Cong Wang</name>
<email>cwang@twopensource.com</email>
</author>
<published>2015-03-06T19:47:59Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=1e052be69d045c8d0f82ff1116fd3e5a79661745'/>
<id>urn:sha1:1e052be69d045c8d0f82ff1116fd3e5a79661745</id>
<content type='text'>
Kernel automatically creates a tp for each
(kind, protocol, priority) tuple, which has handle 0,
when we add a new filter, but it still is left there
after we remove our own, unless we don't specify the
handle (literally means all the filters under
the tuple). For example this one is left:

  # tc filter show dev eth0
  filter parent 8001: protocol arp pref 49152 basic

The user-space is hard to clean up these for kernel
because filters like u32 are organized in a complex way.
So kernel is responsible to remove it after all filters
are gone.  Each type of filter has its own way to
store the filters, so each type has to provide its
way to check if all filters are gone.

Cc: Jamal Hadi Salim &lt;jhs@mojatatu.com&gt;
Signed-off-by: Cong Wang &lt;cwang@twopensource.com&gt;
Signed-off-by: Cong Wang &lt;xiyou.wangcong@gmail.com&gt;
Acked-by: Jamal Hadi Salim&lt;jhs@mojatatu.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>net: sched: fix panic in rate estimators</title>
<updated>2015-02-01T01:49:37Z</updated>
<author>
<name>Eric Dumazet</name>
<email>edumazet@google.com</email>
</author>
<published>2015-01-30T01:30:12Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=0d32ef8cef9aa8f375e128f78b77caceaa7e8da0'/>
<id>urn:sha1:0d32ef8cef9aa8f375e128f78b77caceaa7e8da0</id>
<content type='text'>
Doing the following commands on a non idle network device
panics the box instantly, because cpu_bstats gets overwritten
by stats.

tc qdisc add dev eth0 root &lt;your_favorite_qdisc&gt;
... some traffic (one packet is enough) ...
tc qdisc replace dev eth0 root est 1sec 4sec &lt;your_favorite_qdisc&gt;

[  325.355596] BUG: unable to handle kernel paging request at ffff8841dc5a074c
[  325.362609] IP: [&lt;ffffffff81541c9e&gt;] __gnet_stats_copy_basic+0x3e/0x90
[  325.369158] PGD 1fa7067 PUD 0
[  325.372254] Oops: 0000 [#1] SMP
[  325.375514] Modules linked in: ...
[  325.398346] CPU: 13 PID: 14313 Comm: tc Not tainted 3.19.0-smp-DEV #1163
[  325.412042] task: ffff8800793ab5d0 ti: ffff881ff2fa4000 task.ti: ffff881ff2fa4000
[  325.419518] RIP: 0010:[&lt;ffffffff81541c9e&gt;]  [&lt;ffffffff81541c9e&gt;] __gnet_stats_copy_basic+0x3e/0x90
[  325.428506] RSP: 0018:ffff881ff2fa7928  EFLAGS: 00010286
[  325.433824] RAX: 000000000000000c RBX: ffff881ff2fa796c RCX: 000000000000000c
[  325.440988] RDX: ffff8841dc5a0744 RSI: 0000000000000060 RDI: 0000000000000060
[  325.448120] RBP: ffff881ff2fa7948 R08: ffffffff81cd4f80 R09: 0000000000000000
[  325.455268] R10: ffff883ff223e400 R11: 0000000000000000 R12: 000000015cba0744
[  325.462405] R13: ffffffff81cd4f80 R14: ffff883ff223e460 R15: ffff883feea0722c
[  325.469536] FS:  00007f2ee30fa700(0000) GS:ffff88407fa20000(0000) knlGS:0000000000000000
[  325.477630] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  325.483380] CR2: ffff8841dc5a074c CR3: 0000003feeae9000 CR4: 00000000001407e0
[  325.490510] Stack:
[  325.492524]  ffff883feea0722c ffff883fef719dc0 ffff883feea0722c ffff883ff223e4a0
[  325.499990]  ffff881ff2fa79a8 ffffffff815424ee ffff883ff223e49c 000000015cba0744
[  325.507460]  00000000f2fa7978 0000000000000000 ffff881ff2fa79a8 ffff883ff223e4a0
[  325.514956] Call Trace:
[  325.517412]  [&lt;ffffffff815424ee&gt;] gen_new_estimator+0x8e/0x230
[  325.523250]  [&lt;ffffffff815427aa&gt;] gen_replace_estimator+0x4a/0x60
[  325.529349]  [&lt;ffffffff815718ab&gt;] tc_modify_qdisc+0x52b/0x590
[  325.535117]  [&lt;ffffffff8155edd0&gt;] rtnetlink_rcv_msg+0xa0/0x240
[  325.540963]  [&lt;ffffffff8155ed30&gt;] ? __rtnl_unlock+0x20/0x20
[  325.546532]  [&lt;ffffffff8157f811&gt;] netlink_rcv_skb+0xb1/0xc0
[  325.552145]  [&lt;ffffffff8155b355&gt;] rtnetlink_rcv+0x25/0x40
[  325.557558]  [&lt;ffffffff8157f0d8&gt;] netlink_unicast+0x168/0x220
[  325.563317]  [&lt;ffffffff8157f47c&gt;] netlink_sendmsg+0x2ec/0x3e0

Lets play safe and not use an union : percpu 'pointers' are mostly read
anyway, and we have typically few qdiscs per host.

Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Cc: John Fastabend &lt;john.fastabend@gmail.com&gt;
Fixes: 22e0f8b9322c ("net: sched: make bstats per cpu and estimator RCU safe")
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
</feed>
