Generation 2 offline - Rachio DNS bug

I have a generation 2 that keeps going “offline”. After multiple resets it will sometimes come back. I finally got around to taking a closer look after the latest outage, when I couldn’t get it to come back and my grass started to die in the heat.

Upon inspection, it connects to the Wi-Fi and network fine, gets an IP, etc. However, it shows up as offline because it can’t connect to the Rachio cloud. Specifically, it can’t resolve mqtt.rach.io. The problem isn’t the DNS server (which will resolve mqtt.rach.io just fine); the problem is that the controller is issuing a flawed DNS request, so the DNS server rejects it. I believe this to be a bug in the firmware.

When the controller connects to the network, it pulls an IP via DHCP, which also provides the DNS server addresses for the network. The controller then sends a DNS request to the DNS server. The layer 2 headers are correct, so the packet makes it there fine, BUT the destination IP address in the layer 3 (IP) header of the request is 0.0.0.0, so the DNS server rejects it. (And before you ask, the DNS server advertised was not 0.0.0.0.) So the controller can’t resolve mqtt.rach.io and thus is marked “offline”. It also can’t resolve pool.ntp.org, which suffers from the same root cause, but that is only for NTP and isn’t contributing to the offline problem.
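For anyone who wants to check their own capture for this, here is a minimal Python sketch that pulls the destination address out of a raw IPv4 header and flags 0.0.0.0. It assumes you already have the IP packet bytes (for example, exported from a capture with the Ethernet header stripped). The names `ipv4_dst`, `has_unspecified_dst`, and `sample_header` are hypothetical helpers for illustration, and the addresses are made up.

```python
import socket
import struct


def ipv4_dst(packet: bytes) -> str:
    """Return the destination address from a raw IPv4 header."""
    if len(packet) < 20:
        raise ValueError("too short for an IPv4 header")
    if packet[0] >> 4 != 4:
        raise ValueError("not an IPv4 packet")
    # The destination address occupies bytes 16..19 of the IPv4 header.
    return socket.inet_ntoa(packet[16:20])


def has_unspecified_dst(packet: bytes) -> bool:
    """True if the packet is addressed to 0.0.0.0 at layer 3."""
    return ipv4_dst(packet) == "0.0.0.0"


def sample_header(dst: str, src: str = "192.168.1.50") -> bytes:
    """Build a bare 20-byte IPv4/UDP header for testing (checksum left at 0)."""
    return struct.pack(
        "!BBHHHBBH4s4s",
        0x45,  # version 4, header length 5 words
        0,     # DSCP/ECN
        20,    # total length (header only in this sample)
        0,     # identification
        0,     # flags and fragment offset
        64,    # TTL
        17,    # protocol: UDP
        0,     # header checksum (not computed here)
        socket.inet_aton(src),
        socket.inet_aton(dst),
    )


if __name__ == "__main__":
    print(has_unspecified_dst(sample_header("0.0.0.0")))
```

Running this over each DNS request from the controller would show whether the bad destination appears on every request or only after some event such as a lease renewal.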

AND THEN, the controller doesn’t try the other DNS servers it should know about. When the controller pulled an IP, the DHCP reply provided multiple DNS servers for the controller to use, but it seems to only use the first one and not the others! Really?

And on top of that, if you look at the DNS requests from the controller, the transaction IDs are 0x0000 (and sometimes 0x0001, I guess when it retries, but sequential transaction IDs shouldn’t be the behavior either because they make replies easy to spoof). Really? Something clearly isn’t right with the traffic coming from the controller. Perhaps the 0x0000 transaction ID values are intentional (or a byproduct of the poor resolver code), but from a security perspective this is an issue.
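A rough Python sketch of how the transaction IDs could be checked in a capture: the ID is the first two bytes of the DNS message (the UDP payload). The function names and the "predictable" heuristic below are my own assumptions for illustration, not anything from Rachio or a standard tool.

```python
import struct
from typing import List


def dns_transaction_id(payload: bytes) -> int:
    """Return the 16-bit transaction ID from a DNS message (UDP payload)."""
    if len(payload) < 12:
        raise ValueError("too short for a DNS header")
    (txid,) = struct.unpack("!H", payload[:2])
    return txid


def looks_predictable(ids: List[int]) -> bool:
    """Heuristic: every ID repeats the previous one or is exactly one higher."""
    if len(ids) < 2:
        return False
    return all(b == a or b == (a + 1) & 0xFFFF for a, b in zip(ids, ids[1:]))


if __name__ == "__main__":
    # IDs like the ones described above: 0x0000, with 0x0001 on retry.
    print(looks_predictable([0x0000, 0x0000, 0x0001]))
```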

But at this point I’m willing to overlook the security issues; I just want the controller to issue proper DNS requests. If you can’t tell, I’m pretty frustrated by this. Is there a patch or firmware upgrade I can apply to fix this? I realize there may be a chicken-and-egg problem if the controller can’t connect to the cloud though…

Let me know if you don’t believe me and need packet captures.

Thanks!

Can you run this and provide the generated code? I will forward to the engineering team for review.

:cheers:

@flowbissa wow, you gave a detailed analysis. BTW, I assume you have other devices that don’t exhibit this behavior? Based on other posts and comments I’ve seen, Rachio does appear to have network issues more often than other devices.

Yes, other devices on the network are working.

Because my grass was withering, I ended up trying a few desperate things: spoofing DNS replies from the DNS server (made much easier because the controller kept using the same source port in the request, the transaction ID was always the same, 0x0000, and of course it’s UDP), spoofing DNS requests from the controller (with the destination IP set properly) to get the DNS server to reply back to the controller, and resetting a few things on the network. In the end it started working. I think resetting the DHCP server and forcing the controller to renew its IP may have been what did it. However, going through the connect/setup process for the controller earlier was not successful (and that involves DHCP and getting an IP), so I’m not exactly sure what fixed it, and it is tough to troubleshoot now that it is working…

It seems that in some initial state, if DNS works, all is good for the moment. But if there is a network hiccup, or perhaps if there needs to be a subsequent lookup and it doesn’t go smoothly, the controller ends up in a buggy state, sending DNS requests with 0.0.0.0 as the destination address in the IP header and a transaction ID of 0x0000.

I’m going to have to wait until it goes offline again and see if I can nail it down. Do you guys make the source code available? Or, if third-party networking libraries are used for the communication, can you share which ones and which versions?

Thank you.

Hi,

What I think you’re seeing is actually what happens when DHCP doesn’t work correctly and the controller goes into auto-ip mode (which isn’t really needed but is part of the Apple provisioning setup).
The controller should eventually try DHCP again and start doing what you expect.

On the port and transaction ID, keep in mind the DNS resolver in the controller needs to fit into less than a kilobyte of memory. It does a single DNS request at a time and doesn’t really do any state tracking, so there’s no port randomisation etc. Most of the time it just needs to resolve a single address on boot-up, so it’s sufficient for the job it’s doing.
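To make the scale concrete, here is a small Python sketch of what a single-question DNS query looks like on the wire, with the transaction ID drawn at random unless one is passed in. This is only an illustration of the message format; it is not the controller’s firmware code, and `build_query` is a hypothetical name.

```python
import secrets
import struct
from typing import Optional


def build_query(hostname: str, txid: Optional[int] = None) -> bytes:
    """Build a one-question DNS query (type A, class IN) for hostname."""
    if txid is None:
        txid = secrets.randbelow(0x10000)  # random 16-bit transaction ID
    # Header: ID, flags (recursion desired), QDCOUNT=1, AN/NS/AR counts = 0.
    header = struct.pack("!HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    qname = b""
    for label in hostname.rstrip(".").encode("ascii").split(b"."):
        if not 0 < len(label) <= 63:
            raise ValueError("invalid DNS label length")
        qname += bytes([len(label)]) + label  # length-prefixed label
    qname += b"\x00"  # root label terminates the name
    question = qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN
    return header + question


if __name__ == "__main__":
    print(build_query("mqtt.rach.io").hex())
```

The whole request for mqtt.rach.io is 30 bytes, and the ID is just the first two of them; the state-tracking cost comes from remembering the ID and source port to match against the reply.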

Similarly, the DHCP client the controller uses has to fit into a small footprint. I think it is compliant with the relevant RFCs, but it is possible some DHCP servers can upset it. The common ones shipped in consumer routers (isc, dnsmasq etc) seem to all work fine, but I think people have had issues with some of the firewall-on-a-disc setups.

Cheers,

Daniel

Interesting. Why would a non-Apple device need to use an Apple provisioning setup? Based on @flowbissa’s response, he knows what he’s talking about. I hope the Rachio team will reach out, work with @flowbissa, and figure out what’s going on.

BTW, could this be the same issue where people have to sometimes use an iOS device to set up their device?

Apple being Apple, the only way to do Wi-Fi onboarding with iOS is via their official onboarding method. There’s no other way to do it on iOS.

The Android and iOS onboarding flows are totally separate so sometimes one works where the other doesn’t.

Thanks for all the good information!

I did use an iPhone to do the setup.

The DHCP server is dnsmasq. I looked at the config and realized that, for that network, only one DNS server address is being returned by the DHCP server. So even though I said the controller wasn’t trying alternate DNS servers, I can’t really say that.

The controller went offline again and I saw the DNS requests to host 0.0.0.0. I restarted dnsmasq (the DNS server), but the controller stayed offline. I then rebooted the router (which is where dnsmasq runs) and the controller came back online.

I’ll troubleshoot some more after it drops again (and as time permits). I did set dnsmasq to send another DNS server address in the DHCP reply.

I think I am seeing similar behavior. I found out that if I cache the IP addresses of mqtt.rach.io by doing a dig on it on my router, the controller will go online, which is really odd. My post about this is here.

I am seeing the exact same behavior. It will connect for a variable period of time and then go offline. Has anyone found a fix for this?

Hi Franz,
Can you help me with a similar issue? I have submitted requests to the support team and have not received the help needed to fix this issue.