Newly installed Gen1 controller connects to local network and cloud, but remains "offline" in web/app

nmpu · February 5, 2021, 1:16am

I recently purchased a Gen 1 8-zone controller on eBay. It was a sealed box with “Certified Factory Refurbished” on the label. Everything looks new. I performed the BlinkUp and can see that the controller is connected to my local network. I have my router set to assign a specific IP address based on the controller MAC address. When I ping the controller’s IP, I get a response. I have also monitored via Wireshark and can see that the controller connects to prd03b.boxen.electricimp.com (hosted by AWS) and maintains a conversation for ~11 minutes before going silent. If I cycle power, the same traffic pattern repeats.
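For anyone who wants to measure the same thing from their own capture, here is a minimal sketch in Python. It assumes the packets have already been exported from Wireshark and parsed into (timestamp, source IP, destination IP) tuples. The function names, IP addresses, and sample timestamps are made up for illustration and are not taken from the actual capture.

```python
from datetime import datetime, timedelta


def last_activity(packets, device_ip):
    """Return (first_sent, last_sent) timestamps for packets sent by device_ip.

    `packets` is a list of (timestamp, src_ip, dst_ip) tuples.
    Returns None if the device sent nothing.
    """
    sent = sorted(ts for ts, src, _dst in packets if src == device_ip)
    if not sent:
        return None
    return sent[0], sent[-1]


def talk_duration(packets, device_ip):
    """How long the device kept sending before it went silent."""
    span = last_activity(packets, device_ip)
    if span is None:
        return timedelta(0)
    first, last = span
    return last - first


# Illustrative values only (hypothetical addresses and times).
start = datetime(2021, 2, 4, 17, 0, 0)
sample = [
    (start, "192.168.1.50", "203.0.113.10"),
    (start + timedelta(seconds=2), "203.0.113.10", "192.168.1.50"),
    (start + timedelta(minutes=5), "192.168.1.50", "203.0.113.10"),
    (start + timedelta(minutes=11), "192.168.1.50", "203.0.113.10"),
    (start + timedelta(minutes=12), "203.0.113.10", "192.168.1.50"),
]
print(talk_duration(sample, "192.168.1.50"))
```

Running this against captures from several power cycles would show whether the silence really begins at the same offset each time.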

I’m on my 4th support e-mail and they keep telling me to repeat the BlinkUp. I’ve done that, but it makes no difference and I wouldn’t expect it to. Something isn’t working right in the cloud. I suspect it’s because the controller has very old firmware. As I’m typing, I see a bunch of similar issues to the right. I’ve read those posts and there never seems to be any explanation or a confirmed solution. It’s obvious that Rachio outsources much of their infrastructure. Is there anyone within the company who has a grasp of how things actually work? I have a Wireshark log if anyone cares to investigate. This is 100% reproducible. Upon power-up, the controller makes a secure connection to the cloud and eventually stops talking. This is undeniable fact. Without communication, the “offline” status is not going to magically improve.

–David

Stewart · February 5, 2021, 4:37am

Interesting stuff – I did not realize that Gen 1 was so different from newer versions. I agree with your traffic analysis; the TLS handshake is successful and after a few packets of app data, there is nothing but keep-alive packets until the server decides to disconnect.

IMO there are two likely possibilities: the old firmware is incompatible with a change at the Electric Imp cloud, or its higher-level protocols are incompatible with Rachio’s cloud.

Have you tried deleting the controller from your Rachio account, factory resetting the hardware and starting over? Though unlikely, there may have just been a glitch the first time and there is no firmware problem after all.

After reading your post, I found
https://developer.electricimp.com/troubleshooting/blinkup
which indicates that failures after connecting to the Imp cloud are logged in impCentral. With luck, Rachio will check this for you, or give you the credentials so you can check yourself.

If Rachio won’t cooperate, you may be able to get support from Electric Imp. They were recently bought by Twilio, known for excellent support.

I was unable to find a good contact for them. Closest was

The Los Altos number is Google Voice, maybe someone will answer who can direct you to the right person. Alas, no number for their UK office but I thought it was cool that they are within 0.1 miles of the Raspberry Pi Foundation.

If all else fails, try https://www.electricimp.com/contact-us/ or reaching them through Twilio.

Attempting to connect to https://prd03b.boxen.electricimp.com:31314/ from a browser failed for lack of a valid client certificate, so it’s unlikely that you’ll be able to view the decrypted traffic without a physical attack.

Good luck and let us know what you find.

nmpu · February 5, 2021, 6:21am

Thanks for your interest.

hfiennes at Electric Imp was kind enough to do a quick check based on MAC address. His analysis: "From our side, it’s connecting fine, it has the correct OS version and the latest Rachio application deploy on it, so I can’t comment more as to why it’s not working for you - that would appear to be something in Rachio’s backend.

I do see the connection timing out and being closed, but it’s not clear if that is you turning the device off."

I did not turn off the power. I have determined that the controller always stops communicating after ~11 minutes. Here are more logs. It’s very consistent.

Yes, I tried deleting and re-adding the controller to my Rachio account. It makes no difference. I’ll have to research how to do a factory reset, if that’s possible.

–David

Stewart · February 5, 2021, 7:13am

Could this be a serial number mismatch, e.g. the serial number on the box came from another unit? Is there a legible number on the back of the unit itself? Does it match the barcode?

Are any incoming connections to the device possible (web, telnet, SSH)? If so, that might be a way to verify the serial and possibly get details about an error.

Based on @hfiennes’s comment, I would assume that the connection attempts should have logged something on the Rachio cloud, at least an error message. I hope that someone at Rachio can search their server logs for the MAC or S/N and tell you what they see.

Based on

you may be out of luck, though apparently if it can’t connect to Wi-Fi for an hour, it goes back to awaiting blinkup and might reset other stuff.

nmpu · February 5, 2021, 8:43am

My controller looks brand new. The label on the underside is pristine. The serial number matches what’s shown in my Rachio account. The controller was added using the barcode. I would hope that Rachio could keep the SN and MAC in sync. I don’t think there’s any mismatch.

I got another reply from hfiennes who matched my Wireshark capture to Electric Imp server logs. His theory is that the Rachio code in the controller is getting stuck in a loop, blocking network responses and a subsequent reconnect. This would be due to a hardware fault rather than a “bug”, which would affect lots of users. This is a “certified factory refurbished” unit. Maybe it was a defective return that wasn’t thoroughly tested.

I still don’t understand how my controller can be “offline” while talking to the cloud for 10 minutes. That makes me think there’s some Rachio backend config problem. This was the explanation from franz in several similar posts.

–David

Stewart · February 5, 2021, 3:57pm

I know nothing about the internal architecture. Does the Rachio business logic run on a separate CPU (the Imp is essentially just a fancy NIC)? If so, then the symptoms you’re seeing might be consistent with that CPU freezing or perhaps not running at all.

I don’t understand why Rachio hasn’t provided feedback from their cloud logs, especially given that you have captures with accurate timestamps. If nothing is logged at all, it’s pretty clear that the unit should be replaced. If errors were logged, they might show e.g. corrupted back-end data for your unit, or otherwise suggest corrective action.

nmpu · February 5, 2021, 4:59pm

I believe it’s all the same CPU, but the low-level stuff is “preemptive” (i.e. interrupt driven). The “ping” response is handled by low-level code while the “Keep-Alive” request is intentionally routed to the application. So, the controller will remain connected to Wi-Fi and respond to pings even when the application has crashed. This is a probable explanation for every problem report I’ve read. It even explains why a controller doesn’t follow a preset plan when “offline”.
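To make that reasoning concrete, here is a toy model in Python. It is only a sketch of the idea, not Rachio or Electric Imp code: the class names, the `max_missed` threshold, and the polling scheme are all assumptions made up for illustration.

```python
class Device:
    """Toy device: low-level code answers pings, the application answers keep-alives."""

    def __init__(self):
        self.app_hung = False

    def ping(self):
        # Handled by low-level (interrupt-driven) code, independent of the app.
        return True

    def keep_alive(self):
        # Routed to the application, so a hung application never replies.
        return not self.app_hung


class Server:
    """Toy cloud side: drops the connection after too many missed keep-alives."""

    def __init__(self, max_missed=3):  # threshold is hypothetical
        self.max_missed = max_missed
        self.missed = 0
        self.connected = True

    def poll(self, device):
        if not self.connected:
            return False
        if device.keep_alive():
            self.missed = 0
        else:
            self.missed += 1
            if self.missed >= self.max_missed:
                self.connected = False
        return self.connected


device = Device()
server = Server(max_missed=3)
device.app_hung = True  # application hangs early in boot

for _ in range(3):
    server.poll(device)

print("ping answered:", device.ping())
print("server still connected:", server.connected)
```

In this model the device keeps answering pings while the server gives up on it, which matches the combination of "responds to ping" and "offline" described above.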

Assuming all this is correct, the remaining question is why is the application crashing? Rachio needs a watchdog timer. When a Gen 1 is “offline”, there’s no way for a customer to verify the device is functioning.
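For readers unfamiliar with the term, a watchdog timer can be sketched roughly as follows. This is a software illustration in Python under stated assumptions: the `Watchdog` class, its method names, and the 5-second timeout are hypothetical. On real hardware the watchdog is normally a hardware timer that resets the CPU, and nothing here says whether the controller's firmware has or can use one.

```python
import time


class Watchdog:
    """Calls on_expire if feed() has not been called within timeout_s seconds."""

    def __init__(self, timeout_s, on_expire, clock=time.monotonic):
        self._timeout = timeout_s
        self._on_expire = on_expire
        self._clock = clock  # injectable so the example can use a fake clock
        self._deadline = self._clock() + self._timeout

    def feed(self):
        # A healthy application calls this regularly from its main loop.
        self._deadline = self._clock() + self._timeout

    def check(self):
        # Called periodically by code that keeps running even if the app hangs.
        if self._clock() >= self._deadline:
            self._on_expire()
            self.feed()  # restart the countdown after recovery
            return True
        return False


# Demonstration with a fake clock so the result does not depend on real time.
now = [0.0]
resets = []
wd = Watchdog(5.0, on_expire=lambda: resets.append(now[0]), clock=lambda: now[0])

now[0] = 3.0
wd.feed()                 # application is alive and feeds the watchdog
now[0] = 7.0
fired_early = wd.check()  # deadline is 8.0, so nothing happens yet
now[0] = 9.0
fired_late = wd.check()   # application stopped feeding, so the watchdog fires

print(fired_early, fired_late, resets)
```

The point of the design is that the recovery action is triggered by the absence of a signal from the application, so a hang like the one suspected here would lead to a restart instead of a permanent "offline" state.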

–David

hfiennes · February 8, 2021, 5:21am

As @nmpu reports from our messages, the original Gen 1 Rachio runs the entire embedded application on the imp002 module (which has an STM32F205 in it). In this case impOS (the imp RTOS) is still running, responding to ICMP, etc., but the Rachio application appears to be hung/spinning, so eventually the imp server marks the device as unresponsive and closes the connection from the server side.

The device never shows online in the app, so I guess this spinning is happening very early in the application boot, before the code has told the Rachio backend that it’s ready for commands. Rachio should be able to do some diagnosis on this, but there’s not much we can do on our end as customer/application data is not something we can look at.