Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Retries on API or wifi fail #29

Open
YouCanNotBeSerious opened this issue Apr 11, 2023 · 31 comments
Open

Retries on API or wifi fail #29

YouCanNotBeSerious opened this issue Apr 11, 2023 · 31 comments
Assignees
Labels
enhancement New feature or request

Comments

@YouCanNotBeSerious
Copy link

YouCanNotBeSerious commented Apr 11, 2023

Hi -

VERY NICE PROJECT for a start - thanks for sharing it with us all.

I have flashed the code onto a spare board to play with before committing to buying the display .... what I see so far (debug terminal) is that on a wifi fail, or API request fail we wait until the next wake time to try again. I am examining the code (as an amateur!) now, but if this is something I have misconfigured or is not expected plz let me know :-)

I'll try to work out in code how to try say 3 times 1 minute apart on a fail before failing until the next wake time - already I see:

display.powerOff();
beginDeepSleep(startTime, &timeInfo);

on a fail of fetching the time so will start there.

@lmarzen
Copy link
Owner

lmarzen commented Apr 11, 2023

Hi @YouCanNotBeSerious,

Sounds like you are understanding everything correctly. You are looking in the right places, most of the logic you are looking for is in main.cpp.

try say 3 times 1 minute apart on a fail before failing until the next wake time

First off, that is an excellent idea and definitely possible.

Let me share some more context about the project. One of the primary goals and motivations of this project is that it is low power enough to run on battery for extended periods (months). For this reason, much effort was taken to reduce time spent awake as much as possible. One thing to keep in mind is that when the esp32 wakes up from deep sleep, it enters the setup() function. This can make it somewhat less trivial to keep track of the state of the weather display. However there are solutions, ideally we store the state of the project to non-volatile storage (nvs) before entering deep-sleep then when we awake we can read from nvs. An example of this is low battery behavior. When low battery happens we want the display to show the low battery warning and then no longer update the display until the battery level is high enough again. So if the display wakes up when it has low battery it will check nvs to see if the low battery screen is already displayed. If so it will immediately sleep.

In regards to wifi, attempts are made to connect for 10s, after that it's considered a lost cause so we give up.
For api calls, up to 3 consecutive attempts are made for each api call. If it fails 3 times in a row we again consider it a lost cause and will try again at the next wake period.

You may have seen this option already but I thought I'd mention it cause it is related to this discussion, SLEEP_DURATION = 30 in config.cpp. Probably not what you have in mind, but if you simply wanted the display to update more frequently you could decrease this from 30 minutes.

-Luke

@YouCanNotBeSerious
Copy link
Author

YouCanNotBeSerious commented Apr 11, 2023

Thanks for the reply!

In regards to wifi, attempts are made to connect for 10s, after that it's considered a lost cause so we give up.
For api calls, up to 3 consecutive attempts are made for each api call. If it fails 3 times in a row we again consider it a lost cause and will try again at the next wake period.

I just found most of this logic :-)

What I am actually thinking right now is maybe I've given myself an xy problem - perhaps I should start with why the board doesn't connect to wifi every wake first.

FYI:

rst:0x5 (DEEPSLEEP_RESET),boot:0x13 (SPI_FAST_FLASH_BOOT)
configsip: 0, SPIWP:0xee
clk_drv:0x00,q_drv:0x00,d_drv:0x00,cs0_drv:0x00,hd_drv:0x00,wp_drv:0x00
mode:DIO, clock div:2
load:0x3fff0030,len:1184
load:0x40078000,len:13104
load:0x40080400,len:3036
entry 0x400805e4
Battery voltage: 4.11
Connecting to SSID
IP: 192.168.0.70
Tuesday, April 11, 2023 12:00:06
OpenWeatherMap One Call 3.0 API error: -1 connection refused
OpenWeatherMap Air Pollution API error: -1 connection refused
OpenWeatherMap Air Pollution API error: -1 connection refused
_PowerOn : 124970
_Update_Full : 7616001
_PowerOff : 40001
Status: 
Awake for 38.473s
Deep-sleep for 1774s
ets Jul 29 2019 12:21:46

rst:0x5 (DEEPSLEEP_RESET),boot:0x13 (SPI_FAST_FLASH_BOOT)
configsip: 0, SPIWP:0xee
clk_drv:0x00,q_drv:0x00,d_drv:0x00,cs0_drv:0x00,hd_drv:0x00,wp_drv:0x00
mode:DIO, clock div:2
load:0x3fff0030,len:1184
load:0x40078000,len:13104
load:0x40080400,len:3036
entry 0x400805e4
Battery voltage: 4.11
Connecting to SSID
IP: 192.168.0.70
Tuesday, April 11, 2023 12:30:17
_PowerOn : 124997
_Update_Full : 7615001
_PowerOff : 40001
Status: 
Awake for 13.590s
Deep-sleep for 1786s

For the log above, the display does NOT go to the API error screen. I have yet to capture a log where the error screen gets called. When it does, it seems to then go to sleep until the next interval (currently one hour) so thedisplay would have that error in place for an hour, I want to add some retries, or leave the previous forecast on screen.

@lmarzen
Copy link
Owner

lmarzen commented Apr 11, 2023

This appears to be the expected behavior, correct?

@YouCanNotBeSerious
Copy link
Author

Yes - sorry, I was mid-edit on my previous post, please see that one.

@lmarzen
Copy link
Owner

lmarzen commented Apr 11, 2023

If you are having issues with your board connecting to wifi on first wake, are you referring to the first wake after you flashed new code?

Particularly when you flash new code it is not uncommon to fail to connect to wifi for whatever reason. I would simply hit the reset button.

@lmarzen
Copy link
Owner

lmarzen commented Apr 11, 2023

You might also try increasing the timeout for the http GET request to openweathermap if you are having issues with it timing out.

@YouCanNotBeSerious
Copy link
Author

Unfortunately no, it's after a few successful updates e.g. I have this test running on my workbench and yesterday it was fine all day (1 hr updates) then the one at 2pm failed - 3pm fixed it.

The wifi connect is less worrisome, as above you confirmed it tries for 10s and I could increase that to say 30 but beyond that I need to look at my wifi setup.

What about as a quick change, if I comment out this under each of the air quality and Onecall blocks?

do
    {
      drawError(wi_cloud_down_196x196, statusStr, tmpStr);
    } while (display.nextPage());

@YouCanNotBeSerious
Copy link
Author

An unusual "complaint" but you are responding so fast I am getting out of sync :-) :-) :-)

You might also try increasing the timeout for the http GET request to openweathermap if you are having issues with it timing out.

Have done so on wifi connect as well as the GETs, thanks.

@lmarzen
Copy link
Owner

lmarzen commented Apr 11, 2023

What about as a quick change, if I comment out this under each of the air quality and Onecall blocks?

do
    {
      drawError(wi_cloud_down_196x196, statusStr, tmpStr);
    } while (display.nextPage());

This code only draws the error to the display. Maybe I am misunderstanding what the goal of commenting this out is?

@YouCanNotBeSerious
Copy link
Author

I was hoping to leave the previous forecast on screen until the next wake attempt - at the moment I see it pushes the error to the screen; Old info on the screen would be less jarring to me that an error message being there for an hour - I see the pain that it also hides the fact there was an error so maybe 3 attempts a few mins apart, leaving the old forecast on screen, and finally displaying the error?

For now I am increasing the timeout 10s --> 20s on wifi connect each wakeup, and the attempts for the GETs under API calls from 3-->5 and will run it a few hours.

I suspect most of this is a manifestation of being on a slow internet connection with long ping times to the OpenWeather servers half a world away, apart from the wifi connection failures which are mine to own.

@lmarzen
Copy link
Owner

lmarzen commented Apr 11, 2023

Oh then, yes comment out the drawError messages and it should have the effect you desire.

@YouCanNotBeSerious
Copy link
Author

Thanks - for now will run with the increased connection attempts and report back in a few days!

Also on the wifi connection attempts - totally my fault, once I bothered to look at RSSI it was way too low, made no sense, until I realised ESP only uses 2.4GHz and I am well covered with 5GHz all over the house but 2.4GHz is only served by one access point which not surprisingly is not close to where I had the weather station.

@lmarzen
Copy link
Owner

lmarzen commented Apr 11, 2023

Update me if increasing those timeout times works, I could increase the default for the project. Admittedly this is a hard thing for me to control for when I am testing so the default times picked are somewhat arbitrary.

@YouCanNotBeSerious
Copy link
Author

Absolutely will update, I got a whole weather forecast e-paper project for no effort on my part, will contribute back anything I can find !

reminder set for a couple of days :-)

@YouCanNotBeSerious
Copy link
Author

So here's the update - tl;dr is this is now a great fixture on my wall and I am immensely grateful to you.

I made the change in the drawError routine to not redraw the display on error - keeping the previous hour's display (I have it set for hourly updates). I also changed the retries in a couple of places for GETting the data from 3 to 5.

I also investigated the underlying cause, and apart from the no connect on first reboot after programming which as you mentioned is just a thing, my internet has several hundred ms pings to the openweathermap servers, with occasional (rough estimate 25% of the time) complete drops of the connection when using a web browser. I don't have the best internet but usually don't notice webpages having a problem, but manually browsing OpenWeather I do occasionally have to manually refresh after getting a 40x error.

So - a few more retries and keeping the last hour's refresh on fail has made the display from a common user perspective (read that as "wife perspective") pretty dang good. No one cares if the weather data is an hour or two out of date, but a big honkin' error screen draws attention.

What would be ideal is say after 3 hours of retires we give up and show an error, but that is a bit beyond me at this stage.

I got my e-ink display and built the project into a small picture frame to hang on the wall - cutting the border paper mat that came with the frame to the size of the display is a very easy way to get a clean looking build.

Thanks again - now please provide code to make the project produce weather instead of predicting :-)

@lmarzen
Copy link
Owner

lmarzen commented Apr 18, 2023

Thanks for looking into this.
This is very valuable feedback, I will implement your suggestion. (I've gotten pretty busy recently, but I'll get around to it in the coming weeks)

I'll keep this issue open until then, and I'll tag it when I push updates related to your feedback.

@lmarzen lmarzen added the enhancement New feature or request label Apr 21, 2023
@lmarzen lmarzen self-assigned this Apr 21, 2023
@ChiefPoints
Copy link

I've been looking at this, as well, and have a theory on what's going on. I'm only seeing the api connect errors on the Air Quality request, never on the main weather update request that occurs first. Sometimes it's a time out, and others it's an outright connection refused, as in maybe it's hammering on the connection too quickly after the weather api request. I'm thinking if you can add in a small (or increased) delay before it attempts to grab the Air Quality api info that should alleviate or remove the errors. The logs show that it usually catches on the 2nd or 3rd retry, and I probably see the API error msg on the display about 2 or 3 times a week.

@lmarzen
Copy link
Owner

lmarzen commented Apr 26, 2023

@ChiefPoints good idea

@YouCanNotBeSerious
Copy link
Author

@ChiefPoints now that I think about it, I'm fairly sure I did see the AQI error more often than any other. Note my distance form the servers and general internet connection probably still has a lot to do with it - I was seeing "connection refused" errors even in a web browser, though seemingly not as often as the ESP got.

@YouCanNotBeSerious
Copy link
Author

Just popping in again what...4 months later to say the project is still on my wall, still getting compliments, and with the hack I put in place has had zero issues to date.

Thanks again Luke! and anyone else wanting a nice easy mounting system for this - try a small picture frame :-)

@lmarzen
Copy link
Owner

lmarzen commented Aug 27, 2023

Thanks for following up :). I'm glad to hear it is working well after your patches. It's been on my todo list to properly address this for quite some time and I just want to reiterate my intent to fix this in the main project sometime in the medium-term future. @YouCanNotBeSerious you are welcome to open a pull request if you want with your fixes and would be happy to review them and get them merged into the main project.

PS - the picture frame sounds like a very sleek solution!

@YouCanNotBeSerious
Copy link
Author

My "fixes" really are hacky - it's just hiding the issue so I think the quality of your project should be preserved by waiting for real fixes :-D

Attached a pic of my physical implementation - the picture frame is a real simple way imho to get a clean look, just get a deep frame to suit the battery as the fattest component, and the advantage with a frame is you can cut a mask from thin card to suit the screen size vs frame size.
Screen Shot 2023-08-28 at 10 52 21 AM

@lmarzen
Copy link
Owner

lmarzen commented Aug 27, 2023

WOW! That looks really good! Thank you so much for sharing O_O

@srt19170
Copy link
Contributor

Ditto what everyone has said about how great this project is -- many thanks!

I have a consistent problem with time synchronization via NTP failing (at least the initial synch). I believe that the initial synch takes about 30 seconds (asynchronous to the main process), so it's likely that after startup synchronization won't have succeeded before getLocalTime() is called. I addressed this by watching the synch status for completion in the setup:

bool setupTime(tm *timeInfo)
{
  // passing 0 for gmtOffset_sec and daylightOffset_sec and instead use setenv()
  // for timezone offsets
  // configTime(0, 0, NTP_SERVER_1, NTP_SERVER_2);
  configTzTime(TIMEZONE, NTP_SERVER_1, NTP_SERVER_2);
  // setenv("TZ", TIMEZONE, 1);
  // tzset();

  // Wait for SNTP synchronization to complete
  int retry = 0;
  const int retry_max = 10;  // Maximum number of retries
  while ((sntp_get_sync_status() == SNTP_SYNC_STATUS_RESET) && (retry < retry_max)) {
    Serial.printf("Waiting for SNTP synchronization, attempt %d of %d...\n", retry + 1, retry_max);
    delay(10000);  // Delay for 10 seconds before retrying
    retry++;
  }
  
  return printLocalTime(timeInfo);
} // setupTime

This usually succeeds on the third time through the loop as expected and fixes the missing time problem for me. Also, I'm not sure you need to setup NTP every time you awaken, but I haven't had a chance to test that yet.

On a related note, this loop in printLocalTime:

  int attempts = 0;
  while (!getLocalTime(timeInfo) && attempts++ < 3)
  {
    Serial.println("Failed to obtain time");
    return false;
  }

Probably doesn't do much; I think the only way getLocalTime() can fail is if the time has not been set and that's not likely to change during three calls in rapid succession. But maybe I'm missing something there.

@lmarzen
Copy link
Owner

lmarzen commented Oct 18, 2023

These are all great finds @srt19170.

Feel free to create a pull request with these fixes, otherwise, I will try and get to it this weekend.

@lmarzen lmarzen closed this as completed Oct 18, 2023
@lmarzen lmarzen reopened this Oct 18, 2023
@srt19170
Copy link
Contributor

srt19170 commented Oct 18, 2023

Upon further investigation, it appears that the SNTP process is lost during deep sleep, so you do need to configure and start it again upon wakeup.

It appears that the system keeps track of time pretty well across the deep sleep cycles. Timezone has to be corrected after deep sleep, but after that the time seems to be accurate. Given that the time being off even (say) five minutes wouldn't have much impact on the display, you could probably optimize and only do an SNTP update every couple of hours, or even just once a day. But that might be an unnecessary complication, given that it will only save a small amount of power.

@srt19170
Copy link
Contributor

I've submitted a pull request with the proposed fix for more reliable NTP synchronization. (Hopefully I did it correctly; not my forte!)

@lmarzen
Copy link
Owner

lmarzen commented Oct 20, 2023

Thanks to @srt19170. Improvements have been made to the SNTP process. #59 should hopefully improve the reliability of time synchronization.

@srt19170
Copy link
Contributor

srt19170 commented Nov 10, 2023

Since @YouCanNotBeSerious posted his photo in this thread I'll do the same for my finished version:
PXL_20231110_142531867~3
Box constructed of cherry and cocobolo, with a carved scene of "weather" on the front panel.

Thanks again for documenting your project so thoroughly and helping me out. I certainly could not have done this on my own!

@lmarzen
Copy link
Owner

lmarzen commented Nov 10, 2023

WOW! That is beautiful! 10/10.

@YouCanNotBeSerious
Copy link
Author

@srt19170 I updated with your NTP bits, working great!
@lmarzen I just want to re-iterate, we are all only making these great looking devices for our homes because of your work :-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

4 participants