If you have `sn_start_from_file = .true., .true., .true., .true.,`, create the .sno file. If you have `sn_start_from_file = .false., .false., .false., .false.,`, we are still looking for the source of the error.
• **forrtl: severe (174): SIGSEGV, segmentation fault occurred**
This can be solved by decreasing the `time_step` and changing the `parent_time_step_ratio` accordingly.
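As an illustration (the numbers are placeholders, not recommendations), the relevant `&domains` entries in `namelist.input` look like:

```
&domains
 time_step               = 60,        ! decreased from e.g. 90 after the segfault
 parent_time_step_ratio  = 1, 3, 3,   ! child time step = parent time step / ratio
/
```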
• **Tile Strategy is not specified. Assuming 1D-Y Total number of tiles is too big for 1D-Y tiling. Going 2D. New tiling is 2x 17**
Reduce the `cpu-per-node` setting in WRF_MAIN.job.
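For example, assuming WRF_MAIN.job is a Slurm batch script and this setting corresponds to a `#SBATCH` tasks-per-node line (the option name and values here are illustrative, not taken from the actual script):

```
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=16   # lower this if the tiling falls back to 2D
```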
• **Problems when restarting**
We can have a problem with one type of file; for example, we think that SNOWPACK melts all the snow in some lake grid cells. So SNOWPACK should be fixed, but meanwhile we can just try to **hack it**: change the land use of these grid cells to grass (the class most similar to lakes). That can easily be done in the geo_em files.
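A minimal sketch of the reclassification logic (the category numbers are assumptions: in the MODIS land-use table lakes are usually category 21 and grassland category 10; check the `LU_INDEX` metadata of your geo_em files, and apply the change there with e.g. netCDF4 or NCO):

```python
# Assumed MODIS land-use categories; verify against your geo_em files.
LAKE, GRASS = 21, 10

def lakes_to_grass(lu_index):
    """Return a copy of a 2-D LU_INDEX field with lake cells set to grassland."""
    return [[GRASS if cell == LAKE else cell for cell in row] for row in lu_index]

field = [[21, 16],
         [10, 21]]
print(lakes_to_grass(field))  # -> [[10, 16], [10, 10]]
```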
• **WRF_real.job or WRF_MAIN.job start and fail without creating a rsl.error.0000 file**
The job does not execute real.exe or wrf.exe. That might be because it is linked to a file that does not exist.
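As a quick diagnostic (assuming the usual WRF run directory with symlinked inputs), GNU `find` can list broken symbolic links:

```shell
# List broken symbolic links in the current run directory
find . -maxdepth 1 -xtype l
```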
• **Input data is acceptable to use: ./restart/wrfrst_d02_2022-03-17_15:00:00**
input_wrf: forcing SIMULATION_START_DATE = head_grid start time
due to namelist variable reset_simulation_start
• **Error when entering a new domain**
Reduce the value of `export OMP_STACKSIZE` in WRF_MAIN.job.
It can also be solved by restarting the simulation one hour before the crash. Remember to change in `namelist.input` the starting time of the domains, set `restart = .true.`, and set `sn_start_from_file = .true.` in all the domains that ran before.
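For example, if the run crashed at 15:00, restarting at 14:00 would look roughly like this in `namelist.input` (times and number of domains are illustrative; the namelist section holding `sn_start_from_file` depends on the CRYOWRF build):

```
&time_control
 start_hour = 14, 14, 14, 14,   ! one hour before the failed time
 restart    = .true.,
/
```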
• **Error in \`/scratch/snx3000/gsergi/CRYOWRF_ALPS_2019_ssp585/WRF/./wrf.exe': corrupted size vs. prev_size: 0x000000000f202ef0
forrtl: error (76): Abort trap signal**
It seems to be a memory-leak error, maybe because the maximum number of SNOWPACK layers has been exceeded.
• **At line 2065 of file mediation_integrate.f90
Fortran runtime error: End of record**
I found the error in auxhist5. I just commented these lines and it worked:
```
! frames_per_auxhist5 = 8,8,12,12,
```
• In eiger: **Lmod has detected the following error: Swap failed: "PrgEnv-cray" is not
loaded. Lmod has detected the following error: The following module(s) are unknown:
"cray-parallel-netcdf"**
You forgot to run `module load cray` before the sbatch command.
• Does not necessarily fail but gives ***WARNING* Time in input file not equal to time on domain *WARNING*
*WARNING* Trying next time in file wrffdda_d01 ...**
This happens if, for example, the wrffdda_d0X and wrflowinp_d0X files created by real.exe have a time frequency that does not agree with your restart time. Usually WRF will go through all file entries until it finds the right time information (if the time on the domain is ahead, as would usually happen with restart runs).
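Those frequencies are fixed in `namelist.input` when real.exe writes the files; for reference (interval values are placeholders), the grid-nudging and lower-boundary entries are of the form:

```
&fdda
 grid_fdda        = 1, 1, 1,
 gfdda_interval_m = 360, 360, 360,   ! minutes between records in wrffdda_d0X
/
&time_control
 auxinput4_interval = 360, 360, 360, ! minutes between records in wrflowinp_d0X
/
```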
• **Program received signal SIGSEGV: Segmentation fault - invalid memory reference.** when entering a higher-resolution domain.
**Message from syslogd@eiger-ln002 at Jun 4 08:09:13 ... kernel:\[Hardware Error\]: CPU:66 (17:31:0) MC17_STATUS\[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-\]: 0xdc2040000000011b**
**Message from syslogd@eiger-ln002 at Jun 4 08:09:13 ... kernel:\[Hardware Error\]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.**
• **-------------- FATAL CALLED ---------------
FATAL CALLED FROM FILE: <stdin> LINE: 70
program wrf: error opening wrfinput_d01 for reading ierr= -1021
-------------------------------------------
MPICH Notice [Rank 0] [job id 3107422.0] [Tue Jun 4 08:34:42 2024] [nid002237] - Abort(1) (rank 0 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0**
**Message from syslogd@eiger-ln002 at Jun 4 08:09:13 ... kernel:\[Hardware Error\]: cache level: L3/GEN, tx: GEN, mem-tx: RD**