We are trying to move a MongoDB deployment from an AWS VM to an on-premises machine.
The MongoDB data directory is larger than 2.5 TB.
The idea is to make the on-premises machine a replica of the AWS VM, i.e. add it as a secondary of the same replica set.
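For context, the on-premises node was joined to the existing replica set roughly as follows, from the mongo shell against the current primary. This is only a sketch: "onprem-host:27017" is a placeholder for the on-premises member's address, and dropping the new member's priority to 0 is just one common way to keep it from being elected while it is still syncing.

// Add the on-premises machine as a new replica set member (placeholder address).
rs.add("onprem-host:27017")

// Optional: keep the new member at priority 0 while it performs its initial sync,
// so it cannot become primary. This assumes it is the last member in the config.
cfg = rs.conf()
cfg.members[cfg.members.length - 1].priority = 0
rs.reconfig(cfg)

// Check member states and initial sync progress from any member.
rs.status()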
The problem is that after a few days the replication process gets stuck AND all the data on our on-premises machine is destroyed:
2020-06-22T01:08:02.850+0200 I ASIO [NetworkInterfaceASIO-RS-0] Ending idle connection to host 192.168.26.122:27017 because the pool meets constraints; 2 connections to that host remain open
2020-06-22T01:08:07.911+0200 I ASIO [NetworkInterfaceASIO-RS-0] Ending connection to host 192.168.26.122:27017 due to bad connection status; 1 connections to that host remain open
2020-06-22T01:08:07.911+0200 I REPL [replication-341] Restarting oplog query due to error: NetworkInterfaceExceededTimeLimit: error in fetcher batch callback: Operation timed out. Last fetched optime (with hash): { ts: Timestamp(1592780789, 1), t: 25 }[6723760701176776417]. Restarts remaining: 3
2020-06-22T01:08:07.912+0200 I ASIO [NetworkInterfaceASIO-RS-0] Connecting to 192.168.26.122:27017
2020-06-22T01:08:07.912+0200 I REPL [replication-341] Scheduled new oplog query Fetcher source: 192.168.26.122:27017 database: local query: { find: "oplog.rs", filter: { ts: { $gte: Timestamp(1592780789, 1) } }, tailable: true, oplogReplay: true, awaitData: true, maxTimeMS: 60000, batchSize: 13981010, term: 25, readConcern: { afterClusterTime: Timestamp(1592780789, 1) } } query metadata: { $replData: 1, $oplogQueryData: 1, $readPreference: { mode: "secondaryPreferred" } } active: 1 findNetworkTimeout: 65000ms getMoreNetworkTimeout: 35000ms shutting down?: 0 first: 1 firstCommandScheduler: RemoteCommandRetryScheduler request: RemoteCommand 839529 -- target:192.168.26.122:27017 db:local cmd:{ find: "oplog.rs", filter: { ts: { $gte: Timestamp(1592780789, 1) } }, tailable: true, oplogReplay: true, awaitData: true, maxTimeMS: 60000, batchSize: 13981010, term: 25, readConcern: { afterClusterTime: Timestamp(1592780789, 1) } } active: 1 callbackHandle.valid: 1 callbackHandle.cancelled: 0 attempt: 1 retryPolicy: RetryPolicyImpl maxAttempts: 1 maxTimeMillis: -1ms
2020-06-22T01:08:07.944+0200 I ASIO [NetworkInterfaceASIO-RS-0] Successfully connected to 192.168.26.122:27017, took 33ms (2 connections now open to 192.168.26.122:27017)
2020-06-22T01:09:07.985+0200 I REPL [replication-341] Restarting oplog query due to error: ExceededTimeLimit: error in fetcher batch callback: operation exceeded time limit. Last fetched optime (with hash): { ts: Timestamp(1592780789, 1), t: 25 }[6723760701176776417]. Restarts remaining: 2
2020-06-22T01:09:07.986+0200 I REPL [replication-341] Scheduled new oplog query Fetcher source: 192.168.26.122:27017 database: local query: { find: "oplog.rs", filter: { ts: { $gte: Timestamp(1592780789, 1) } }, tailable: true, oplogReplay: true, awaitData: true, maxTimeMS: 60000, batchSize: 13981010, term: 25, readConcern: { afterClusterTime: Timestamp(1592780789, 1) } } query metadata: { $replData: 1, $oplogQueryData: 1, $readPreference: { mode: "secondaryPreferred" } } active: 1 findNetworkTimeout: 65000ms getMoreNetworkTimeout: 35000ms shutting down?: 0 first: 1 firstCommandScheduler: RemoteCommandRetryScheduler request: RemoteCommand 839534 -- target:192.168.26.122:27017 db:local cmd:{ find: "oplog.rs", filter: { ts: { $gte: Timestamp(1592780789, 1) } }, tailable: true, oplogReplay: true, awaitData: true, maxTimeMS: 60000, batchSize: 13981010, term: 25, readConcern: { afterClusterTime: Timestamp(1592780789, 1) } } active: 1 callbackHandle.valid: 1 callbackHandle.cancelled: 0 attempt: 1 retryPolicy: RetryPolicyImpl maxAttempts: 1 maxTimeMillis: -1ms
2020-06-22T01:10:12.986+0200 I ASIO [NetworkInterfaceASIO-RS-0] Ending connection to host 192.168.26.122:27017 due to bad connection status; 1 connections to that host remain open
2020-06-22T01:10:12.986+0200 I REPL [replication-341] Restarting oplog query due to error: NetworkInterfaceExceededTimeLimit: error in fetcher batch callback: Operation timed out. Last fetched optime (with hash): { ts: Timestamp(1592780789, 1), t: 25 }[6723760701176776417]. Restarts remaining: 1
2020-06-22T01:10:12.987+0200 I ASIO [NetworkInterfaceASIO-RS-0] Connecting to 192.168.26.122:27017
2020-06-22T01:10:12.987+0200 I REPL [replication-341] Scheduled new oplog query Fetcher source: 192.168.26.122:27017 database: local query: { find: "oplog.rs", filter: { ts: { $gte: Timestamp(1592780789, 1) } }, tailable: true, oplogReplay: true, awaitData: true, maxTimeMS: 60000, batchSize: 13981010, term: 25, readConcern: { afterClusterTime: Timestamp(1592780789, 1) } } query metadata: { $replData: 1, $oplogQueryData: 1, $readPreference: { mode: "secondaryPreferred" } } active: 1 findNetworkTimeout: 65000ms getMoreNetworkTimeout: 35000ms shutting down?: 0 first: 1 firstCommandScheduler: RemoteCommandRetryScheduler request: RemoteCommand 839538 -- target:192.168.26.122:27017 db:local cmd:{ find: "oplog.rs", filter: { ts: { $gte: Timestamp(1592780789, 1) } }, tailable: true, oplogReplay: true, awaitData: true, maxTimeMS: 60000, batchSize: 13981010, term: 25, readConcern: { afterClusterTime: Timestamp(1592780789, 1) } } active: 1 callbackHandle.valid: 1 callbackHandle.cancelled: 0 attempt: 1 retryPolicy: RetryPolicyImpl maxAttempts: 1 maxTimeMillis: -1ms
2020-06-22T01:10:13.019+0200 I ASIO [NetworkInterfaceASIO-RS-0] Successfully connected to 192.168.26.122:27017, took 33ms (2 connections now open to 192.168.26.122:27017)
2020-06-22T01:11:17.986+0200 I ASIO [NetworkInterfaceASIO-RS-0] Ending connection to host 192.168.26.122:27017 due to bad connection status; 1 connections to that host remain open
2020-06-22T01:11:17.986+0200 I REPL [replication-341] Error returned from oplog query (no more query restarts left): NetworkInterfaceExceededTimeLimit: error in fetcher batch callback: Operation timed out
2020-06-22T01:11:17.986+0200 I REPL [replication-341] Finished fetching oplog during initial sync: NetworkInterfaceExceededTimeLimit: error in fetcher batch callback: Operation timed out. Last fetched optime and hash: { ts: Timestamp(1592780789, 1), t: 25 }[6723760701176776417]
2020-06-22T01:11:31.818+0200 I REPL [replication-341] CollectionCloner ns:chainanalytics.rawdata_ETH_byhash finished cloning with status: IllegalOperation: AsyncResultsMerger killed
2020-06-22T01:11:35.200+0200 W REPL [replication-341] collection clone for 'chainanalytics.rawdata_ETH_byhash' failed due to IllegalOperation: While cloning collection 'chainanalytics.rawdata_ETH_byhash' there was an error 'AsyncResultsMerger killed'
2020-06-22T01:11:35.200+0200 I REPL [replication-341] CollectionCloner::start called, on ns:chainanalytics.rawdata_ICC_0x027b6094ac3DA754FCcC7C088BE04Ca155782A66
2020-06-22T01:11:35.200+0200 W REPL [replication-341] database 'chainanalytics' (2 of 4) clone failed due to ShutdownInProgress: collection cloner completed
2020-06-22T01:11:35.200+0200 I REPL [replication-341] Finished cloning data: ShutdownInProgress: collection cloner completed. Beginning oplog replay.
2020-06-22T01:11:35.200+0200 I REPL [replication-341] Initial sync attempt finishing up.
[...]
2020-06-22T01:11:35.291+0200 E REPL [replication-341] Initial sync attempt failed -- attempts left: 4 cause: NetworkInterfaceExceededTimeLimit: error fetching oplog during initial sync :: caused by :: error in fetcher batch callback: Operation timed out
2020-06-22T01:11:35.332+0200 I NETWORK [thread12] Successfully connected to 192.168.26.122:27017 (212 connections now open to 192.168.26.122:27017 with a 0 second timeout)
2020-06-22T01:11:35.332+0200 I NETWORK [thread12] scoped connection to 192.168.26.122:27017 not being returned to the pool
2020-06-22T01:11:35.362+0200 I NETWORK [thread12] Starting new replica set monitor for rs0/192.168.10.145:27017,192.168.26.122:27017
2020-06-22T01:11:35.411+0200 I NETWORK [thread12] Successfully connected to 192.168.26.122:27017 (213 connections now open to 192.168.26.122:27017 with a 0 second timeout)
2020-06-22T01:11:35.411+0200 I NETWORK [thread12] scoped connection to 192.168.26.122:27017 not being returned to the pool
2020-06-22T01:11:35.411+0200 I NETWORK [thread12] Starting new replica set monitor for rs0/192.168.10.145:27017,192.168.26.122:27017
2020-06-22T01:11:35.460+0200 I NETWORK [thread12] Successfully connected to 192.168.26.122:27017 (214 connections now open to 192.168.26.122:27017 with a 0 second timeout)
2020-06-22T01:11:35.460+0200 I NETWORK [thread12] scoped connection to 192.168.26.122:27017 not being returned to the pool
2020-06-22T01:11:35.461+0200 I NETWORK [thread12] Starting new replica set monitor for rs0/192.168.10.145:27017,192.168.26.122:27017
2020-06-22T01:11:35.509+0200 I NETWORK [thread12] Successfully connected to 192.168.26.122:27017 (215 connections now open to 192.168.26.122:27017 with a 0 second timeout)
2020-06-22T01:11:35.509+0200 I NETWORK [thread12] scoped connection to 192.168.26.122:27017 not being returned to the pool
2020-06-22T01:11:35.533+0200 I NETWORK [thread12] Starting new replica set monitor for rs0/192.168.10.145:27017,192.168.26.122:27017
2020-06-22T01:11:35.583+0200 I NETWORK [thread12] Successfully connected to 192.168.26.122:27017 (216 connections now open to 192.168.26.122:27017 with a 0 second timeout)
2020-06-22T01:11:35.583+0200 I NETWORK [thread12] scoped connection to 192.168.26.122:27017 not being returned to the pool
2020-06-22T01:11:36.291+0200 I REPL [replication-342] Starting initial sync (attempt 7 of 10)
2020-06-22T01:11:36.293+0200 I STORAGE [replication-342] Finishing collection drop for local.temp_oplog_buffer (no UUID).
2020-06-22T01:11:36.366+0200 I STORAGE [replication-342] createCollection: local.temp_oplog_buffer with no UUID.
2020-06-22T01:11:36.438+0200 I REPL [replication-342] sync source candidate: 192.168.26.122:27017
2020-06-22T01:11:36.438+0200 I STORAGE [replication-342] dropAllDatabasesExceptLocal 3
2020-06-22T01:13:18.957+0200 I COMMAND [ftdc] serverStatus was very slow: { after basic: 0, after asserts: 0, after backgroundFlushing: 0, after connections: 0, after dur: 0, after extra_info: 0, after globalLock: 0, after locks: 0, after logicalSessionRecordCache: 0, after network: 0, after opLatencies: 0, after opcounters: 0, after opcountersRepl: 0, after repl: 0, after security: 0, after storageEngine: 0, after tcmalloc: 0, after transactions: 0, after wiredTiger: 101957, at end: 101957 }
Any ideas?
Isn't it possible to avoid destroying the data when such errors occur, and instead have the replication process resume from the point where it was interrupted?