309 Commits

Author SHA1 Message Date
Salvatore Sanfilippo
5c5761bcd6 Merge branch 'unstable' into modules_fork 2019-09-27 11:24:06 +02:00
Salvatore Sanfilippo
f11b71b9b8 Merge pull request #6235 from oranagra/module_rdb_load_errors
Allow modules to handle RDB loading errors.
2019-09-26 11:52:42 +02:00
antirez
e87768e8ca Replication: clarify why repl_put_online_on_ack exists at all. 2019-08-05 17:38:15 +02:00
Oran Agra
4b9c5d8536 Log message when modules prevent diskless-load 2019-07-30 16:32:58 +03:00
Oran Agra
9fa285523c Avoid diskelss-load if modules did not declare they handle read errors 2019-07-30 15:11:57 +03:00
Oran Agra
e70fbad802 Module API for Forking
* create module API for forking child processes.
* refactor duplicate code around creating and tracking forks by AOF and RDB.
* child processes listen to SIGUSR1 and dies exitFromChild in order to
  eliminate a valgrind warning of unhandled signal.
* note that BGSAVE error reply has changed.

valgrind error is:
  Process terminating with default action of signal 10 (SIGUSR1)
2019-07-17 16:40:24 +03:00
antirez
ecd65cf9cf Diskless replica: fix disklessLoadRestoreBackups() bug. 2019-07-10 12:36:26 +02:00
antirez
26c4e8a965 Diskless replica: refactoring of DBs backups. 2019-07-10 11:42:26 +02:00
antirez
6e9ffa93eb Diskless replica: a few aesthetic changes to replication.c. 2019-07-08 18:32:47 +02:00
Oran Agra
29754ebe22 diskless replication on slave side (don't store rdb to file), plus some other related fixes
The implementation of the diskless replication was currently diskless only on the master side.
The slave side was still storing the received rdb file to the disk before loading it back in and parsing it.

This commit adds two modes to load rdb directly from socket:
1) when-empty
2) using "swapdb"
the third mode of using diskless slave by flushdb is risky and currently not included.

other changes:
--------------
distinguish between aof configuration and state so that we can re-enable aof only when sync eventually
succeeds (and not when exiting from readSyncBulkPayload after a failed attempt)
also a CONFIG GET and INFO during rdb loading would have lied

When loading rdb from the network, don't kill the server on short read (that can be a network error)

Fix rdb check when performed on preamble AOF

tests:
run replication tests for diskless slave too
make replication test a bit more aggressive
Add test for diskless load swapdb
2019-07-08 15:37:48 +03:00
antirez
9eea57cc31 Narrow the effects of PR #6029 to the exact state.
CLIENT PAUSE may be used, in other contexts, for a long time making all
the slaves time out. Better for now to be more specific about what
should disable senidng PINGs.

An alternative to that would be to virtually refresh the slave
interactions when clients are paused, however for now I went for this
more conservative solution.
2019-05-15 12:16:43 +02:00
Salvatore Sanfilippo
3f4f7aff1a Merge pull request #6029 from chendq8/clientpause
fix cluster failover time out
2019-05-15 12:03:19 +02:00
chendianqiang
1e29403134 stop ping when client pause 2019-04-17 21:20:10 +08:00
Salvatore Sanfilippo
641359787c Merge pull request #3830 from oranagra/diskless_capa_pr
several bugfixes to diskless replication
2019-03-22 17:41:40 +01:00
antirez
7e191d3ea3 More sensible name for function: restartAOFAfterSYNC().
Related to #3829.
2019-03-21 17:21:29 +01:00
antirez
4c49e7ad6f Mostly aesthetic changes to restartAOF().
See #3829.
2019-03-21 17:18:24 +01:00
Oran Agra
544b9b0826 diskless replication - notify slave when rdb transfer failed
in diskless replication - master was not notifing the slave that rdb transfer
terminated on error, and lets slave wait for replication timeout
2019-03-20 17:46:19 +02:00
oranagra
4355e29749 bugfix to restartAOF, exit will never happen since retry will get negative.
also reduce an excess sleep
2019-03-20 17:20:07 +02:00
antirez
340d03b64b replicaofCommand() refactoring: stay into 80 cols. 2019-03-18 11:34:40 +01:00
antirez
beaacf96cd Make comment in #5911 stay inside 80 cols. 2019-03-10 09:48:06 +01:00
John Sully
a3e3a24057 Replicas aren't allowed to run the replicaof command 2019-03-09 11:04:48 -05:00
zhaozhao.zz
de0f42bff3 ACL: add masteruser configuration for replication
In mostly production environment, normal user's behavior should be
limited.

Now in redis ACL mechanism we can do it like that:

    user default on +@all ~* -@dangerous nopass
    user admin on +@all ~* >someSeriousPassword

Then the default normal user can not execute dangerous commands like
FLUSHALL/KEYS.

But some admin commands are in dangerous category too like PSYNC,
and the configurations above will forbid replica from sync with master.

Finally I think we could add a new configuration for replication,
it is masteruser option, like this:

    masteruser admin
    masterauth someSeriousPassword

Then replica will try AUTH admin someSeriousPassword and get privilege
to execute PSYNC. If masteruser is NULL, replica would AUTH with only
masterauth like before.
2019-02-12 17:12:37 +08:00
ArkayZheng
1282b83681 Fix the output bug in rename exceptions. 2019-01-25 21:48:23 +08:00
antirez
da54f1fd3f Refactoring: always kill AOF/RDB child via helper functions. 2019-01-21 11:28:44 +01:00
Salvatore Sanfilippo
081ef93364 Merge branch 'unstable' into fixChildInfoPipeFdLeak 2019-01-21 11:20:56 +01:00
Salvatore Sanfilippo
3faf82d14b Merge pull request #5797 from trevor211/fixUpdateDictResizePolicy
Fix update dict resize policy
2019-01-21 11:14:48 +01:00
WuYunlong
1e1a5ceb65 Fix child info pipe fd leak when child process gets killed. 2019-01-21 17:48:45 +08:00
WuYunlong
f02ae4788b Update dict resize policy when rdb child process gets killed. 2019-01-21 17:33:18 +08:00
antirez
e018c7c83f ACL: configure the master connection without user. 2019-01-17 18:33:36 +01:00
antirez
89b7b6a917 RESP3: addReplyString() -> addReplyProto().
The function naming was totally nuts. Let's fix it as we break PRs
anyway with RESP3 refactoring and changes.
2019-01-09 17:00:30 +01:00
antirez
166cdd3583 RESP3: Use new deferred len API in replication.c. 2019-01-09 17:00:29 +01:00
antirez
2e14b6bbca When replica kills a pending RDB save during SYNC, log it.
This logs what happens in the context of the fix in PR #5367.
2018-10-31 11:47:10 +01:00
Salvatore Sanfilippo
2a5ac7ad88 Merge pull request #5367 from nUl1/fullresync-stopbgsave
Prevent RDB autosave from overwriting full resync results
2018-10-31 11:42:04 +01:00
antirez
a6e595c778 Fix typo in replicationCron() comment. 2018-10-05 18:30:45 +02:00
Andrey Bugaevskiy
3249c4dc79 Move child termination to readSyncBulkPayload 2018-09-27 19:38:58 +03:00
Andrey Bugaevskiy
8b8c4ab5aa Prevent RDB autosave from overwriting full resync results
During the full database resync we may still have unsaved changes
on the receiving side. This causes a race condition between
synced data rename/load and the rename of rdbSave tempfile.
2018-09-19 19:58:39 +03:00
antirez
d6d9b778ac Slave removal: replication.c logs fixed. 2018-09-11 15:32:28 +02:00
antirez
bdca5b9be5 Slave removal: SLAVEOF -> REPLICAOF. SLAVEOF is now an alias. 2018-09-11 15:32:28 +02:00
Oran Agra
c9ce6aa4a8 fix rare replication stream corruption with disk-based replication
The slave sends \n keepalive messages to the master while parsing the rdb,
and later sends REPLCONF ACK once a second. rarely, the master recives both
a linefeed char and a REPLCONF in the same read, \n*3\r\n$8\r\nREPLCONF\r\n...
and it tries to trim two chars (\r\n) from the query buffer,
trimming the '*' from *3\r\n$8\r\nREPLCONF\r\n...

then the master tries to process a command starting with '3' and replies to
the slave a bunch of -ERR and one +OK.
although the slave silently ignores these (prints a log message), this corrupts
the replication offset at the slave since the slave increases the replication
offset, and the master did not.

other than the fix in processInlineBuffer, i did several other improvments
while hunting this very rare bug.

- when redis replies with "unknown command" it includes a portion of the
  arguments, not just the command name. so it would be easier to understand
  what was recived, in my case, on the slave side,  it was -ERR, but
  the "arguments" were the interesting part (containing info on the error).
- about a year ago i added code in addReplyErrorLength to print the error to
  the log in case of a reply to master (since this string isn't actually
  trasmitted to the master), now changed that block to print a similar log
  message to indicate an error being sent from the master to the slave.
  note that the slave is marked as CLIENT_SLAVE only after PSYNC was received,
  so this will not cause any harm for REPLCONF, and will only indicate problems
  that are gonna corrupt the replication stream anyway.
- two places were c->reply was emptied, and i wanted to reset sentlen
  this is a precaution (i did not actually see such a problem), since a
  non-zero sentlen will cause corruption to be transmitted on the socket.
2018-07-17 12:51:49 +03:00
Oran Agra
28cf208bf3 slave buffers were wasteful and incorrectly counted causing eviction
A) slave buffers didn't count internal fragmentation and sds unused space,
   this caused them to induce eviction although we didn't mean for it.

B) slave buffers were consuming about twice the memory of what they actually needed.
- this was mainly due to sdsMakeRoomFor growing to twice as much as needed each time
  but networking.c not storing more than 16k (partially fixed recently in 237a38737).
- besides it wasn't able to store half of the new string into one buffer and the
  other half into the next (so the above mentioned fix helped mainly for small items).
- lastly, the sds buffers had up to 30% internal fragmentation that was wasted,
  consumed but not used.

C) inefficient performance due to starting from a small string and reallocing many times.

what i changed:
- creating dedicated buffers for reply list, counting their size with zmalloc_size
- when creating a new reply node from, preallocate it to at least 16k.
- when appending a new reply to the buffer, first fill all the unused space of the
  previous node before starting a new one.

other changes:
- expose mem_not_counted_for_evict info field for the benefit of the test suite
- add a test to make sure slave buffers are counted correctly and that they don't cause eviction
2018-07-16 16:43:42 +03:00
Jack Drogon
bae1d36e5d Fix typo 2018-07-03 18:19:46 +02:00
antirez
574b79fea6 Set repl_down_since to zero on state change.
PR #5081 fixes an "interesting" bug about Redis Cluster failover but in
general about the updating of repl_down_since, that is used in order to
count the time a slave was left disconnected from its master.

While the fix provided resolves the specific issue, in general the
validity of repl_down_since is limited to states that are different
than the state CONNECTED, and the disconnected time is set when the
state is DISCONNECTED. However from CONNECTED to other states, the state
machine must always go to DISCONNECTED first. So it makes sense to set
the field to zero (since it is meaningless in that context) when the
state is set to CONNECTED.
2018-07-03 12:42:14 +02:00
WuYunlong
59dd29bb35 fix server.repl_down_since resetting, so that slaves could failover
automatically as expected.
2018-06-30 09:39:08 +08:00
antirez
f2dfdae6a6 Fix type of argslen in sendSynchronousCommand().
Related to #5037.
2018-06-26 14:38:35 +02:00
antirez
bcbb9b71f8 Remove black space. 2018-06-26 14:37:22 +02:00
Madelyn Olson
755109f784 Addressed comments 2018-06-26 00:57:35 +00:00
Madelyn Olson
317e235b8d Fixed replication authentication with whitespace in password 2018-06-26 00:48:37 +00:00
shenlongxing
a35bf3f130 Fix write() errno error 2018-06-06 13:06:42 +02:00
Wander Hillen
58735ed74e Fix typos, add some periods 2018-03-16 09:59:14 +01:00
Salvatore Sanfilippo
ebf7964c62 Merge pull request #4269 from jianqingdu/unstable
fix not call va_end() when syncWrite() failed
2018-01-24 10:55:25 +01:00