Yeah I've been happy with DarkSky as a user for a while and, when it came time to replace wunderground in BryceBot (they shut down their API), DarkSky was my first choice.
So ARP Networkians, I'm looking for some advice/info on virtualization and storage.
What are your thoughts on building a Ceph cluster vs buying something like a TrueNAS that has a fancy multi-year warranty (next-day replacement parts etc), support, a nice GUI, etc?
The appeal of a TrueNAS is that I'm already familiar with FreeNAS, I like ZFS, someone else can maintain it if I've moved on from the company (or am otherwise not around), and there's a single company providing Enterprise(tm) support (warranty, support line for whoever has to take over, etc). Oh, and it's built and burned-in to high hell before it's shipped to us. Cons: cost.
Ceph is fairly unknown to me and I'm still reading up to understand things, so I wanted to ask ARP (up_the_irons et al), who did their own migration to Ceph and rely on it now:
* What's management like? Is it making API calls directly (from curl), some other CLI commands, or is there a GUI that could be used by "tier 1 helpdesk"?
no experience with ceph, but truenas stuff is really nice - had ~100TB across two chassis w/extensions at work for archival storage
(Lol I got interrupted by a conference call)
not sure if it's worth the cost premium over rolling your own cluster unless you expect hardware failures or something (never had to use the warranty, but the folks @ ixsystems were great during sales and whatnot)
m0unds: Something to keep in mind is that I work 100% remote, so I'm not the one building and installing anything. Having something delivered "ready to go" is a plus.
oh, okay, in that case, completely worth it haha
The price is pretty steep for our budget so I'm looking for alternatives.
(con call at 1h20 now...)
* What about "enterprise support"? Is there some company we can go to and say "hey, help us and replace anything that breaks for the next 3 years"? Or would that just be ourselves/whatever hardware vendor we buy from (if we buy something from HP/Dell/Lenovo)?
* How are capacity and expansion handled? How is capacity planning done? If I want 20TB with some modicum of redundancy, am I building two+ servers with 5x4TB drives (for instance)? Or maybe 4 servers at 10TB+ each? And if I need more storage, just... build another and add it?
* How's resource usage/planning? Is Ceph particularly memory-hungry (as ZFS is known to be)? Is IO CPU-bound, disk-bound, or network-bound?
(To clarify: answers to those last couple of questions can be found online, but I'm interested in ARP's experience specifically.)
btw m0unds, just curious if you recall and are willing to share - what did that cost you, rounded to the nearest $10k? What sort of high-availability did that support? Was it being used for VM disks (requiring super-low latency and relatively high IOPS) or something less performance-sensitive?
Right now my biggest concern is that I'll spend a 5-digit chunk of money on something that doesn't perform up to snuff, effectively repeating the mistakes of past admins here. And I'll be doubly embarrassed if there was a cheaper, better alternative like Ceph that I had just skipped over.
(And naturally, there's time pressure to all this - the sooner it's deployed, the better; yesterday would be ideal.)
* What sort of monitoring exists, either implemented in Ceph, on top of Ceph, or beneath Ceph, for things like node failures or, more concerning to me, drive failures?
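For the 20 TB capacity question above, the arithmetic is roughly: usable space is raw capacity divided by the number of copies, with enough headroom left that the cluster can re-replicate after losing a host. A minimal sketch in Python; the host count, disk size, and 85% fill target are illustrative assumptions, not anything Ceph-specific.

```python
# Back-of-the-envelope capacity planning for "20 TB usable with redundancy".
# Every number below is an assumption for illustration only.

usable_target_tb = 20      # what we actually want to store
replicas = 3               # copies=3, the usual Ceph starting point
fill_target = 0.85         # stay below ~85% full to leave recovery headroom

hosts = 5                  # hypothetical layout: 5 hosts x 5 x 4 TB disks
disks_per_host = 5
disk_size_tb = 4

raw_needed = usable_target_tb * replicas / fill_target
raw_total = hosts * disks_per_host * disk_size_tb

# Can the cluster still hold everything after one whole host is lost?
raw_after_host_loss = (hosts - 1) * disks_per_host * disk_size_tb

print(f"raw needed:               {raw_needed:.1f} TB")
print(f"raw in cluster:           {raw_total} TB")
print(f"raw after losing one host: {raw_after_host_loss} TB "
      f"-> still enough: {raw_after_host_loss >= raw_needed}")
```

With these numbers, 20 TB usable at 3 copies wants roughly 70 TB of raw space, so a 5-host / 100 TB-raw layout still fits even with a host down; expansion is then just adding more disks or another host and letting the cluster rebalance.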
brycec: I want to say it was $14-18k per device, and this was back in 2013 or so; use was purely archival, so very low moment-to-moment demand. just dump stuff, then replicate one unit to the other via network (one onsite, one offsite)
m0unds: Ah cool, thank you!
looking to see whether i might still have the old outlook pst in a backup
here we go, found it
brycec: $15k nearest round number per chassis + $2k per unit for extended software support, 3yr adv replacement, external sas kit
that was w/4TB 7200rpm SAS disks circa 12/2013
Their prices haven't changed terribly much, which is a little disappointing. 44TB raw is about $11k (I don't remember what drives)
The more I'm reading on Ceph, the more excited I am for it as a technology, and wishing I had hardware laying around to build up a testbed.
I'm real curious about getting a few chassis from 45Drives (has "fancy warranty"), or maybe just some SuperMicro stuff and building from scratch
Price isn't a whole lot better (45Drives would be about $8k/chassis w/o drives), but... tempting all the same.
More Ceph questions for ARP: What does ARP's Ceph architecture look like? Are the VM hosts also running monitors and/or OSDs, or are those on isolated/dedicated machines? What is ARP's expansion plan: replace existing OSDs/disks with larger ones, add a new server full of disks when you reach some watermark, add a new server with a few disks and steadily add disks as capacity requirements grow, or something else?
(Oh, and insert "MDS" wherever applicable in my question, such as "running monitors and/or MDS and/or OSD". Forgot about that.)
wow, prices are still that high? yeesh
lol, "the storinator" @ 45drives
brycec: storage hosts and vm machines are separate. we did the migration ourselves.
main thing about ceph is you want at least 4 servers as a starting point really, as you want 3-way replication, and you want to be able to have one host go down
reads normally come from one of the 3-way replicas rather than being split across them, meaning the read performance of your individual drives has quite a big correlation with your read speeds for data. that said, with readahead that can change a bit, and if you do parallel requests to different non-4MB areas etc.
i did a test bed myself first, and lots of reading..
how's the cluster connected to vms? gbe?
infiniband
cool
are the spinning disks like 15krpm sas or something?
we've got a mix of storage
that is one of the hardest things to decide when starting out: what are your performance needs, and how wide do you want to go with what disks to achieve that
one cool thing about ceph vs local storage is that it's much harder for single clients to impact other users, whereas on local storage some nasty neighbours can really impact you.
right, contending for local disk i/o
also you can move/reboot single servers, so like when we moved our cluster we moved it one server at a time..
io perf is really good on both ssd and spinning disk w/the "thunder" vms - that's why i was curious
it's good, not great
i mean it parallelises well
linux does some readahead
at least the spinning disk pool is also quite wide
like i think that's probably one of the things people underestimate the need of for ceph when coming from traditional storage... you don't really want to have 4 storage servers, you want to have 10
so it's better to have 10x10-disk hosts than 4x25-disk hosts
my test setup was 4 nodes
"also you can move/reboot single servers" hence having 4+ (with copies=3).
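On the testbed idea above: besides the CLI, Ceph ships Python bindings (the rados module, typically packaged as python3-rados), so a small test cluster can be poked at programmatically. A minimal sketch, assuming a standard /etc/ceph/ceph.conf and an admin keyring on the machine running it:

```python
import rados

# Connect using the local ceph.conf and the default admin client.
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()

# Cluster-wide usage (reported in KiB) plus object count.
stats = cluster.get_cluster_stats()
used_tib = stats['kb_used'] / 1024**3
total_tib = stats['kb'] / 1024**3
print(f"raw used: {used_tib:.2f} TiB of {total_tib:.2f} TiB "
      f"({stats['num_objects']} objects)")

# Pools visible to this client.
for pool in cluster.list_pools():
    print("pool:", pool)

cluster.shutdown()
```

This is the same information `ceph df` and `ceph status` print from the CLI, just reached through the library instead.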
If you only had 3 servers and copies=3, it would need to rebalance once the third came back, yeah?
mercutio: What's management like for having many (almost?) identically-configured OSD hosts? Like, what kind of maintenance is there to do beyond applying OS updates, anything? Or does Ceph just keep on ticking for the most part?
well you can set minimum copies to 2
if you want to test with a couple of hosts you can set minimum copies to 1 even
and then it can still write
if you don't have enough copies to guarantee that, then it'll block all writes
umm, maintenance isn't that bad. if you want to go in depth we probably should do a consult.
Oh right, ARP has a side-hustle lol. Definitely don't mean to impinge on that, and not looking to get too in-depth. It (maintenance) was just something I hadn't seen much mention of in my reading so far. And of course as a customer I'm always interested in the underlying infrastructure.
well i suppose the biggest advantage to maintenance is that you can fail disks and then replace them when you like.
so usually on local storage the disk will fail at some inopportune time and then you need to go swap it out.
(i mean, when is an opportune time for a disk to fail?)
so as long as you have extra disks in the cluster (which you should) then you can replace them whenever
What happens when a disk fails, does the OSD process just crash or do you have to stop it manually?
it can fail it out or not. occasionally a disk partially fails and needs to be forced out
but at least no dc visit :)
Ceph certainly seems much more flexible than a RAID or ZFS solution in that regard. Lose X disks with those and you're just hoping you don't lose any more (barring hot-spares, which are basically wasted space). With Ceph, you can lose as many disks as you like so long as you still have the required $capacity * $copies.
yeah if using zfs you can always have hot spares. that's what we have in germany.
if you're using servers that can take a lot of disks then hot spares aren't as bad either
But those hot spares are literally wasting space in the chassis. I'd rather put it to work (in a still-resilient manner), not to mention I'd know if it were a dud much sooner than if I have to wait until the spare's needed and it craps out then.
yeh, accounting for 3x makes it a bit harder to justify.
yeah well in germany it's good because it's harder to replace the disks.
zfs isn't a crazy way to go for a smaller setup
Also, I'm in love with the expandability of Ceph. ZFS has hoops to jump through (adding additional vdevs of matching or greater capacity, etc) which means detailed planning AND expense right now. Ceph seems like you can just throw disks/servers at it whenever you have spare capital.
so like if the idea of having 4 servers even seems extreme then zfs may be better :)
yeah you don't have to match sizes
Ooh that too
Nah, 4 servers doesn't sound extreme, though I'm strongly considering just 2 or 3 to start in order to ease the budget blow.
i'd say it'd be better to go second hand if budget is tight
2 or 3 isn't really enough to go with ceph
Yeah, I'm writing an email now to get a feel from the people actually driving the decision/holding the purse strings on that.
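The "minimum copies" knob discussed above corresponds to a pool's size / min_size settings. A hedged sketch of setting them with the standard CLI, wrapped in Python only to keep one language across these examples; the pool name is hypothetical:

```python
import subprocess

def ceph(*args):
    """Run a ceph CLI command and return its stdout."""
    return subprocess.run(["ceph", *args], check=True,
                          capture_output=True, text=True).stdout

pool = "testpool"  # hypothetical pool name

# Keep 3 copies of every object...
ceph("osd", "pool", "set", pool, "size", "3")
# ...but keep accepting I/O as long as 2 copies are online.
ceph("osd", "pool", "set", pool, "min_size", "2")

# If fewer than min_size copies are available, the pool blocks writes
# instead of risking data loss -- the behaviour described above.
print(ceph("osd", "pool", "get", pool, "size"), end="")
print(ceph("osd", "pool", "get", pool, "min_size"), end="")
```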
https://www.instagram.com/p/BmuOhtcF9HE/
Instagram: "Some of our gear in Frankfurt, Germany" by arpnetworks
Ceph definitely seems better suited to the cheap off-the-shelf (and eBay) approach / I can get away with stuff that doesn't have a "fancy warranty" since there's more hardware with redundancy from the get-go. (If 1 of 10 cheap Dell eBay servers dies, so what, buy a replacement and install it when there's time.)
so kzt01/kzt02 each take 25 disks
Are they VM hosts? They look like they're also VM hosts (based on the naming scheme)
yeah they're our zfs ones which are local storage
Ahhh so Frankfurt != Ceph
it's harder to justify building a ceph cluster in germany, yeah
Makes sense
it's got ssd zil though
we haven't had any disk performance issues in germany either
having lots of disks helps that
hah
but yeah, so like for instance you could do 8x25-disk hosts and then find you have 200 disks
those are 2.5" though
VM is different from archival storage though
4tb disks is too big for VM, you're going to crash and burn.....
you get a lot more read/write load with virtual machine hosting compared to backup/archival space or such
Meanwhile in the states https://www.instagram.com/p/BwoAGuLAvDe/
I'm gathering SCT* are VM hosts, and the couple of chassis full of 2.5" disks are Ceph storage?
Instagram: "More migrated gear" by arpnetworks
yeah, sounds about right
there's no 2.5" disks in that picture
What are the HP machines below, unlabeled, do you know?
think they're hp g8s
I meant what are they used for lol
oh
VM
you can half read it on the side
zoom in and you can read kct13, kct12, kct11, kct10 from top to bottom
Well at least I'm pretty sure this is all Ceph given the tags, lol https://www.instagram.com/p/BwhyfBHAJUd/
oh they are 2.5"
Instagram: "More in progress Ceph migration, then finally complete!" by arpnetworks
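On "4tb disks is too big for VM" above: the intuition is IOPS per terabyte. A 7200 rpm spindle delivers roughly the same random-I/O budget regardless of its capacity, so bigger disks mean fewer spindles (and less aggregate I/O) for the same raw space. A quick sketch; the ~80 IOPS-per-spindle figure and the 200 TB target are rough assumptions, not measured numbers:

```python
# Compare spindle count and aggregate random IOPS for the same raw capacity
# built from different disk sizes. All figures are rough assumptions.

iops_per_spindle = 80      # ballpark for a 7200 rpm drive
raw_capacity_tb = 200      # arbitrary round raw-capacity target

for disk_tb in (1, 2, 4):
    spindles = raw_capacity_tb // disk_tb
    total_iops = spindles * iops_per_spindle
    print(f"{disk_tb} TB disks: {spindles:3d} spindles, "
          f"~{total_iops} IOPS aggregate, "
          f"~{iops_per_spindle // disk_tb} IOPS per TB")
```

Same capacity, but the 4 TB build has a quarter of the spindles and a quarter of the random-I/O headroom, which is fine for archival and painful for VM disks.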
yeah that's hard to read too :)
I really like the full view in https://www.instagram.com/p/BwDSI68ASEX/ (thanks up_the_irons)
Instagram: "Some of our gear from outside the cage" by arpnetworks
(I realize that's pre-move, yes)
It gives an idea of the scale of VM hosts and storage hosts
yeah we're more dense for vm hosts than storage
because vm hosts don't need as much space for the disks :)
i think things have been pretty smooth with ceph/kvm
but people only really mention performance or such if things are terrible
And I gather you're gradually moving to those 1U HP G8s, replacing the older hosts that still did local storage.
no kvm hosts in los angeles use local storage for customers
so yeah, that's actually done :)
which made the move easier
I can only imagine. Thanks for letting me bend your ear, mercutio!
i've heard of a few other people doing ceph clusters now
it is growing in popularity
Random idea - is there such a thing as "resellable" Ceph / Ceph as a service? Some way that a provider such as ARP could grant access to the overall cluster and let a customer create pools for themselves? Or does that just defy reason?
i don't think the security model would be good for that in current ceph
you could do iscsi rbd or such
I framed that as an "ARP offering" to keep it mildly on-topic. I'm also thinking about it for $work in a year or two's time -- I maintain the storage and grant my users (departments etc) access to a subset so they can manage their own VM crap. (Less devops, more devs and ops)
the closest i can think of that could work ok is iscsi rbd
but you'd still have to create iscsi devices per volume you want to export
you can actually potentially boot off iscsi rbd
now i want to try that ;)
http://docs.ceph.com/docs/mimic/rbd/iscsi-overview/
K yeah, that doesn't quite fill the hole, but it's not the end of the world. So long as I continue maintaining the Proxmox servers (and the permissions in Proxmox), they can create/destroy as many VMs as they need without directly managing/handling RBD creation. (Proxmox's GUI is pretty much "How much space do you need? Okay, done." and it creates the requisite RBD in the pool etc etc)
well i suppose you don't need to have an iscsi volume per vm
That's pretty cool though!
you could just have a fs on top
and then that'd work with proxmox
Proxmox will let you build a whole zpool on top of iSCSI even (I'm sure there must be a scenario where that makes sense... I just know it's doable)
yeah that sounds like fun
the main thing is getting more than gbe to hosts...
i suppose you could always use ib for it
Yeah, we're planning to standardize on 10GbE RJ45 (why RJ45? I don't know)
i wouldn't go for rj45... sfp+ or ib
ib is qsfp
so is 40gbe
That decision wasn't mine :p
Though I do remember why they did it... because they don't have a 10GbE switch yet and wanted to be able to use it on a GbE switch.
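On Proxmox's "how much space do you need? okay, done" flow mentioned above: under the hood that is essentially just an RBD image being created in a pool, which the official rbd Python bindings expose directly. A minimal sketch; the pool name and image name here are illustrative assumptions, not necessarily what Proxmox would choose:

```python
import rados
import rbd

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('vmpool')   # hypothetical pool holding VM disks

# Create a 50 GiB image; a hypervisor-side storage plugin does roughly this,
# then hands the image to QEMU/KVM as the guest's disk.
rbd.RBD().create(ioctx, 'vm-100-disk-0', 50 * 1024**3)

with rbd.Image(ioctx, 'vm-100-disk-0') as image:
    print("image size:", image.size() // 1024**3, "GiB")

ioctx.close()
cluster.shutdown()
```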
*sigh*
some people have had issues with 10gbe and performance btw
Ugh, now you tell me :p
heh
now is when you mentioned it :)
10gbe to clients isn't so bad
What sort of performance issues? What should I be googling?
but if you're sharing that 10gbe to clients with 10gbe between nodes then it can congest
you could just get an arista 40gbe switch? and 40gbe cards?
We'll see. Money's already been spent...
ah
This project hasn't been terribly well coordinated from the get-go. I found out I was responsible for anything a week after they'd ordered VM hosts and asked me where my part was.
hah
so you have to come up with a storage solution? and they have vm hosts with 10gbe?
Very embarrassing to be legitimately saying "Oh, I didn't know I was supposed to do anything." (unlike in school when I'd use it as a bullshit lie)
yeh, it sucks
it's not that uncommon though, because someone says that person will deal with it, then someone else says that person will, and then everyone assumes that person knows :)
Yep, that's me apparently: responsible for managing all of it, but only responsible for purchasing storage. (And not the one installing any of it, physically)
And yeah, the VM hosts were purchased with Intel 2x10GbE RJ45 NICs
No idea where the switch situation stands... haha
mercutio: Quick question -- when a pool is configured with copies=3, does that mean a block is replicated on 3 different machines, or on 3 different OSDs (so if a machine had 3 disks and the block was only on those disks and the machine died, would the block be lost)?
(I haven't come across an answer to that yet -- or if I did, I didn't realize it, so I figured I'd ask)
depends what your policy is
normally people set it to 3 different machines
but you can set it to 3 different cages or such too
or 3 disks on one machine
brycec: http://docs.ceph.com/docs/mimic/rados/operations/crush-map/
Ah, it's configured in the CRUSH map, got it. Thanks again mercutio
(In my defense, I've glossed over the CRUSH stuff as being a level of configuration I don't need to know/care about yet)
well the default crush map should be fine for a small setup anyway
define "small" :P
small means all servers in one rack, basically
like if you have 3 rooms with servers in them, each with power feeds to them, there is the potential for a whole room to lose power, so you may want to do your redundancy across rooms so that one whole room can go down.
@w kaeg
Ayeyarwady River, Magway District, Magway Region, 10261, Myanmar: Partly Cloudy ☁ 38.2°C (100.8°F), Humidity: 30%, Wind: From the E at 0m/s Gusting to 1m/s. Visibility: 16km -- For more details including the forecast, see https://darksky.net/forecast/20.3142558,94.9191074
wat
it partially matched the name of a city instead of an icao, lol
@w 87114
ABQ, New Mexico, 87114, USA: Overcast ☁ 55.0°F (12.8°C), Humidity: 51%, Wind: From the ENE at 2mph Gusting to 7mph. Visibility: 8mi -- For more details including the forecast, see https://darksky.net/forecast/35.191585586202,-106.67789964486
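Circling back to the copies=3 / CRUSH discussion above: where the three copies land is decided by the pool's CRUSH rule and its failure domain, so "3 different machines" versus "3 different racks (or rooms)" is just a different rule. A hedged sketch using the standard CLI, again wrapped in Python for consistency with the other examples; the rule and pool names are hypothetical, and the rack rule assumes the CRUSH map already has rack buckets defined:

```python
import subprocess

def ceph(*args):
    """Run a ceph CLI command and return its stdout."""
    return subprocess.run(["ceph", *args], check=True,
                          capture_output=True, text=True).stdout

# Rule that spreads replicas across distinct hosts (the usual small-setup policy).
ceph("osd", "crush", "rule", "create-replicated", "rep-by-host", "default", "host")

# Rule that spreads replicas across distinct racks, so a whole rack (or room,
# if the map uses room buckets) can go down without losing every copy.
ceph("osd", "crush", "rule", "create-replicated", "rep-by-rack", "default", "rack")

# Point a pool at whichever rule matches the failure domain you want.
ceph("osd", "pool", "set", "vmpool", "crush_rule", "rep-by-rack")
```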