Posted on 2014-4-9 13:39:18
Strange, I couldn't find it any more, so I'm posting it again. It was written back in 2010, though, so if anything in it is right or wrong I hope everyone will discuss it...
On iSCSI multipath options, and some questions and explanations about multipathing with the iSCSI initiator under VMware ESX.
On the multipath access protocols for iSCSI:
There are currently two popular approaches: MPIO and MC/S.
Below is a fairly detailed article comparing MPIO and MC/S.
Pay particular attention to the parts in red.
MC/S vs MPIO
MC/S (Multiple Connections per Session) is a feature of the iSCSI protocol which allows several connections to be combined inside a single session for performance and failover purposes. Let's consider what practical value this feature has compared with OS-level multipath (MPIO) and try to answer why, despite the many years since the iSCSI protocol came into active use, none of the open-source OSes support it or plan to implement it in the future.
MC/S is done at the iSCSI level, while MPIO is done at a higher level. Hence, all the MPIO infrastructure is shared among all SCSI transports, including Fibre Channel, SAS, etc.
MC/S was designed at a time when most OSes didn't have a standard OS-level multipath. Instead, each vendor had its own implementation, which created huge interoperability problems. So one of the goals of MC/S was to address this issue and standardize the multipath area. But nowadays almost all OSes have OS-level multipath implemented using standard SCSI facilities, so this purpose of MC/S is no longer valid.
It is usually claimed that MC/S has the following two advantages over MPIO:
1. Faster failover recovery.
2. Better performance.
Let's look at how true those claims are.
Failover recovery time
Let's consider a single target exporting a single device over two links.
For MC/S, failover recovery is pretty simple: all outstanding SCSI commands are reassigned to another connection. No other actions are necessary, because the session (i.e. the I_T nexus) remains the same. Consequently, all reservations and other SCSI state, as well as other initiators connected to the device, remain unaffected.
For MPIO, failover recovery is much more complicated, because it involves the transfer of all outstanding commands and SCSI state from one I_T nexus to another. The first thing the initiator will do is abort all outstanding commands on the faulted I_T nexus. There are two approaches for that: the CLEAR TASK SET and LUN RESET task management functions.
The CLEAR TASK SET function aborts all commands on the device. Unfortunately, it has limitations: it isn't always supported by the device, and having a single task set shared among initiators isn't always appropriate for the application.
The LUN RESET function resets the device.
Both the CLEAR TASK SET and LUN RESET functions can harm other initiators to some degree, because all commands from all initiators, not only from the one doing the failover recovery, will be aborted. Additionally, LUN RESET resets all SCSI settings for all connected initiators to their initial state and, if the device held a reservation from any initiator, it will be cleared.
But the harm is minimal:
* With the TAS bit set in the Control mode page, all the aborted commands will be returned to all affected initiators with TASK ABORTED status, so they can simply retry them immediately. For CLEAR TASK SET, if TAS isn't set, all affected initiators will be notified by a COMMANDS CLEARED BY ANOTHER INITIATOR Unit Attention, so they too can immediately retry all outstanding commands.
* In the case of a device reset, the affected initiators will be notified via the corresponding Unit Attention about the device reset, i.e. about the reset of all SCSI settings to their initial state. Then they can take the necessary recovery actions. Usually no recovery actions are needed, except for the reservation holder, whose reservation was cleared; for it, recovery might not be trivial. But Persistent Reservations solve this issue, because they are not cleared by a device reset.
Thus, with Persistent Reservations, or when using the CLEAR TASK SET function, the additional failover recovery time that MPIO has compared to MC/S is the time to wait for the reset or command aborts to finish plus the time to retry all the aborted commands. On a properly configured system it should be less than a few seconds, which is quite acceptable in practice. If the Linux storage stack were improved to allow aborting all commands submitted to it (currently it is only possible to wait for their completion), then the time to abort all the commands could be reduced to a fraction of a second.
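(An aside for Linux readers: the "properly configured system" above largely comes down to dm-multipath timeout and failback settings. A minimal illustrative /etc/multipath.conf sketch, with example values only, not recommendations, might look like this:)
defaults {
    # check path health every 5 seconds
    polling_interval 5
    # move back to the preferred path group as soon as it recovers
    failback immediate
    # keep queueing I/O for up to 12 polling intervals before failing it
    no_path_retry 12
}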
Performance
First of all, neither MC/S nor MPIO can improve performance if there is only one SCSI command sent to the target at a time, for instance in the case of tape backup and restore. Both MC/S and MPIO work at the command level, so they can't split the data transfer for a single command over several links. Only bonding (also known as NIC teaming or link aggregation) can improve performance in this case, although with its own limitations, because it works at the link level.
MC/S over several links preserves command execution order, i.e. commands are executed in the same order in which they were submitted. MPIO can't preserve this order, because it can't see which command was submitted earlier on which link. Delays in link processing can change the order in which the target receives the commands.
Since initiators usually send commands in the order that is optimal for performance, reordering can hurt performance somewhat. But this can only happen with a naive target implementation that can't recover the optimal command execution order. The current Linux target code is not naive and is quite good in this area. See, for instance, the section "SEQUENTIAL ACCESS OVER MPIO" in those measurements. Don't look at the absolute numbers; look at the percentage of performance improvement from using the second link. The result is equivalent to 200 MB/s over two 1 Gbps links, which is close to the possible maximum.
If free command reordering is forbidden for a device, either by use of the ORDERED task attribute or because the Queue Algorithm Modifier in the Control mode page is set to 0, then MPIO has to maintain command order by sending commands over only a single link. But in practice this case is really rare; 99.(9)% of OSes and applications allow free command reordering, and it is enabled by default.
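(Another aside: on Linux you can check whether a device forbids free reordering by inspecting the Control mode page, which contains the queue algorithm modifier field, for example with the sdparm utility; the device name below is only a placeholder:)
# print the Control mode page of a SCSI disk; the queue algorithm modifier field is part of this page
sdparm --page=co /dev/sdb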
On the other hand, strictly preserving command order as MC/S does has a downside as well. It can lead to a so-called "command ordering bottleneck", where newer commands have to wait until one or more older commands are executed, even though it would be better for performance to reorder them. As a result, MPIO sometimes performs better than MC/S, especially in setups where the maximum IOPS number matters. See, for instance, here.
When MC/S is better than MPIO
There are marginal cases where MPIO can't be used or won't provide any benefit, but MC/S can be successful:
1. When strict commands order is required.
2. When aborted commands can't be retried.
For disks, both of these are always false. However, for some tape drives and backup applications one or both can be true. But in practice:
* There are no known tape drives or backup applications that can use multiple outstanding commands at a time. All of them support and use only a single outstanding command at a time. MC/S can't increase performance for them; only bonding can. So in this case there is no difference between MC/S and MPIO.
* The inability to retry commands is a limitation of legacy tape drives, which support only implicit-address commands, rather than a limitation of MPIO. Modern tape drives and backup applications can use explicit-address commands, which can be aborted and then retried; hence they are compatible with MPIO.
Conclusion
Thus:
1. The cost of developing MC/S is high, but its benefits are marginal and, with future MPIO improvements, can be made negligible.
2. MPIO allows the existing infrastructure to be used for all transports, not only iSCSI.
3. All transports can benefit from improvements in MPIO.
4. With MPIO there is no need to create multiple layers doing very similar functionality.
5. MPIO doesn't have the command ordering bottleneck that MC/S has.
Simply put, MC/S is done at the wrong level. No surprise, then, that no open-source OS supports it or plans to implement it. Moreover, back in 2005 there was an attempt to add MC/S to Linux, and it was rejected. See here and here for more details.
If in the future the SCSI standards gain the ability to group several I_T nexuses, with the ability to reassign commands between them as well as to preserve command order among them, the above minor advantages of MC/S over MPIO will disappear and, hence, all investment in it will be wasted.
Below is a Chinese article comparing the iSCSI MPIO and MC/S mechanisms.
A closer look at iSCSI's two multipath access mechanisms
Author: 存储在线
http://www.ccw.com.cn 2009-01-16 13:17:17
After several years of development, iSCSI has become synonymous with IP SAN and has greatly accelerated the adoption of storage area networks (SANs). Compared with FC SAN, however, although iSCSI has the advantage of low deployment cost, it is also considered to have many shortcomings, including vulnerability to attack, low usable bandwidth, and the lack of highly available redundant access mechanisms.
In fact, given a suitable environment, iSCSI can implement two multipath access mechanisms, Multi-Path Input/Output (MPIO) and Multiple Connections per Session (MC/S), enabling load balancing, failover and bandwidth aggregation, and providing a more reliable storage network environment.
iSCSI's two multipath access mechanisms
Both MPIO and MC/S use multiple physical access channels to build logical channels between the server (the iSCSI initiator side) and the storage device (the iSCSI target side). By rotating access across them, they prevent the failure of a single physical channel from interrupting access, and they balance the transfer load across multiple physical channels instead of concentrating it on a single one. The two mechanisms nevertheless differ:
Multipath I/O: MPIO
MPIO allows one iSCSI initiator to connect to the same iSCSI target device through multiple sessions, so that multiple NICs or iSCSI HBAs can be used for load balancing and failover; it is also known as Multiple Sessions per Initiator.
Multiple connections: MC/S
MC/S allows multiple TCP/IP connections to be established between the iSCSI initiator and the iSCSI target within a single session, which likewise lets users employ multiple NICs or iSCSI HBAs for load balancing and failover.
In short, MPIO operates higher in the network stack (at the SCSI command layer above the iSCSI layer), and its load balancing across multiple access paths is applied per individual logical drive (LUN). MC/S, by contrast, is a method defined in the iSCSI RFC that operates at the iSCSI layer and has better error-recovery capabilities (Error Recovery Levels); in addition, unlike MPIO, MC/S load balancing applies to all logical drives at the same time.
Three ways to implement multipath access
iSCSI runs over IP and Ethernet, so in theory multipathing can be implemented directly at the NIC level: using port trunking/teaming/link aggregation, the host's NICs are bundled under a single IP address and connected to the iSCSI storage device, and together with the corresponding settings on the storage device's ports this provides physical multipath connectivity. The problem is that not every NIC supports this approach.
In addition, SAN path-management software from some storage vendors, such as EMC PowerPath, HDS Hitachi Dynamic Link Manager, NetApp SnapDrive and Infortrend EonPath, can also help users build an iSCSI multipath environment, but such software usually supports only specific vendors' and models' iSCSI devices.
We can, however, skip the NIC layer and do without path-management software, building multipath access directly with the iSCSI initiator software and the iSCSI device. Whether MPIO or MC/S is built this way, certain conditions must be met:
(1) The iSCSI initiator side must have multiple NICs or network ports connected to the target side.
(2) The iSCSI initiator software must support MPIO or MC/S.
(3) The iSCSI target device must support MPIO or MC/S.
The first condition is the most basic: if the host does not have two or more network ports (or two or more NICs/iSCSI HBAs) available, there is no multipath access to speak of. Most servers today, however, have at least two built-in GbE ports, so this is usually not much of a problem.
The second condition depends on the environment. Among the iSCSI initiator software provided by the major OS vendors, Windows currently has the most complete support: Microsoft's iSCSI Initiator supports MPIO and MC/S from version 2.06 onward; Sun has the OpenSolaris MPxIO framework to support MPIO in Solaris environments; and Linux also supports MPIO.
As for the third condition, MPIO is currently far more widespread than MC/S: most iSCSI storage devices support MPIO, as long as they allow multiple sessions from the same IQN.
Far fewer products support MC/S, and among software iSCSI targets not many support it either; Microsoft's iSCSI Target and Sun's Solaris iSCSI target, for example, do not.
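To make the Linux MPIO case above concrete, a rough sketch with open-iscsi and dm-multipath could look like this (the IP addresses and IQN are made-up placeholders, and multipathd must be running):
# log in to the same target once per storage-facing interface (addresses and IQN are placeholders)
iscsiadm -m discovery -t sendtargets -p 192.168.10.10
iscsiadm -m node -T iqn.2010-01.com.example:tgt1 -p 192.168.10.10 --login
iscsiadm -m node -T iqn.2010-01.com.example:tgt1 -p 192.168.20.10 --login
# each session exposes the same LUN as a separate sd device; dm-multipath merges them into one multipath device
multipath -ll
Each login creates a separate session from the same IQN, which is exactly the "multiple sessions per initiator" behaviour that MPIO relies on.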
My own understanding:
MPIO: Multi-Path Input/Output. MPIO allows one iSCSI initiator to connect to the same iSCSI target device through multiple sessions, so that multiple NICs or iSCSI HBAs can be used for load balancing and failover; it is also known as Multiple Sessions per Initiator.
Simply put, MPIO lets your iSCSI initiator and iSCSI target connect through multiple sessions, and the performance gain comes from using those multiple sessions to carry the data transfers.
The following is Chad's explanation of MPIO from his blog:
Making this visual… in the diagram above, while in iSCSI generally you can have multiple “purple pipes” each with one or more “orange pipes” to any iSCSI target, and use MPIO with multiple active paths to drive I/O down both paths.
MC/S: Multiple Connections per Session. In contrast to MPIO, MC/S allows multiple connections within one session; in other words, a single session can be carried by multiple TCP connections.
Quoting Chad's blog again:
You can also have multiple “orange pipes” (the iSCSI connections) in each “purple pipe” (single iSCSI session) - Multiple Connections per Session (which effectively multipaths below the MPIO stack), shown in the diagram below.
Most of today's iSCSI devices support MPIO, but devices that support MC/S are relatively rare.
Openfiler, for example, does not support MC/S. The Chinese article quoted above says that Sun's Solaris does not support MC/S, but recent Solaris releases apparently let you change the iSCSI target's maximum connection count with a command, which would make MC/S possible.
Starting with vSphere 4.0, ESX also supports iSCSI MPIO, but it still does not support MC/S.
Quoting Chad directly:
Now, this behavior will be changing in the next major VMware release. Among other improvements, the iSCSI initiator will be able to use multiple iSCSI sessions (hence multiple TCP connections). Looking at our diagram, this corresponds with "multiple purple pipes" for a single target. It won't support MC/S or "multiple orange pipes per each purple pipe" – but in general this is not a big deal (large scale use of MC/S has shown a marginally higher efficiency than MPIO at very high end 10GbE configurations).
This also explains why many devices don't support MC/S.
On iSCSI multipathing under ESX 3.5:
First, to be clear: under ESX 3.5 there is no way to increase the iSCSI initiator's bandwidth through multipath access. ESX 3.x supports neither MPIO nor MC/S.
Quoting Chad:
But in the ESX software iSCSI initiator case, you can only have one orange "pipe" for each purple pipe for every target (green boxes marked 2), and only one "purple pipe" for every iSCSI target. The end of the "purple pipe" is the iSCSI initiator port – and these are the "on ramps" for MPIO
So, no matter what MPIO setup you have in ESX, it doesn't matter how many paths show up in the storage multipathing GUI for multipathing to a single iSCSI Target, because there's only one iSCSI initiator port, only one TCP port per iSCSI target. The alternate path to the target gets established after the primary active path is unreachable. This is shown in the diagram below.
VMware can’t be accused of being unclear about this. Directly in the iSCSI SAN Configuration Guide: “ESX Server‐based iSCSI initiators establish only one connection to each target. This means storage systems with a single target containing multiple LUNs have all LUN traffic on that one connection”, but in general, in my experience, this is relatively unknown.
This usually means that customers find that for a single iSCSI target (and however many LUNs that may be behind that target – 1 or more), they can’t drive more than 120-160MBps.
As discussed earlier, the ESX 3.x software initiator really only works on a single TCP connection for each target – so all traffic to a single iSCSI Target will use a single logical interface. Without extra design measures, it does limit the amount of IO available to each iSCSI target to roughly 120 – 160 MBs of read and write access.
This design does not limit the total amount of I/O bandwidth available to an ESX host configured with multiple GbE links for iSCSI traffic (or more generally VMKernel traffic) connecting to multiple datastores across multiple iSCSI targets, but does for a single iSCSI target without taking extra steps.
Roughly, this means that in an ESX 3.x environment you cannot increase the connection bandwidth to any single iSCSI target by configuring the iSCSI initiator, because ESX 3.x allows only one active TCP connection to a given iSCSI target.
So, generally speaking, the connection bandwidth from ESX 3.x to one iSCSI target tops out at roughly 160 MB/s.
That does not mean, however, that the total bandwidth from your ESX host to the iSCSI storage device is capped at 160 MB/s.
You can raise the total bandwidth by connecting to different iSCSI targets.
In other words, the speed to any one iSCSI target is limited to about 160 MB/s, but you can connect to several different iSCSI targets to spread the load.
Question 1: How do I configure MPIO (in this case, VMware NMP) and my iSCSI targets and LUNs to get the most optimal use of my network infrastructure? How do I scale that up?
Answer 1: Keep it simple. Use the ESX iSCSI software initiator. Use multiple iSCSI targets. Use MPIO at the ESX layer. Add Ethernet links and iSCSI targets to increase overall throughput. Set your expectation for no more than ~160MBps for a single iSCSI target.
Remember an iSCSI session is from initiator to target. If you use multiple iSCSI targets, with multiple IP addresses, you will use all the available links in aggregate, and the storage traffic in total will load balance relatively well. But any individual target will be limited to a maximum of a single GbE connection's worth of bandwidth.
Remember that this also applies to all the LUNs behind that target. So, consider that as you distribute the LUNs appropriately among those targets.
The ESX initiator uses the same core method to get a list of targets from any iSCSI array (static configuration or dynamic discovery using the iSCSI SendTargets request) and then a list of LUNs behind that target (SCSI REPORT LUNS command).
Question 2: If I have a single LUN that needs really high bandwidth – more than 160MBps and I can’t wait for the next major ESX version, how do I do that?
Answer 2: Use an iSCSI software initiator in the guest along with either MPIO or MC/S
This model allows the Guest Operating Systems to be “directly” on the SAN and to manage their own LUNs. Assign multiple vNICs to the VM, and map those to different pNICs. Many of the software initiators in this space are very robust (like the Microsoft iSCSI initiator). They provide their guest-based multipathing and load-balancing via MPIO (or MC/S) based on the number of NICs allocated to the VM.
As we worked on this post, all the vendors involved agreed – we’re surprised that this mechanism isn't more popular. People have been doing it for a long time, and it works, even through VMotion operations where some packets are lost (TCP retransmits them – iSCSI is ok with occasional loss, but constant losses slow TCP down – something to look at if you’re seeing poor iSCSI throughput).
It has a big downside, though – you need to manually configure the storage inside each guest, which doesn’t scale particularly well from a configuration standpoint – so for most customers they stick with the “keep it simple” method in Answer 1, and selectively use this for LUNs needing high throughput.
There are other bonuses too:
* This also allows host SAN tools to operate seamlessly – on both physical or virtual environments – integration with databases, email systems, backup systems, etc.
* Also has the ability to use a different vSwitch and physical network ports than VMkernel allowing for more iSCSI load distribution and separation of VM data traffic from VM boot traffic.
* Dynamic and automated LUN (i.e. you don’t need to do something in Virtual Center for the guest to use the storage) surfacing to the VM itself (useful in certain database test/dev use cases)
* You can use it for VMs that require a SCSI-3 device (think Windows 2008 cluster quorum disks – though those are not officially supported by VMware even as of VI3.5 update 3)
There are, of course, things that are negative about this approach.
* I suppose "philosophically" there's something a little dirty of "penetrating the virtualizing abstraction layer", and yeah - I get why that philosophy exists. But hey, we're not really philosophers, right? We're IT professionals, and this works well :-)
* It is notable that this option means that SRM is not supported (which depends on LUNs presented to ESX, not to guests)
Question 4: Do I use Link Aggregation and if so, how?
Answer 4: There are some reasons to use Link Aggregation, but increasing a throughput to a single iSCSI target isn’t one of them in ESX 3.x.
What about Link Aggregation – shouldn’t that resolve the issue of not being able to drive more than a single GbE for each iSCSI target? In a word – NO. A TCP connection will have the same IP addresses and MAC addresses for the duration of the connection, and therefore the same hash result. This means that regardless of your link aggregation setup, in ESX 3.x, the network traffic from an ESX host for a single iSCSI target will always follow a single link.
So, why discuss it here? While this post focuses on iSCSI, in some cases, customers are using both NFS and iSCSI datastores. In the NFS datastore case, MPIO mechanisms are not an option, load-balancing and HA is all about Link Aggregation. So in that case, the iSCSI solution needs to work in with concurrently existing Link Aggregation.
Now, Link Aggregation can be used completely as an alternative to MPIO from the iSCSI initiator to the target. That said, it is notably more complex than the MPIO mechanism, requiring more configuration, and isn’t better in any material way.
If you’ve configured Link Aggregation to support NFS datastores, it’s easier to leave the existing Link Aggregation from the ESX host to the switch, and then simply layer on top many iSCSI targets and MPIO (i.e. “just do answer 1 on top of the Link Aggregation”).
To keep this post concise and focused on iSCSI, the multi-vendor team here decided to cut out some of NFS/iSCSI hybrid use case and configuration details, and leave that to a subsequent EMC Celerra/NetApp FAS post.
In closing.....
I would suggest that anyone considering iSCSI with VMware should feel confident that their deployments can provide high performance and high availability. You would be joining many, many customers enjoying the benefits of VMware and advanced storage that leverages Ethernet.
To make your deployment a success, understand the “one link max per iSCSI target” ESX 3.x iSCSI initiator behavior. Set your expectations accordingly, and if you have to, use the guest iSCSI initiator method for LUNs needing higher bandwidth than a single link can provide.
Most of all ensure that you follow the best practices of your storage vendor and VMware.
Below is an exchange in which Chad answers a reader's question; it is quite valuable.
Chad,
I have a Celerra (NS20). When you say to have multiple iscsi targets, does each Target on the Celerra require a separate IP address (currently I have 3 targets all using the one IP address)? I've always wondered if adding more IP addresses to the targets would help with throughput. BTW, the Celerra is connected to the switch via 2 x 1Gb ethernet ports using Etherchannel.
Great article. I'm trying to fix/improve I/O performance at my work and blogs like these are great.
Posted by: David | January 28, 2009 at 08:15 AM
Terrific Post Guys! I've had to explain most of this over and over and over again to my iSCSI clients who didn't have the knowledge of iSCSI, LACP, etc. and complained about their slow LUNs (thus perpetuating the "iSCSI is slow myth"), and after about a day of reconfiguration, dramatically increased their performance with the same SAN equipment.
I do have one point of contention, which is the discussion of there being no quantifiable benefit of LACP over just using MPIO. Of course you stated in the article that MPIO can take up to 60 seconds to fail over. However, in a properly configured LACP environment, failover is much quicker (on the order of 30ms-5sec) and transparent to VMWare (because it doesn't have to MPIO to another IP address), so reconfiguring guests with higher iSCSI timers is unnecessary. Of course, you can then have multiple LACPs with multiple IP addresses and load balancing and really ramp it up. So in that sense, LACP does have a quantifiable advantage over MPIO at this time and is a relatively KISS-principle-compliant solution assuming your switch and SAN support it in a painless way, although if the MPIO timers were tweakable I suppose you could possibly get the same result.
Another important point is that VMWare actually doesn't support LACP, which is the negotiation protocol for creating aggregate links. Instead, it only supports 802.3ad Static mode. Hopefully we'll get LACP support in ESX4, as that will help with both the setup learning curve (removing misconfigured ports from the trunk) and failover time.
My currently favorite config-du-jour is either the stackable Cisco 3750's or the Stackable Dell Powerconnect 6248's (which is a surprisingly good high performance, feature laden, and cheap L3 switch believe it or not) and 802.3AD cross-stack trunks from both the SAN targets (assuming it supports it) and the VMWare infrastructure.
Thanks for the excellent article guys! This definitely goes into my "read this first" pile for clients.
-Justin Grote
Senior Systems Engineer
En Pointe Global Services
thanks for the comments all!
David - Thank you for being an EMC/VMware customer! Hope you're enjoying your Celerra!
Each iSCSI target maps to one or more Network Portals (IP addresses). Now, unless you have more than one iSCSI target, all traffic will follow one network link from ESX - period (for the reasons discussed above). BTW - in the next VMware release, you can have multiple iSCSI sessions for a single target, and there are round-robin multipathing and the more advanced EMC PowerPath for VMware (which integrates into the vmkernel - very cool!)
But, for 3.5, you will see more throughput if you configure differently.
Your NS20 has 4 front end GbE ports, so you have a couple of simple easy choices that will dramatically improve your performance.
It depends on how you have configured your ESX server - are you using link aggregation from the ESX host to the switch, or multiple vSwitches? (this is something we need to add to the post) Let me know, and I'll respond...
UPDATE (1/31/09). David, I haven't heard from you, so will give the answer here for all, and also reach out to you directly.
Long and short - with 1 iSCSI target configured, you will never get more than 1 GbE connection's worth of throughput. You need to configure multiple iSCSI targets.
Now, the Celerra is really flexible about how to configure an iSCSI target. You can have many of them, and each of them can have many network portals (IPs). BUT, since the ESX iSCSI software initiator cannot do multiple sessions per target, or multiple connections per target - in this case, create multiple iSCSI targets - at least as many as you have GbE interfaces used for vmkernel traffic on your ESX cluster. Each needs a separate IP address by definition.
By balancing the LUNs behind the iSCSI targets you will distribute the load.
You have used 2 of the 4 GbE interfaces on your Celerra (there are 4 per datamover, and the NS20 can have two datamovers - the Celerra family as a whole can scale to many datamovers).
SO, your choice is either to plug in the other two, assign IP addresses, and assign iSCSI targets (just use the simple iSCSI target wizard)
OR
The Celerra can have many logical interfaces attached to each device (where a device is either a physical NIC or aggregated/failover logical device). You could alternatively just create another logical IP for the existing 2 linked interfaces, and assign the IP address to that.
Now, you also need to consider how you will loadbalance from the ESX servers to the switch.
You can either:
a) use link aggregation (which will do some loadbalancing since there will be more than one TCP session, since you have more than one iSCSI target) - make sure to set the policy to "IP hash"
b) use the ESX vmkernel TCP/IP routing to load balance - here you have two vSwitches, each with their own VMkernel ports on separate subnets, and then you need to have the iSCSI target IP addresses on separate subnets. This ensures even load balancing.
Let me know if this helps!!!
Posted by: Chad Sakac | January 29, 2009 at 10:26 AM
So basically, under ESX 3.x it is very hard to get an ESX server's iSCSI connection to any single LUN above 160 MB/s.
Under vSphere 4.0, however, things have changed somewhat.
In vSphere 4.0 the ESX iSCSI initiator officially supports MPIO: you can use round robin across two links to increase your access speed to a given iSCSI target.
By default, however, that is not what happens.
Below is an article about iSCSI multipath access under vSphere 4.0.
iSCSI multipathing with esxcli! Exploring the next version of ESX
Posted by Duncan Epping on March 18th, 2009
In the “Multivendor post to iSCSI” article by Chad Sakac and others (NetApp, EMC, Dell, HP, VMware), a new multipathing method for iSCSI on the next version of ESX (vSphere) had already been revealed. Read the full article for in-depth information on how this works in the current version and how it will work in the next version. I guess the following section sums it up:
Now, this behavior will be changing in the next major VMware release. Among other improvements, the iSCSI initiator will be able to use multiple iSCSI sessions (hence multiple TCP connections).
I was wondering how to set this up and it’s actually quite easy. You need to follow the normal guidelines for configuring iSCSI. But instead of binding two nics to one VMkernel you create two(or more) VMkernels with a 1:1 connection to a nic. Make sure that the VMkernels only have 1 active nic. All other nics must be moved down to “Unused Adapters”. Within vCenter it will turn up like this:
After you created your VMkernels and bound them to a specific nic you would need to add them to your Software iSCSI initiator:
esxcli swiscsi nic add -n vmk0 -d vmhba35
esxcli swiscsi nic add -n vmk1 -d vmhba35
esxcli swiscsi nic list -d vmhba35
(this command is only to verify the changes)
If you check the vSphere client you will notice that you’ve got two paths to your iSCSI targets. I made a screenshot of my Test Environment:
And the outcome in ESXTOP(s 2 n), as you can see two VMkernel ports with traffic:
There’s a whole lot more you can do with esxcli by the way, but it’s too much to put into this article. The whole architecture changed and I will dive into that tomorrow.
Actually, these commands are described in detail in the 4.0 iSCSI configuration documentation, and his understanding here is a little off: this configuration is not actually required for iSCSI MPIO. The esxcli commands simply let you get multipathing by binding multiple NICs to the software iSCSI initiator when you don't want to configure multiple vSwitches.
If you are happy to bind the different NICs to different vSwitches instead, this step is not necessary.
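For reference, the multiple-vSwitch alternative looks roughly like this on the ESX service console (the vSwitch, port group, NIC and IP address below are made-up examples):
# create a second vSwitch with its own uplink and VMkernel port (names and IP are placeholders)
esxcfg-vswitch -a vSwitch2
# add a port group for iSCSI traffic
esxcfg-vswitch -A iSCSI2 vSwitch2
# attach a dedicated physical NIC as the uplink
esxcfg-vswitch -L vmnic2 vSwitch2
# create a VMkernel port on that port group
esxcfg-vmknic -a -i 192.168.2.11 -n 255.255.255.0 iSCSI2
Repeat this once per physical path, keeping each VMkernel port on its own subnet, as Chad's answer (b) above also suggests.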
Most importantly, by default, even if you enable round robin across multiple paths to a given iSCSI target in the ESX storage configuration, the actual iSCSI throughput does not really increase; the real speed is still about the same as with a single path.
Quoting a reply from the VMware forums:
I have the same problem as Gamania, but it can't be a switch problem as I am using Peer-2-Peer cabling. I have 3 Peer-2-Peer cables and created 3 VMkernel Ports, with 3 vmnics, 3 vswitches and 3 IP-addresses.
This is what happens when I use fixed Paths.
You clearly see that it uses only 1 of the 1Gb links, which is normal if you take a look at this screenshot:
When set to Round Robin, it turns to this (attached round-robin-esx.jpg) and you would expect that all 3 links will each use 1Gb so I get 3 times the throughput.
However each Gigabit link uses 33%, which sums up to once again 1Gb (round-robin.jpg), is there some setting to push it past that boundary?
I do use IOMeter and when I use the EXACT same setup and add 3 VNics to my Virtual Machine, and I use Microsoft MPIO then it gets to 330MBps just fine...
Greetings and thanks in advance
That reply reflects exactly this problem: even with MPIO and round robin configured, the combined utilization of the several network links is still equivalent to a single link.
Note: later, while preparing for the VMware VCP 4.0 exam, I found in the study material that under vSphere 4.0, in round robin mode only one path is used to transfer data at any given moment.
Let me sort out my thinking here.
Some of my earlier reading of this may have been mistaken because my understanding wasn't deep enough.
VMware's multipathing policies come in three kinds:
MRU: Most Recently Used
Round Robin: rotate through the paths in turn
Fixed: explicit failover from a fixed preferred path
Quoting a passage from the official VMware documentation:
Setting the path selection policy
For each storage device, the ESX host sets the path selection policy according to the claim rules defined in the /etc/vmware/esx.conf file.
By default, VMware supports the following path selection policies. If a third-party PSP is installed on the host, its policies also appear in the list.
Fixed (VMware): When the preferred path to the disk is available, the host always uses it. If the host cannot access the disk through the preferred path, it tries the alternative paths. Fixed is the default policy for active-active storage devices.
Most Recently Used (VMware): The host uses a path to the disk until that path becomes unavailable, at which point it selects one of the alternative paths. The host does not revert to the original path when it becomes available again; there is no preferred-path setting with the MRU policy. MRU is the default policy for active-passive storage devices and is required for them.
Round Robin (VMware): The host uses an automatic path selection algorithm that rotates through all available paths, enabling load balancing across all available physical paths.
Load balancing is the process of spreading server I/O requests across all available host paths, with the goal of optimal performance in terms of throughput (I/O per second, megabytes per second, or response time).
The one thing you have to be clear about here is that no matter which of the three policies you choose, at any given instant data is still being transferred over a single path.
So, compared with the other two policies: although in the VI client under round robin every path in the storage path list shows as active, you need to keep in mind that those paths are not used simultaneously. Round robin uses the different paths in turn to handle I/O requests; your I/O requests are not transmitted over two physical links at the same time.
Here is another person's explanation:
Could you tell what type of storage you are using?
Try setting the RR params like below:
esxcli nmp roundrobin setconfig --device --iops 1 --type iops
Unfortunately there is no UI support for the above. Let us know what you get.
The result of running the command:
Brilliant! This worked like a charm, I get 309.38MBps now. Check screenshot for Network Usage at the iSCSI Target side (Windows 2008 Storage Server with WinTarget iSCSI Target Software).
The storage array was an (experimental) 5 Disk RAID0 of 5 times Intel SSD X25-E (32GB). When I perform a DAS run I get over 600MBps throughput, so I'm pretty sure my bottleneck will be at the network and the ESX4 hypervisor overhead.
I'm researching Software iSCSI inside ESX4 over iSCSI in the Guest OS as they both support MPIO & jumbo frames right now. However Jumbo Frames were possible in ESX3.5 (through cli), I did not find a way to get MPIO working in ESX3.5 update 4...
Can you tell me what was the problem with the first setup, what did the command you described do?
It looks like it sets the I/O Operation Limit to 1 and sets the Limit Type to Iops (as I can tell from the esxcli nmp roundrobin getconfig --device [I] command).
Apparently it switches over to the next path after only 1 I/O operation, but I do not see how this helps with the 1Gb limit?
Thanks in advance...
The explanation:
ESX RR policy does round-robin across multiple active paths based on usage. There is only one active IO path at a given time and it sends down a certain number of IOs before switching to another active path. The default path switching is based on "number of IOs" and the default value for that is 1000. The above command changed that value to 1. Now, it worked great for your config but may yield different results for different storage. Switching paths for every other IO may totally trash the cache on the storage processor side.
In other words: at any given point in time only one I/O path is in use; ESX sends a certain number of I/O requests down it and then switches to another available link. By default it switches links after every 1000 requests, and that command changes this number to 1.
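Putting the pieces together, the full vSphere 4.0 sequence would look roughly like this (naa.xxxx is a placeholder; list your devices first to find the real identifier):
# list NMP devices to find the device identifier (naa.xxxx below is a placeholder)
esxcli nmp device list
# switch the device's path selection policy to round robin
esxcli nmp device setpolicy --device naa.xxxx --psp VMW_PSP_RR
# switch to the next path after every single I/O instead of the default 1000
esxcli nmp roundrobin setconfig --device naa.xxxx --iops 1 --type iops
# verify the round robin settings
esxcli nmp roundrobin getconfig --device naa.xxxx
Whether an IOPS limit of 1 is the right value depends on the array; as the reply above warns, switching paths on every I/O can thrash the storage processor's cache.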